Qualification: Bachelor’s (B.Tech / B.E. / B.Sc.) in CS, IT, or a related field
Job Description
As an SRE Intern, your primary responsibility will be monitoring the availability and performance of LeadSquared’s production SaaS infrastructure. You will combine proactive observability, meticulous capacity planning, and rapid incident management to ensure top-tier reliability. Working in a fast-paced environment, you will manage AWS-native services like EC2, RDS, Lambda, and Elasticsearch, implementing advanced alerting mechanisms to catch issues before they impact customers.
A key element of this role is end-to-end incident ownership. You will engage in emergency response, document structured Root Cause Analyses (RCAs), and track Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR). The position includes on-call rotations and calls for strong problem-solving skills and the agility to debug live production anomalies efficiently.
Key Responsibilities
Monitor the availability and performance of a multi-tenant SaaS infrastructure hosted entirely on AWS.
Own incident management from emergency response and mitigation to publishing structured RCA reports.
Manage, monitor, and scale critical AWS services including EC2, RDS, ECS, Redis, SQS, and API Gateway.
Operate industry-standard observability and alerting tools such as New Relic, Grafana, Kibana, Site24x7, and PagerDuty.
Gather and analyze performance metrics across the OS, database, API, and backend application layers to eliminate bottlenecks.
Collaborate closely with DevOps, InfoSec, and Engineering squads to implement preventive actions and improve system operability.
Support on-call rotations to ensure immediate response to off-hours production incidents.
Skills & Eligibility
Education: Full-time Bachelor’s degree (B.Tech / B.E. / B.Sc.) in Computer Science, IT, or a related engineering discipline.
Experience: 0.5 to 1 year of hands-on experience in an SRE or cloud infrastructure role (preferably on AWS).
Cloud Knowledge: Strong grasp of multi-tenant SaaS architectures and native AWS environments. AWS or ITIL certifications are highly preferred.
Monitoring Tools: Practical exposure to observability platforms (New Relic, Grafana, ELK stack, etc.) and incident alerting tools such as PagerDuty.
Scripting: Ability to write scripts in Python (or equivalent) to automate monitoring tasks and incident responses.
Mindset: Excellent problem-solving skills, rigorous documentation discipline for RCAs, and the ability to thrive under pressure during live outages.