Job Description
We are seeking a Site Reliability Engineer (SRE) to drive reliability, availability, performance, and scalability across our production application stack. This role owns the health and uptime of our mission-critical SaaS platform, cloud infrastructure, database systems, and observability ecosystem.
We also welcome engineers with a strong software development background who want to move into Site Reliability Engineering and work closely with production systems, platform reliability, and operational excellence.
Key Responsibilities
- Ensure high availability, performance, and scalability of production applications and services
- Own incident management, including on-call rotations, root cause analysis, and post-incident reviews
- Manage and operate MS SQL Server clusters, including high-availability configurations and performance tuning
- Design, implement, and continuously improve monitoring, alerting, and observability using tools such as New Relic, Logz.io, and AWS CloudWatch
- Proactively identify system bottlenecks and drive automation and reliability improvements
- Define, measure, and improve SLOs, SLAs, and error budgets across services
- Drive disaster recovery planning, testing, and availability simulations
- Collaborate closely with application engineering teams to improve production readiness, performance, and reliability through code-level insights and tooling
- Partner with CloudOps and DevOps teams on infrastructure automation and platform enhancements
- Use Jira and Jira Service Management to manage incidents, changes, and operational workflows
Qualifications & Experience
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
- 3–5 years of experience in Site Reliability Engineering, CloudOps, DevOps, Platform Engineering, or related roles
- Backend or platform software development experience with exposure to production systems, reliability, or operations is a strong plus
Must Have Skills:
- Certifications in AWS, Microsoft, Windows, SQL Server, or SRE disciplines.
- Exposure to New Relic APM and IaC automation is a plus.
- Experience working in a 24x7 on-call rotation.
- Strong knowledge of the Windows OS ecosystem, IIS, and MS SQL Server administration, including clustering, performance tuning, and failover.
- Deep experience with monitoring and logging tools such as New Relic, Logz.io, and AWS CloudWatch.
- Experience with AWS (EC2, ASG, CloudWatch, CloudTrail, VPC) and infrastructure management.
- Good understanding of networking, DNS, load balancing, and security principles.
- Proficiency in scripting languages such as PowerShell and Python.
- Strong understanding of incident response, change management, and postmortem culture.
- Experience using Jira and Jira Service Management for operational workflows.
- Ability to work independently and drive technical initiatives.