Job Description

We are seeking a Site Reliability Engineer (SRE) to drive reliability, availability, performance, and scalability across our production application stack. This role owns the health and uptime of our mission-critical SaaS platform, cloud infrastructure, database systems, and observability ecosystem.


We also welcome engineers with a strong software development background who want to move into Site Reliability Engineering and work closely on production systems, platform reliability, and operational excellence.


Key Responsibilities

  • Ensure high availability, performance, and scalability of production applications and services
  • Own incident management, including on-call rotations, root cause analysis, and post-incident reviews
  • Manage and operate MS SQL Server clusters, including high-availability configurations and performance tuning
  • Design, implement, and continuously improve monitoring, alerting, and observability using tools such as New Relic, Logz.io, and AWS CloudWatch
  • Proactively identify system bottlenecks and drive automation and reliability improvements
  • Define, measure, and improve SLOs, SLAs, and error budgets across services
  • Drive disaster recovery planning, testing, and availability simulations
  • Collaborate closely with application engineering teams to improve production readiness, performance, and reliability through code-level insights and tooling
  • Partner with CloudOps and DevOps teams on infrastructure automation and platform enhancements
  • Use Jira and Jira Service Management to manage incidents, changes, and operational workflows


Qualifications & Experience

  • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
  • 3–5 years of experience in Site Reliability Engineering, CloudOps, DevOps, Platform Engineering, or related roles
  • Backend or platform software development experience with exposure to production systems, reliability, or operations is a strong plus


Must-Have Skills

  • Certifications in AWS, Microsoft, Windows, SQL Server, or SRE disciplines.
  • Exposure to New Relic APM and IaC automation is a plus.
  • Experience working in a 24x7 on-call rotation.
  • Strong knowledge of the Windows OS ecosystem, IIS, and MS SQL Server administration, including clustering, performance tuning, and failover.
  • Deep experience with monitoring and logging tools such as New Relic, Logz.io, and AWS CloudWatch.
  • Experience with AWS (EC2, ASG, CloudWatch, CloudTrail, VPC) and infrastructure management.
  • Good understanding of networking, DNS, load balancing, and security principles.
  • Proficient in scripting languages such as PowerShell and Python.
  • Strong understanding of incident response, change management, and postmortem culture.
  • Experience using Jira and Jira Service Management for operational workflows.
  • Ability to work independently and drive technical initiatives.
