Job Description
We are seeking a Site Reliability Engineer (SRE) to drive reliability, availability, performance, and scalability across our production application stack. This role owns the health and uptime of our mission-critical SaaS platform, cloud infrastructure, database systems, and observability ecosystem.
This role also actively invites engineers with a strong software development background who are keen to transition into Site Reliability Engineering and work closely with production systems, platform reliability, and operational excellence.
Key Responsibilities
- Ensure high availability, performance, and scalability of production applications and services
- Own incident management, including on-call rotations, root cause analysis, and post-incident reviews
- Manage and operate MS SQL Server clusters, high-availability configurations, and performance tuning
- Design, implement, and continuously improve monitoring, alerting, and observability using tools such as New Relic, Logz.io, and AWS CloudWatch
- Proactively identify system bottlenecks and drive automation and reliability improvements
- Define, measure, and improve SLOs, SLAs, and error budgets across services
- Drive disaster recovery planning, testing, and availability simulations
- Collaborate closely with application engineering teams to improve production readiness, performance, and reliability through code-level insights and tooling
- Partner with CloudOps and DevOps teams on infrastructure automation and platform enhancements
- Use Jira and Jira Service Management to manage incidents, changes, and operational workflows
Qualifications & Experience
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
- 3–5 years of experience in Site Reliability Engineering, CloudOps, DevOps, Platform Engineering, or related roles
- Backend or platform software development experience with exposure to production systems, reliability, or operations is a strong plus
Must Have Skills:
- Certifications in AWS, Microsoft, Windows, SQL Server, or SRE disciplines.
- Exposure to New Relic APM and IaC automation is a plus.
- Experience working in a 24x7 on-call rotation.
- Strong knowledge of the Windows OS ecosystem, IIS, and MS SQL Server administration, clustering, performance tuning, and failover.
- Deep experience with monitoring and logging tools such as New Relic, Logz.io, and AWS CloudWatch.
- Experience with AWS (EC2, ASG, CloudWatch, CloudTrail, VPC) and infrastructure management.
- Good understanding of networking, DNS, load balancing, and security principles.
- Proficient in scripting languages such as PowerShell and Python.
- Strong understanding of incident response, change management, and postmortem culture.
- Experience using Jira and Jira Service Management for operational workflows.
- Ability to work independently and drive technical initiatives.