Job Description
Responsibilities
- Apply SRE principles to ensure the reliability, availability, scalability, and performance of production systems
- Design, implement, and maintain automation and Infrastructure as Code to reduce operational toil and manual intervention
- Operate and optimize services in AWS and containerized environments (EKS/ECS)
- Ensure the platform aligns with compliance requirements
- Build and operate CI/CD pipelines using GitLab
- Define and implement Service Level Objectives (SLOs) and error budgets
- Implement and maintain observability solutions including metrics, logs, and traces to proactively detect and diagnose system issues
- Contribute to incident response, including triage, mitigation, root cause analysis (RCA), and post-incident reviews
- Identify systemic reliability risks, performance bottlenecks, and capacity constraints; collaborate with the team to address them
- Work closely with de...