Job Description

Responsibilities

  • Apply SRE principles to ensure the reliability, availability, scalability, and performance of production systems
  • Design, implement, and maintain automation and Infrastructure as Code to reduce operational toil and manual intervention
  • Operate and optimize services in AWS and containerized environments (EKS/ECS)
  • Ensure the platform aligns with compliance requirements
  • Build and operate CI/CD pipelines using GitLab
  • Define and implement Service Level Objectives (SLOs) and error budgets
  • Implement and maintain observability solutions including metrics, logs, and traces to proactively detect and diagnose system issues
  • Contribute to incident response, including triage, mitigation, root cause analysis (RCA), and post-incident reviews
  • Identify systemic reliability risks, performance bottlenecks, and capacity constraints; collaborate with the team to address them
  • Work closely with de...
