Job Description

Role Summary

We are looking for a highly skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our cloud-native infrastructure. The ideal candidate will bring strong hands-on experience with AWS, Kubernetes, Docker, CI/CD pipelines, monitoring, and Python automation, and will work closely with development and operations teams to build resilient, highly available systems.

Key Responsibilities

  • Design, deploy, and maintain highly available and scalable systems on AWS
  • Manage and operate containerized applications using Docker and Kubernetes (EKS)
  • Build, maintain, and optimize CI/CD pipelines using Jenkins
  • Automate operational workflows and routine tasks using Python scripting
  • Implement and manage monitoring, alerting, and observability using Grafana and Prometheus
  • Ensure system reliability, performance, uptime, and scalability
  • Participate in incident response, root cause analysis (RCA), and post-incident reviews
  • Implement Infrastructure as Code (IaC) and automation best practices
  • Collaborate with development teams to improve system architecture and deployment strategies
  • Enforce security, compliance, and operational best practices in cloud environments
  • Continuously improve system efficiency through automation, tooling, and process optimization

Required Skills & Qualifications

  • Strong hands-on experience with AWS services (EC2, S3, IAM, VPC, RDS, EKS, etc.)
  • Solid experience with Kubernetes (EKS) and Docker
  • Proficiency in Python scripting for automation and monitoring
  • Experience designing and managing CI/CD pipelines using Jenkins
  • Strong understanding of DevOps principles and CI/CD best practices
  • Hands-on experience with Grafana and Prometheus for monitoring and alerting
  • Strong knowledge of Linux systems and networking fundamentals
  • Experience with Git or other version control systems
  • Understanding of microservices architecture

Good to Have

  • Experience with Terraform or CloudFormation
  • Knowledge of Helm, ArgoCD, or similar deployment tools
  • Familiarity with log management tools (ELK / EFK stack)
  • Understanding of SRE practices such as SLIs, SLOs, SLAs, and error budgets
  • AWS certifications and/or Kubernetes certifications (CKA / CKAD)
