Job Description

Role Summary
We are looking for a highly skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our cloud-native infrastructure. The ideal candidate will bring strong hands-on experience in AWS, Kubernetes, Docker, CI/CD pipelines, monitoring, and automation using Python, and will work closely with development and operations teams to build resilient, highly available systems.
Key Responsibilities
Design, deploy, and maintain highly available and scalable systems on AWS
Manage and operate containerized applications using Docker and Kubernetes (EKS)
Build, maintain, and optimize CI/CD pipelines using Jenkins
Automate operational workflows and routine tasks using Python scripting
Implement and manage monitoring, alerting, and observability using Grafana and Prometheus
Ensure system reliability, performance, uptime, and scalability
Participate in incident response, root cause analysis (RCA), and post-incident reviews
Implement Infrastructure as Code (IaC) and automation best practices
Collaborate with development teams to improve system architecture and deployment strategies
Enforce security, compliance, and operational best practices in cloud environments
Continuously improve system efficiency through automation, tooling, and process optimization
Required Skills & Qualifications
Strong hands-on experience with AWS services (EC2, S3, IAM, VPC, RDS, EKS, etc.)
Solid experience with Kubernetes (EKS) and Docker
Proficiency in Python scripting for automation and monitoring
Experience designing and managing CI/CD pipelines using Jenkins
Strong understanding of DevOps principles and CI/CD best practices
Hands-on experience with Grafana and Prometheus for monitoring and alerting
Strong knowledge of Linux systems and networking fundamentals
Experience with Git or other version control systems
Understanding of microservices architecture
Good to Have
Experience with Terraform or CloudFormation
Knowledge of Helm, Argo CD, or similar deployment tools
Familiarity with log management tools (ELK / EFK stack)
Understanding of SRE practices such as SLIs, SLOs, SLAs, and error budgets
AWS and/or Kubernetes certifications (CKA / CKAD)
