Job Description
As a Senior Site Reliability Engineer (SRE), you will be responsible for the reliability, scalability, and observability of our DevOps ecosystem. This includes CI/CD systems, Kubernetes clusters, infrastructure automation, and telemetry platforms. You will work closely with development, QA, and operations teams to build resilient systems and continuously improve reliability standards.
Key Responsibilities:
- Own and manage DevOps components and tooling across 100+ production environments.
- Administer, scale, and optimize Kubernetes clusters used for application and infrastructure workloads.
- Implement and maintain observability stacks including Prometheus, OpenTelemetry (OTel), Elasticsearch, and ClickHouse for metrics, tracing, and log analytics.
- Ensure high availability of CI/CD pipelines and automate infrastructure provisioning using Terraform and Ansible.
- Build alerting, monitoring, and dashboarding systems to proactively detect and resolve issues.
- Lead root cause analysis for incidents and drive long-term stability improvements.
- Collaborate with engineering teams to design systems that are reliable, secure, and observable by default.
- Participate in on-call rotations and lead incident response efforts when necessary.
- Advise the cloud platform team on improving the reliability of production systems and scaling them as needed.
- Participate in the development process by supporting new features, services, and releases, and maintain an ownership mindset for the cloud platform technologies.
- Expertise in at least one programming language: Java, Python, or Go.
- Proficient in writing bash scripts.
- Good understanding of SQL and NoSQL systems.
- Good understanding of systems programming (network stack, file system, OS services).
- Strong hands-on experience with Ansible.
- Ability to automate day-to-day activities.
Required Skills & Experience:
- 5+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.
- Expertise in Kubernetes: deployment, scaling, troubleshooting, and operations in production.
- Strong Linux systems background and scripting skills (Python, Bash, or Go).
- Hands-on experience with CI/CD tools such as Jenkins, GitLab CI, or similar.
- Infrastructure-as-Code skills with tools like Terraform, Ansible, or equivalent.
- Solid knowledge of observability tools, including:
  - Prometheus for monitoring and alerting
  - OpenTelemetry (OTel) for tracing and telemetry
  - Elasticsearch and ClickHouse for log storage and analytics
  - AppDynamics
- Experience with containerization (Docker) and orchestration at scale.
- Familiarity with cloud platforms (AWS, GCP, or Azure) and hybrid-cloud architecture.
- Ability to debug and tune system performance under production load.