Job Description

Experience: 8+ years


Solid understanding of Google SRE principles and practices.

Hands-on experience implementing SLIs, SLOs, and error budgets.

Hands-on automation experience, preferably in Python (Observability as Code).

Expertise in incident management, postmortems, and reliability improvement cycles.

Experience with monitoring and observability tools (e.g., Prometheus, Grafana, New Relic, Datadog, OpenTelemetry).

Strong expertise in logging, tracing, and metrics-based troubleshooting.

Ability to design alerts that reflect customer and business impact.

Hands-on experience with Linux, Bash, Git, CI/CD, Docker, and Kubernetes (K8s).

Experience with Infrastructure as Code (Terraform, ARM, CloudFormation, etc.).

Familiarity with CI/CD pipelines and deployment automation.

Strong focus on eliminating toil through automation.

Good understanding of AWS cloud concepts.

Good SQL knowledge, applicable to writing NRQL (New Relic Query Language) queries.

Fair understanding of REST APIs and GraphQL.

Good understanding of networking concepts such as CDN, DNS, API gateways, and traffic routing.

Good understanding of business continuity planning (BCP) and disaster recovery activities.

Strong analytical and troubleshooting skills under pressure.

Ability to diagnose complex production issues across multiple system layers.

Comfortable making data-driven decisions during incidents.
