Job Description

Experience: 8+ years


Solid understanding of Google SRE principles and practices.

Hands-on experience implementing SLIs, SLOs, and error budgets.

Hands-on automation experience, preferably in Python (Observability as Code).

Expertise in incident management, postmortems, and reliability improvement cycles.

Experience with monitoring and observability tools (e.g., Prometheus, Grafana, New Relic, Datadog, OpenTelemetry).

Strong expertise in logging, tracing, and metrics-based troubleshooting.

Ability to design alerts that reflect customer and business impact.

Hands-on experience with Linux, Bash, Git, CI/CD, Docker, and Kubernetes (K8s).

Experience with Infrastructure as Code (Terraform, ARM, CloudFormation, etc.).

Familiarity with CI/CD pipelines and deployment automation.

Strong focus on eliminating toil through automation.

Good understanding of AWS cloud concepts.

Good ...

Apply for this Position

Ready to join PeopleLogic? Click the button below to submit your application.
