Job Description
Experience: 8+ years
Solid understanding of Google SRE principles and practices.
Hands-on experience implementing SLIs, SLOs, and error budgets.
Hands-on automation experience, preferably in Python (Observability as
Code).
Expertise in incident management, postmortems, and reliability improvement
cycles.
Experience with monitoring and observability tools (e.g., Prometheus, Grafana,
New Relic, Datadog, OpenTelemetry).
Strong expertise in logging, tracing, and metrics-based troubleshooting.
Ability to design alerts that reflect customer and business impact.
Hands-on experience with Linux, Bash, Git, CI/CD, Docker, and Kubernetes (K8s).
Experience with Infrastructure as Code (Terraform, ARM, CloudFormation,
etc.).
Familiarity with CI/CD pipelines and deployment automation.
Strong focus on eliminating toil through automation.
Good understanding of AWS cloud concepts.
Apply for this Position
Ready to join PeopleLogic? Click the button below to submit your application.