Job Description
We are seeking a Site Reliability Engineer (SRE) with deep expertise in monitoring, observability, and reliability engineering to support systems running across on-premises infrastructure and Google Cloud Platform (GCP).
This role is primarily responsible for designing, operating, and improving monitoring, alerting, and observability platforms, with a strong focus on Grafana and Kubernetes environments.
As a secondary responsibility, this role provides backup coverage for the Application Support team during periods of resource constraints or major incidents, offering L2/L3 technical support when required.
Responsibilities
Monitoring & Observability (Core Focus)
- Own and operate the monitoring and observability stack across on-prem and GCP environments
- Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
- Define, tune, and maintain alerts to ensure a high signal-to-noise ratio
- Establ...
Apply for this Position
Ready to join Devsu? Submit your application.