Job Description
- 5+ years in observability, monitoring, or reliability engineering roles.
- Hands-on experience with observability tools such as Prometheus, Grafana, Splunk, and Coralogix, as well as external monitoring tools (e.g., Catchpoint, ThousandEyes).
- Strong scripting skills in Python, plus Bash or PowerShell for automation.
- Experience with Terraform and Ansible for infrastructure automation.
- Solid understanding of SLIs, SLOs, error budgets, and reliability engineering principles.
- Familiarity with Linux environments and distributed systems.
Responsibilities
- Design and implement a Universal Dashboard in Grafana for leadership and engineering visibility.
- Ensure a consistent look and feel across all observability views.
- Define and implement SLIs, SLOs, and error budgets for critical services.
- Establish alerting thresholds and escalation workflows aligned with reliability goals.
- Integrate anomaly detection and AI-assisted insights into the observability platform.
- Contribute to self-healing workflows and automated remediation strategies.
- Partner with engineering teams to instrument services with metrics, logs, and traces.
- Provide documentation and best practices for observability adoption across teams.