Job Description

  • 5+ years in observability, monitoring, or reliability engineering roles.
  • Hands-on experience with observability tools such as Prometheus, Grafana, Splunk, and Coralogix, and with external monitoring tools (e.g., Catchpoint, ThousandEyes).
  • Strong scripting skills in Python, plus Bash or PowerShell for automation.
  • Experience with Terraform and Ansible for infrastructure automation.
  • Solid understanding of SLIs, SLOs, error budgets, and reliability engineering principles.
  • Familiarity with Linux environments and distributed systems.

Responsibilities

  • Design and implement a Universal Dashboard in Grafana for leadership and engineering visibility.
  • Ensure a consistent look and feel across all observability views.
  • Define and implement SLIs, SLOs, and error budgets for critical services.
  • Establish alerting thresholds and escalation workflows aligned with reliability goals.
  • Integrate anomaly detection and AI-assisted insights into the observability platform.
  • Contribute to self-healing workflows and automated remediation strategies.
  • Partner with engineering teams to instrument services with metrics, logs, and traces.
  • Provide documentation and best practices for observability adoption across teams.

Apply for this Position

Ready to join? Click the button below to submit your application.

Submit Application