Job Description

Senior Site Reliability Engineer (SRE)

Company: Pocket FM

About the Role

Pocket FM is a global audio entertainment platform serving millions of listeners across multiple geographies. We are looking for an experienced Senior Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our large-scale audio streaming platform, built on a Kubernetes-first, cloud-native architecture.

In this role, you will own platform stability, improve operational excellence, and work closely with engineering teams to deliver a seamless listening experience to users worldwide.

Key Responsibilities

Reliability & Engineering Excellence

  • Own and improve the reliability, availability, and performance of globally distributed, Kubernetes-based production systems.
  • Define and continuously improve SLIs, SLOs, and SLAs using metrics derived from Prometheus and Grafana.
  • Drive reliability best practices across the entire software development lifecycle.

Kubernetes & Platform Operations

  • Operate and scale production-grade Kubernetes clusters (EKS/GKE) running critical audio streaming and backend services.
  • Troubleshoot complex production issues across pods, nodes, networking, storage, and the Kubernetes control plane.
  • Implement autoscaling, rollout strategies, and resilience patterns for containerized workloads.

CI/CD & GitOps

  • Own and improve CI/CD pipelines using GitHub Actions and Jenkins to ensure safe, reliable, and repeatable deployments.
  • Implement and operate GitOps workflows using Argo CD for Kubernetes application and configuration management.
  • Enforce deployment best practices including canary, blue-green, and rollback strategies.

Observability & Monitoring

  • Build and maintain a strong observability stack using Prometheus (metrics), Grafana (visualization), and Loki (logs).
  • Design effective alerting strategies that reduce noise and improve signal quality.
  • Use observability insights to drive performance tuning, capacity planning, and reliability improvements.

Incident Management & Operational Excellence

  • Lead and participate in incident response for platform, Kubernetes, and database-related issues.
  • Perform post-incident reviews (PIRs) with clear root cause analysis and preventive actions.
  • Improve on-call readiness, runbooks, and operational maturity for 24x7 global systems.

Databases & State Management

  • Support and improve the reliability of MySQL in production, including monitoring, backups, failover, and performance tuning.
  • Collaborate with backend teams on schema changes, query performance, and scaling strategies.

Infrastructure & Automation

  • Design and manage cloud infrastructure integrated with Kubernetes using Infrastructure-as-Code (Terraform).
  • Automate operational tasks using Python and/or Go to reduce toil and improve system resilience.
  • Drive cost and capacity optimization across cloud and Kubernetes environments.

Collaboration & Innovation

  • Work closely with backend, mobile, data, product, and QA teams to embed reliability principles early.
  • Contribute to Pocket FM’s engineering roadmap with a focus on scale, resilience, and operational efficiency.
  • Apply modern SRE and cloud-native best practices pragmatically in production.

Required Skills & Experience

Experience

  • 3+ years of experience in Site Reliability Engineering or platform engineering roles.
  • Proven experience operating large-scale, Kubernetes-based, consumer-facing systems.

Technical Expertise (Must-Have)

  • Strong hands-on expertise with Kubernetes in production environments.
  • Experience with Prometheus, Grafana, and Loki for monitoring, alerting, and logging.
  • Strong experience with CI/CD systems such as GitHub Actions and Jenkins.
  • Hands-on experience with GitOps workflows using Argo CD.
  • Solid experience managing and supporting MySQL in production.
  • Strong experience with AWS and/or GCP.
  • Proficiency in Python and/or Go.
  • Strong Infrastructure-as-Code experience using Terraform.
  • Solid understanding of Linux, networking, and cloud security fundamentals.

Preferred Qualifications

  • Kubernetes certifications (CKA / CKAD / CKS).
  • Cloud certifications (AWS / GCP).
  • Experience supporting platforms with millions of users across multiple regions.
  • Familiarity with structured incident management practices.

Why Pocket FM?

Pocket FM is a global product with a rapidly growing international user base. The role offers the opportunity to work deeply across Kubernetes, observability, and GitOps while solving complex reliability challenges at scale.
