Job Description
Responsibilities:
- Lead resolution of high-severity/complex incidents across hybrid infrastructure.
- Architect and implement automation frameworks, self-healing workflows, and AI-driven ops.
- Define SRE best practices, reliability SLIs/SLOs/SLAs, and operational standards.
- Partner with application and platform engineering teams to improve resilience.
- Drive observability maturity: predictive monitoring, anomaly detection, automated RCA.
- Own continuous improvement of the runbooks and automation pipelines used by Engineer(s)/Sr Engineer(s).
- Provide technical leadership, mentor junior SREs, and conduct training.
- Identify new technologies, tools, and processes that elevate operational excellence.
Skills:
Mandatory Skills (Must-Have):
Incident Command & Complex Troubleshooting:
- Expectation: Take leadership during high-severity outages, orchestrating technical response across teams.
- Example: Lead a Sev-1 bridge call where multiple microservices are failing due to cascading Kubernetes issues; coordinate DB, infra, network, security and app teams to isolate the problem.
Deep Kubernetes & Distributed Systems Expertise:
- Expectation: Design, troubleshoot, and optimize complex Kubernetes clusters and multi-region deployments.
- Example: Diagnose why inter-cluster communication in a service mesh is causing intermittent API failures and propose architectural fixes.
Automation Framework Design (Infra & Ops):
- Expectation: Architect automation platforms to reduce manual toil, enable self-service, and support auto-remediation.
- Example: Build an Ansible/Terraform-based automation pipeline that provisions, configures, and tests new app environments with zero manual steps.
Observability Strategy & Advanced Monitoring:
- Expectation: Define enterprise-wide observability standards (SLIs/SLOs/SLAs) and implement anomaly detection and predictive monitoring.
- Example: Roll out a metrics-based SLO framework for all API services with automated burn-rate alerts in Prometheus.
Database & Application Performance Engineering:
- Expectation: Tune databases, caching layers, and app performance to handle scale.
- Example: Identify DB query patterns that degrade API performance and recommend schema/index optimizations.
Cross-Domain SME Knowledge (Networking, Storage, APIs):
- Expectation: Act as a go-to expert across infrastructure layers.
- Example: Troubleshoot why API gateway latency spikes correlate with storage backend bottlenecks.
AI/ML in Operations (AIOps):
- Expectation: Integrate AI-driven platforms for anomaly detection, auto-remediation, and incident prediction.
- Example: Deploy an ML model that predicts storage saturation 24 hours before impact, triggering automated cleanup.
Mentorship & Technical Leadership:
- Expectation: Act as an SME, guiding Engineer(s)/Sr Engineer(s), creating playbooks, and driving operational excellence.
- Example: Conduct deep-dive training sessions on advanced Kubernetes troubleshooting for Sr Engineer(s).
TMUS India Private Limited, operating as TMUS Global Solutions, has engaged ANSR, Inc. ("ANSR") as its exclusive recruiting partner. That means that any communications regarding TMUS Global Solutions opportunities or employment offers will be issued only through ANSR and the 1Recruit platform. If you receive a communication or offer from another individual or entity, please notify TMUS Global Solutions immediately.
TMUS Global Solutions will never seek any payment or other compensation during the hiring process or request sensitive personal data (such as bank details or government-issued identification numbers) prior to a candidate's acceptance of a formal offer.