Job Description

Job Overview:

At T-Mobile, we don't just build technology; we empower people. We believe in investing in YOU: your growth, your impact, and your future. We're unstoppable when individuals like you come together to solve bold challenges, inspire innovation, and build platforms that serve millions.

As a Principal Site Reliability Engineer, you'll join a world-class engineering team focused on building and scaling intelligent infrastructure for LLM-based applications, AI services, and enterprise-scale backend systems. You'll contribute to the design and implementation of observability, automation, and incident response strategies that ensure our platforms are high-performing, reliable, and cost-effective. You'll play a key role in driving operational excellence, supporting platform scalability, and collaborating across engineering and architecture teams. This role provides growth opportunities to influence large-scale architecture and AI/ML reliability.

Key Responsibilities:

  • Design, develop and maintain observability, monitoring, and alerting systems for AI platforms and mission-critical backend services.
  • Design telemetry pipelines, logging infrastructure, and metrics dashboards using tools such as Splunk, Prometheus, Grafana, and OpenTelemetry.
  • Define and maintain SLOs, SLIs, and real-time health indicators across platform services and APIs.
  • Participate in on-call rotations and lead the resolution of high-impact incidents, including root cause analysis and postmortem reporting.
  • Collaborate with platform engineering teams to enforce governance, compliance, and security standards in production environments.
  • Enhance deployment pipelines, CI/CD workflows, and infrastructure automation (e.g., GitLab).
  • Optimize and scale infrastructure components such as Kafka, HAProxy, RMQ, databases, and distributed APIs.
  • Support capacity planning, cost analysis, and system tuning to improve platform performance.
  • Advocate for automation-first operations, reducing manual toil through scripting and reliability tooling.
  • Create and maintain documentation, runbooks, and knowledge-sharing resources across SRE and engineering teams.
  • Mentor junior engineers and foster a culture of technical rigor and continuous improvement.

Qualifications:

  • Bachelor's degree in Computer Science, Engineering, or a related field (Master's preferred).
  • 10+ years of experience in SRE, DevOps, or operations engineering in cloud-based environments, with 15+ years of overall experience in the technology space.
  • Hands-on experience with monitoring, alerting, and incident response in distributed systems.
  • Strong coding and scripting skills in Python, Java, or shell scripting languages such as Bash or PowerShell.
  • Solid understanding of database principles and experience with distributed storage solutions such as Oracle, Cassandra, Solr, and Kafka.
  • Proficiency in CI/CD pipelines and GitLab workflows.
  • Strong working knowledge of SQL and NoSQL databases, including Oracle and Cassandra.
  • Expertise in Linux, networking concepts (TLS/SSL, DNS, load balancers), and troubleshooting large-scale environments.
  • Familiarity with AI/ML systems, APIs, and modern LLM tooling is a strong plus.
  • Expertise in observability tools such as Splunk, Grafana, and Prometheus.
  • Experience with Kubernetes, container orchestration, and hybrid/multi-cloud deployments (Azure preferred; AWS/GCP/OCI acceptable).
  • Deep understanding of security concepts and protocols, including authentication, authorization, encryption, SSL/TLS, SSH/SFTP, PKI, X.509 certificates, and PGP.
  • Excellent knowledge of ITIL/ServiceNow terminology for incident and problem management.
  • Proven ability to work in fast-paced, incident-driven environments with high uptime requirements.

Preferred Qualifications:

  • Experience supporting AI workloads, model inference systems, or LLM-enabled platforms.
  • Exposure to AIOps or related ML platform observability and reliability practices.
  • Familiarity with LangChain, OpenAI, Spring AI, and MCP Server is a strong plus.
  • Experience in highly regulated telecom environments with compliance and audit controls.
  • Understanding of AI Gateway patterns and secure API orchestration.
  • Background in building secure, zero-downtime platforms with enterprise-scale SLAs.

Knowledge, Skills, and Abilities:

Why Join T-Mobile India?

At T-Mobile India, you won't just contribute to world-class technology; you'll help build it. You'll work with global leaders, solve complex system challenges, and build platforms that redefine how technology powers customer experience.

We're more than just a telecom company; we're a technology powerhouse leading the way in AI, data, and digital innovation. And we do it all with heart, grit, and a passion for empowering people.

Join us and shape the future of intelligent platforms that serve millions at the scale and speed of T-Mobile.

TMUS India Private Limited, operating as TMUS Global Solutions, has engaged ANSR, Inc. ("ANSR") as its exclusive recruiting partner. That means that any communications regarding TMUS Global Solutions opportunities or employment offers will be issued only through ANSR and the 1Recruit platform. If you receive a communication or offer from another individual or entity, please notify TMUS Global Solutions immediately.

TMUS Global Solutions will never seek any payment or other compensation during the hiring process or request sensitive personal data (such as bank details or government-issued identification numbers) before a candidate accepts a formal offer.
