Job Description

About T-Mobile:

T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.


About TMUS Global Solutions:

TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.

TMUS India Private Limited operates as TMUS Global Solutions.


This role is essential for maintaining and improving the reliability and resilience of digital infrastructure systems. It primarily involves automating processes, monitoring system health, and managing incident responses to reduce operational disruptions. The role requires proficiency in programming, scripting, and incident management to support system stability and efficiency. Success is measured by system uptime, reduction in manual interventions, and rapid recovery from incidents. The work directly supports organizational service quality and operational performance by ensuring robust and reliable digital operations.


The Site Reliability Engineer is the first line of defense in monitoring, triaging, and executing standardized operational tasks for all enterprise applications running on standard patterns and platforms like Kubernetes, APIs, WAF, databases, API Proxy (Gloo, APIGEE), Kafka, and Cloud (AWS/Azure/GCP). They will follow runbooks, leverage automation, and escalate appropriately to minimize downtime.


Responsibilities:

  • Monitor system health, alerts, dashboards, and logs across cloud and on-prem infrastructure.
  • Isolate whether a functional issue lies with the application or the platform.
  • Execute standardized runbooks for incident resolution, deployments, and routine tasks.
  • Perform initial triage of incidents and escalate to the Sr. Engineer/Principal Engineer as needed to mitigate the issue and reach a bypass.
  • Document new issues, gaps in runbooks, and automation opportunities.
  • Provide excellent communication to stakeholders during incidents.
  • Support onboarding of new applications into the operations framework.


Mandatory Skills (Must-Have):

System & Infrastructure Monitoring:

  • Expectation: Ability to use monitoring dashboards (e.g., Grafana, Datadog, Splunk, Argos, AIOps) to identify anomalies, follow alert workflows, and escalate when thresholds are breached.
  • Example: When a Kubernetes pod crash-loop is flagged in Prometheus, the engineer should validate it against the runbook, check the pod logs, and escalate if restart attempts fail.
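
As an illustration of the crash-loop example above, the following is a minimal Python sketch that scans the JSON produced by `kubectl get pods -o json` for containers waiting in `CrashLoopBackOff`. The function name, the sample pod, and the restart threshold of 5 are assumptions made for illustration; the actual escalation criteria come from the team's runbook.

```python
import json


def find_crash_looping(pods_json: str, max_restarts: int = 5):
    """Return (pod_name, restart_count, escalate) for containers in CrashLoopBackOff.

    max_restarts is a hypothetical threshold, not a documented policy.
    """
    findings = []
    for pod in json.loads(pods_json).get("items", []):
        name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                restarts = cs.get("restartCount", 0)
                findings.append((name, restarts, restarts >= max_restarts))
    return findings


# Hypothetical sample shaped like `kubectl get pods -o json` output.
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {"name": "api-7d9f"},
            "status": {"containerStatuses": [
                {"restartCount": 7,
                 "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
            ]},
        },
        {
            "metadata": {"name": "web-5c9d"},
            "status": {"containerStatuses": [
                {"restartCount": 0, "state": {"running": {}}}
            ]},
        },
    ]
})

if __name__ == "__main__":
    for pod_name, restarts, escalate in find_crash_looping(SAMPLE):
        action = "escalate" if escalate else "keep following the runbook"
        print(f"{pod_name}: {restarts} restarts, {action}")
```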


Runbook Execution:

  • Expectation: Strictly follow documented steps to resolve standard incidents, and escalate when the steps do not apply or fail.
  • Example: Use a provided runbook to restart a failed API proxy service; if the error persists beyond the documented steps, escalate to the Sr. Engineer/Principal Engineer.
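
The follow-then-escalate pattern described above can be sketched roughly as below. This is only an illustration: the step names and the `run_runbook` helper are hypothetical, and each step is assumed to be a callable that returns True on success.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action returns True when it succeeded.
Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> dict:
    """Run documented steps in order and stop at the first one that fails."""
    completed = []
    for name, action in steps:
        try:
            ok = action()
        except Exception:
            ok = False  # an unexpected error counts as a failed step
        if not ok:
            return {"status": "escalate", "failed_step": name, "completed": completed}
        completed.append(name)
    return {"status": "resolved", "failed_step": None, "completed": completed}


if __name__ == "__main__":
    # Hypothetical steps standing in for a documented proxy-restart runbook.
    outcome = run_runbook([
        ("restart proxy service", lambda: True),
        ("verify health endpoint", lambda: False),
    ])
    print(outcome)
```

Recording which steps completed before the failure gives the escalation target a clear starting point.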


Incident Triage & Communication:

  • Expectation: Perform first-line triage of alerts, gather logs/metrics, categorize severity, and notify stakeholders in clear, concise language.
  • Example: For a database connection timeout, collect error logs, verify service reachability, and provide a detailed incident note to the Sr. Engineer/Principal Engineer before escalation.
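
One possible shape for such an incident note is sketched below. The `IncidentNote` class, its fields, and the severity label are assumptions made for illustration; real notes would follow the team's own template.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class IncidentNote:
    """Hypothetical container for first-line triage findings."""
    service: str
    symptom: str
    severity: str
    reachable: bool
    log_lines: List[str] = field(default_factory=list)
    observed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def render(self) -> str:
        recent = self.log_lines[-5:]  # keep the note short: last five lines only
        lines = [
            f"[{self.severity}] {self.service}: {self.symptom}",
            f"Observed (UTC): {self.observed_at.isoformat(timespec='seconds')}",
            f"Service reachable: {'yes' if self.reachable else 'no'}",
            "Recent errors:",
        ]
        if recent:
            lines.extend(f"  - {line}" for line in recent)
        else:
            lines.append("  - none captured")
        return "\n".join(lines)


if __name__ == "__main__":
    note = IncidentNote(
        service="orders-db",              # hypothetical service name
        symptom="connection timeout",
        severity="SEV2",                  # hypothetical severity label
        reachable=True,
        log_lines=["timeout after 30s"],
    )
    print(note.render())
```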


Kubernetes (Cloud or On-Prem) Operations Knowledge:

  • Expectation: Ability to check pod status, understand logs, and verify service endpoints using kubectl and monitoring tools.
  • Example: Run kubectl get pods -n <namespace> to verify that deployments are healthy.
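
For illustration, the plain-text table printed by `kubectl get pods` can be screened with a few lines of Python. The sketch below assumes the default column order (NAME, READY, STATUS, ...) and treats a pod as healthy when it is `Completed`, or `Running` with all containers ready; the sample output and the helper name are hypothetical.

```python
def unhealthy_pods(kubectl_output: str):
    """Return (name, status, ready) for pods that do not look healthy."""
    bad = []
    # Skip the header row, then read only the first three columns.
    for line in kubectl_output.strip().splitlines()[1:]:
        parts = line.split()
        if len(parts) < 3:
            continue
        name, ready, status = parts[:3]
        ready_now, _, ready_total = ready.partition("/")
        healthy = status == "Completed" or (
            status == "Running" and ready_now == ready_total
        )
        if not healthy:
            bad.append((name, status, ready))
    return bad


# Hypothetical output in the default `kubectl get pods` layout.
SAMPLE_OUTPUT = """\
NAME                  READY   STATUS             RESTARTS   AGE
web-5c9d8f7b6-abcde   1/1     Running            0          3d
api-7d9f6c5b4-fghij   0/1     CrashLoopBackOff   12         40m
job-28391-xyz         0/1     Completed          0          2h
"""

if __name__ == "__main__":
    for name, status, ready in unhealthy_pods(SAMPLE_OUTPUT):
        print(f"{name}: {status} ({ready} ready)")
```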


Scripting (Python, Bash, PowerShell):

  • Expectation: Ability to read scripts and make small edits to them to automate repetitive checks.
  • Example: Modify a Bash script to include an additional log path in a health check.
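
The same kind of small edit is shown here in Python rather than Bash, as a rough sketch. The log paths are hypothetical; the point is that adding one entry to the list extends the health check without touching the logic.

```python
import os

# Hypothetical log locations; the second entry is the newly added path.
LOG_PATHS = [
    "/var/log/app/service.log",
    "/var/log/app/worker.log",
]


def check_logs(paths):
    """Return {path: 'ok' | 'missing' | 'empty'} for each log file."""
    results = {}
    for path in paths:
        if not os.path.isfile(path):
            results[path] = "missing"
        elif os.path.getsize(path) == 0:
            results[path] = "empty"
        else:
            results[path] = "ok"
    return results


if __name__ == "__main__":
    for path, state in check_logs(LOG_PATHS).items():
        print(f"{state:8} {path}")
```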


Networking & Security Awareness:

  • Expectation: Understand basic network troubleshooting tools (ping, netstat, curl, traceroute) and know when issues may be related to a firewall, WAF, or proxy.
  • Example: For an unreachable service, confirm DNS resolution and connectivity before escalating to the Sr. Engineer/Principal Engineer.
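
A minimal sketch of that DNS-then-connectivity check, using only the Python standard library, might look like the following. The function name, the example host, and the 3-second timeout are assumptions; a failed TCP connect alone does not identify whether a firewall, WAF, or proxy is the cause.

```python
import socket


def check_reachability(host: str, port: int, timeout: float = 3.0) -> dict:
    """Resolve the host first, then attempt a TCP connection to host:port."""
    result = {"host": host, "port": port, "dns_ok": False,
              "tcp_ok": False, "addresses": []}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return result  # name did not resolve; no point trying to connect
    result["dns_ok"] = True
    result["addresses"] = sorted({info[4][0] for info in infos})
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["tcp_ok"] = True
    except OSError:
        pass  # resolved but not reachable (refused, filtered, or timed out)
    return result


if __name__ == "__main__":
    # Hypothetical target; replace with the service named in the alert.
    print(check_reachability("localhost", 443))
```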


Documentation & Knowledge Capture:

  • Expectation: Accurately record steps taken during incidents and suggest runbook updates where gaps exist.
  • Example: After handling an alert for disk usage, note missing cleanup steps in the runbook and flag for update.
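
For context, a disk usage alert of the kind mentioned above could be reproduced with a short check like this one. It is a sketch only: the 85% threshold and the function names are assumptions, not values taken from any runbook.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # guard against pseudo-filesystems reporting zero size
    return 100.0 * usage.used / usage.total


def disk_alert(path: str = "/", threshold_percent: float = 85.0) -> bool:
    """True when usage is at or above the (hypothetical) alert threshold."""
    return disk_usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    pct = disk_usage_percent(".")
    print(f"{pct:.1f}% used, alert={disk_alert('.')}")
```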


Preferred Skills (Nice-to-Have):

Cloud Platform Familiarity (AWS, Azure, GCP):

  • Expectation: Understand basics of cloud services (VMs, load balancers, storage) and how to navigate a cloud console.
  • Example: Use AWS Console to check EC2 instance health status when a service alert is triggered.


Database Basics (SQL/NoSQL):

  • Expectation: Run simple queries to validate DB connectivity and health.
  • Example: Execute SELECT 1; to verify a database is reachable.
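
The `SELECT 1;` check can be wrapped in a small helper, sketched below. Python's built-in `sqlite3` module stands in for whichever database driver is actually in use, and the `db_is_reachable` name and the connection-factory argument are assumptions for illustration.

```python
import sqlite3


def db_is_reachable(connect) -> bool:
    """Open a connection via `connect()` and confirm SELECT 1 returns 1."""
    try:
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute("SELECT 1;")
            row = cur.fetchone()
            return row is not None and row[0] == 1
        finally:
            conn.close()
    except Exception:
        return False  # any connection or query error means "not reachable"


if __name__ == "__main__":
    # In-memory SQLite is used only as a stand-in for a real database.
    print(db_is_reachable(lambda: sqlite3.connect(":memory:")))
```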


Automation & Self-Service Mindset:

  • Expectation: Identify repetitive manual steps and propose candidates for automation.
  • Example: Flag that manual log collection during outages could be replaced with a script.


Exposure to Incident Management Tools (xMatters, ServiceNow, Jira, etc.):

  • Expectation: Comfortable working within ITSM/incident workflows.
  • Example: Log incident details in ServiceNow with accurate categorization and timestamps.


AI/Chatbot-Assisted Ops (emerging skill):

  • Expectation: Use AI assistants to search runbooks or suggest remediation steps.
  • Example: Ask an AI ops assistant to summarize logs before escalation.


Qualifications:

  • 5+ years in an IT operations, NOC, or SRE/DevOps engineering role.
  • Foundational (101-level) knowledge of Kubernetes, Linux, and networking.
  • Understanding of cloud-ready applications
  • Understanding of observability tools (Prometheus, Grafana, ELK, Splunk, etc.).
  • Strong troubleshooting mindset and the ability to follow structured workflows (e.g., 5 Whys and Fishbone analysis).


TMUS India Private Limited, operating as TMUS Global Solutions, has engaged ANSR, Inc. ("ANSR") as its exclusive recruiting partner. That means that any communications regarding TMUS Global Solutions opportunities or employment offers will be issued only through ANSR and the 1Recruit platform. If you receive a communication or offer from another individual or entity, please notify TMUS Global Solutions immediately.

TMUS Global Solutions will never seek any payment or other compensation during the hiring process or request sensitive personal data (such as bank details or government-issued identification numbers) prior to a candidate’s acceptance of a formal offer.
