Job Description

Responsibilities:

  • Resolve escalated incidents across Kubernetes, API Proxy, WAF, databases, and infrastructure platforms.
  • Design and improve runbooks, automating manual steps wherever possible.
  • Lead and contribute to building self-healing systems and self-service tooling for users.
  • Analyze incident trends and propose improvements in monitoring, capacity, and reliability.
  • Collaborate with engineering teams on deployments, upgrades, and performance optimization.
  • Conduct postmortems, document root cause analysis (RCA), and ensure lessons learned are captured.
  • Mentor and coach Engineer(s).

Skills:

Mandatory Skills (Must-Have)

Advanced Incident Troubleshooting & Resolution:

  • Expectation: Diagnose and resolve escalated incidents that Engineer(s) cannot handle, often across multiple layers (infrastructure, application, network).
  • Example: For an API outage,...

Apply for this Position

Ready to join TMUS Global Solutions? Submit your application.