Job Description
Responsibilities:
- Resolve escalated incidents across Kubernetes, API Proxy, WAF, database, and infrastructure platforms.
- Design and improve runbooks, automating manual steps wherever possible.
- Lead and contribute to building self-healing systems and self-service tooling for users.
- Analyze incident trends and propose improvements to monitoring, capacity, and reliability.
- Collaborate with engineering teams on deployment, upgrades, and performance optimization.
- Conduct postmortems, document root cause analyses (RCAs), and ensure lessons learned are captured.
- Mentor and coach engineers.
Skills:
Mandatory Skills (Must-Have)
Advanced Incident Troubleshooting & Resolution:
- Expectation: Diagnose and resolve escalated incidents that engineers cannot handle, often spanning multiple layers (infrastructure, application, network).
- Example: For an API outage,...
Apply for this Position
Ready to join TMUS Global Solutions? Click the button below to submit your application.