Job Description
Responsibilities:
- Lead resolution of high-severity/complex incidents across hybrid infrastructure.
- Architect and implement automation frameworks, self-healing workflows, and AI-driven operations.
- Define SRE best practices, reliability SLIs/SLOs/SLAs, and operational standards.
- Partner with application and platform engineering teams to improve resilience.
- Drive observability maturity: predictive monitoring, anomaly detection, automated RCA.
- Own continuous improvement of the runbooks and automation pipelines used by Engineers and Senior Engineers.
- Provide technical leadership, mentor junior SREs, and conduct training.
- Identify new technologies, tools, and processes that elevate operational excellence.
Skills:
Mandatory Skills (Must-Have):
Incident Command & Complex Troubleshooting:
- Expectation: Take leadership during high-s...
Apply for this Position
Ready to join TMUS Global Solutions? Submit your application.