Job Description

MANTECH seeks a motivated, career- and customer-oriented **Site Reliability Engineer (SRE)** for a new initiative. This effort supports the rapid design, deployment, operation, and sustainment of enterprise-scale AI, data, and mission platform capabilities across cloud, edge, and classified operational environments.

This role supports operational reliability, scalability, monitoring, and incident response for enterprise AI systems. You will focus on operational outcomes and on optimizing system performance.

**Responsibilities include but are not limited to:**

+ Apply core reliability engineering principles to ensure high availability and stability of production systems.
+ Manage incident response, root cause analysis, and post-mortem processes for the AI platform.
+ Implement and optimize observability operations using OpenTelemetry, Prometheus, Grafana, Loki, or Tempo.
+ Oversee capacity planning, performance optimization, and FinOps practices.
