Job Description
Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
Design and build automation for core platform capabilities, reducing manual toil
Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
Perform capacity planning, scaling strategies, workload sche...
Req ID:
We are currently seeking an AI Platform Site Reliability Engineering Specialist to join our team in Bengaluru, Karnātaka (IN-KA), India (IN).
What you'll do in the role:
Below is a sample of potential responsibilities depending on product/focus area: