Job Description

Req ID: 354116 

We are currently seeking an AI Platform Site Reliability Engineering Specialist to join our team in Bengaluru, Karnātaka (IN-KA), India (IN).

What you'll do in the role:

Below is a sample of potential responsibilities depending on product/focus area:

  • Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
  • Design and build automation for core platform capabilities, reducing manual toil
  • Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
  • Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
  • Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
  • Perform capacity planning, scaling strategies, workload scheduling...