DevOps / MLOps and Platform Engineer

📍 Bangalore, India
Full-time Other-General Posted January 16, 2026
Job Description

Software Engineer (SDE-2) – DevOps, SRE & MLOps Platform Engineering  
Location: Bengaluru
Employment Type: Full-time
Team: Platform Engineering / Reliability

About Blue Machines  

Blue Machines powers large-scale, real-time Voice AI platforms and Agentic Workflows for global enterprises across BFSI, Healthcare, HRTech and customer experience domains.
Built and scaled from India, our platform has processed 14.5M+ minutes of production-grade AI agent conversations, operating latency-sensitive, always-on voice systems across geographies.

About the Role  

We are hiring a hands-on DevOps / SRE engineer who owns platform reliability, observability and automation and grows into MLOps and AI platform engineering.
This role focuses on designing, operating and evolving the infrastructure behind real-time Voice AI systems. You work directly on production systems at global scale, driving uptime, performance and resilience.

Key Responsibilities  

Platform Reliability & SRE  

Own 99.9%+ platform uptime for real-time Voice AI workloads.
Participate in on-call rotations, incident response and post-incident reviews.
Lead root cause analysis (RCA) and drive permanent reliability improvements.
Design and implement self-healing systems using automation, retries, circuit breakers and failover strategies.

Kubernetes & Cloud Infrastructure  

Design, operate and scale Kubernetes clusters in public cloud environments.
Work with managed Kubernetes platforms such as GKE, and apply cloud-native best practices.
Implement auto-scaling strategies (HPA, VPA, node pools, GPU workloads).
Manage infrastructure using Infrastructure as Code (Terraform).
Optimize infrastructure for performance, reliability and cost efficiency.

Observability & Incident Intelligence  

Build and maintain monitoring, logging and alerting systems using Prometheus, Grafana, Loki and OpenTelemetry.
Define SLIs, SLOs and error budgets for platform and AI workloads.
Drive signal-based alerting to reduce noise and improve response quality.
Implement anomaly detection and predictive alerting for infrastructure and AI pipelines.

CI/CD & Platform Automation  

Design and maintain CI/CD pipelines for services and infrastructure.
Build internal automation tooling for progressive and canary deployments, auto-scaling and capacity planning, and faster incident diagnosis and recovery.
Enable self-service DevOps workflows for engineering teams.

MLOps & AI Platform Reliability  

Own reliability and performance of STT, TTS and LLM inference pipelines.
Design provider routing, failover and SLA enforcement mechanisms.
Deploy, version and roll back AI models and inference services.
Monitor inference latency, quality and drift in production systems.
Operate GPU-backed inference workloads where applicable.

Security, Compliance & Resilience  

Enforce DevSecOps practices across build and deploy pipelines.
Implement network policies, encryption, secrets management and access controls.
Drive disaster recovery, backup strategies and resilience testing.
Contribute to SOC2 / ISO compliance and audits.

Collaboration & Engineering Excellence  

Partner with backend, AI and platform teams on architecture and reliability.
Influence system design through a reliability-first mindset.
Mentor junior engineers and raise the overall bar for operational excellence.

Qualifications  

Must-Have  

3–6 years of experience in DevOps, SRE or Platform Engineering roles.
Strong hands-on experience with Kubernetes and Docker in production environments.
Familiarity with public cloud platforms and managed Kubernetes services (such as GKE).
Strong understanding of distributed systems and production debugging.
Hands-on experience with observability systems.
Proficiency with Infrastructure as Code (Terraform).
Strong incident ownership and communication skills.

Good-to-Have  

Experience with MLOps or AI inference platforms.
Familiarity with LLM pipelines, real-time streaming or telephony systems.
Experience operating GPU workloads.
Knowledge of AIOps, anomaly detection or intelligent alerting.
Cloud cost optimization experience.

Why Blue Machines  

Build global-scale AI infrastructure from India.
Operate real-time Voice AI systems with 14.5M+ minutes in production.
Work on low-latency, high-reliability platforms.
Grow from DevOps/SRE into MLOps and AI platform engineering.
High ownership, deep technical impact and real production scale.