Job Description


Software Engineer (SDE-2) – DevOps, SRE & MLOps Platform Engineering

Location: Bengaluru

Employment Type: Full-time

Team: Platform Engineering / Reliability


About Blue Machines


Blue Machines powers large-scale, real-time Voice AI platforms and Agentic Workflows for global enterprises across BFSI, Healthcare, HRTech and customer experience domains.

Built and scaled from India, our platform has processed 14.5M+ minutes of production-grade AI agent conversations and operates latency-sensitive, always-on voice systems across geographies.


About the Role


We are hiring a hands-on DevOps / SRE engineer who will own platform reliability, observability and automation, and grow into MLOps and AI platform engineering.

This role focuses on designing, operating and evolving the infrastructure behind real-time Voice AI systems. You will work directly on production systems at global scale, driving uptime, performance and resilience.


Key Responsibilities


Platform Reliability & SRE


  • Own 99.9%+ platform uptime for real-time Voice AI workloads.
  • Participate in on-call rotations, incident response and post-incident reviews.
  • Lead root cause analysis (RCA) and drive permanent reliability improvements.
  • Design and implement self-healing systems using automation, retries, circuit breakers and failover strategies.


Kubernetes & Cloud Infrastructure


  • Design, operate and scale Kubernetes clusters in public cloud environments.
  • Work with managed Kubernetes platforms such as GKE, and apply cloud-native best practices.
  • Implement auto-scaling strategies (HPA, VPA, node pools, GPU workloads).
  • Manage infrastructure using Infrastructure as Code (Terraform).
  • Optimize infrastructure for performance, reliability and cost efficiency.


Observability & Incident Intelligence


  • Build and maintain monitoring, logging and alerting systems using Prometheus, Grafana, Loki and OpenTelemetry.
  • Define SLIs, SLOs and error budgets for platform and AI workloads.
  • Drive signal-based alerting to reduce noise and improve response quality.
  • Implement anomaly detection and predictive alerting for infrastructure and AI pipelines.


CI/CD & Platform Automation


  • Design and maintain CI/CD pipelines for services and infrastructure.
  • Build internal automation tooling for:
      • Progressive and canary deployments
      • Auto-scaling and capacity planning
      • Faster incident diagnosis and recovery
  • Enable self-service DevOps workflows for engineering teams.


MLOps & AI Platform Reliability


  • Own reliability and performance of STT, TTS and LLM inference pipelines.
  • Design provider routing, failover and SLA enforcement mechanisms.
  • Deploy, version and roll back AI models and inference services.
  • Monitor inference latency, quality and drift in production systems.
  • Operate GPU-backed inference workloads where applicable.


Security, Compliance & Resilience


  • Enforce DevSecOps practices across build and deploy pipelines.
  • Implement network policies, encryption, secrets management and access controls.
  • Drive disaster recovery, backup strategies and resilience testing.
  • Contribute to SOC2 / ISO compliance and audits.


Collaboration & Engineering Excellence


  • Partner with backend, AI and platform teams on architecture and reliability.
  • Influence system design through a reliability-first mindset.
  • Mentor junior engineers and raise the overall bar for operational excellence.


Qualifications


Must-Have


  • 3–6 years of experience in DevOps, SRE or Platform Engineering roles.
  • Strong hands-on experience with Kubernetes and Docker in production environments.
  • Familiarity with public cloud platforms and managed Kubernetes services (such as GKE).
  • Strong understanding of distributed systems and production debugging.
  • Hands-on experience with observability systems.
  • Proficiency with Infrastructure as Code (Terraform).
  • Strong incident ownership and communication skills.


Good-to-Have


  • Experience with MLOps or AI inference platforms.
  • Familiarity with LLM pipelines, real-time streaming or telephony systems.
  • Experience operating GPU workloads.
  • Knowledge of AIOps, anomaly detection or intelligent alerting.
  • Cloud cost optimization experience.


Why Blue Machines


  • Build global-scale AI infrastructure from India.
  • Operate real-time Voice AI systems with 14.5M+ minutes in production.
  • Work on low-latency, high-reliability platforms.
  • Grow from DevOps/SRE into MLOps and AI platform engineering.
  • High ownership, deep technical impact and real production scale.
