Job Description
Software Engineer (SDE-2) – DevOps, SRE & MLOps Platform Engineering
Location: Bengaluru
Employment Type: Full-time
Team: Platform Engineering / Reliability
About Blue Machines
Blue Machines powers large-scale, real-time Voice AI platforms and Agentic Workflows for global enterprises across BFSI, Healthcare, HRTech and customer experience domains.
Built and scaled from India, our platform has processed 14.5M+ minutes of production-grade AI agent conversations, operating latency-sensitive, always-on voice systems across geographies.
About the Role
We are hiring a hands-on DevOps/SRE engineer who will own platform reliability, observability and automation, and grow into MLOps and AI platform engineering.
This role focuses on designing, operating and evolving the infrastructure behind real-time Voice AI systems. You will work directly on production systems at global scale, driving uptime, performance and resilience.
Key Responsibilities
Platform Reliability & SRE
- Own 99.9%+ platform uptime for real-time Voice AI workloads.
- Participate in on-call rotations, incident response and post-incident reviews.
- Lead root cause analysis (RCA) and drive permanent reliability improvements.
- Design and implement self-healing systems using automation, retries, circuit breakers and failover strategies (see the sketch below).
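For context, here is a minimal sketch of the retry-plus-circuit-breaker pattern this bullet refers to. The names, thresholds and delays are illustrative assumptions, not part of the Blue Machines stack:

```python
import random
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows a probe after `cooldown` seconds."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let one probe request through once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()


def call_with_retry(fn, breaker: CircuitBreaker, attempts: int = 3, base_delay: float = 0.2):
    """Retry `fn` with exponential backoff and jitter, gated by the circuit breaker."""
    for attempt in range(attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open: failing fast instead of piling on a sick dependency")
        try:
            result = fn()
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```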
Kubernetes & Cloud Infrastructure
- Design, operate and scale Kubernetes clusters in public cloud environments.
- Work with managed Kubernetes platforms such as GKE, and apply cloud-native best practices.
- Implement auto-scaling strategies (HPA, VPA, node pools, GPU workloads); see the sketch after this list.
- Manage infrastructure using Infrastructure as Code (Terraform).
- Optimize infrastructure for performance, reliability and cost efficiency.
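As one hypothetical illustration of the HPA work mentioned above, the sketch below uses the official Kubernetes Python client to widen an HPA's replica bounds ahead of expected load; the resource names are placeholders, not real services:

```python
from kubernetes import client, config

# Illustrative only: assumes a kubeconfig with access to the target cluster.
config.load_kube_config()
autoscaling = client.AutoscalingV1Api()

# Raise the replica floor and ceiling of an existing HPA ahead of an expected traffic peak.
patch = {"spec": {"minReplicas": 4, "maxReplicas": 40}}
autoscaling.patch_namespaced_horizontal_pod_autoscaler(
    name="voice-gateway-hpa",    # hypothetical HPA name
    namespace="voice-platform",  # hypothetical namespace
    body=patch,
)
```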
Observability & Incident Intelligence
- Build and maintain monitoring, logging and alerting systems using Prometheus, Grafana, Loki and OpenTelemetry.
- Define SLIs, SLOs and error budgets for platform and AI workloads (a worked example follows this list).
- Drive signal-based alerting to reduce noise and improve response quality.
- Implement anomaly detection and predictive alerting for infrastructure and AI pipelines.
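To make the error-budget bullet concrete, here is a minimal sketch of budget accounting for a 99.9% availability SLO over a 30-day window; the numbers are illustrative, and real SLIs would come from monitoring queries rather than hard-coded values:

```python
# Error-budget accounting for a 99.9% availability SLO over a 30-day rolling window.
SLO_TARGET = 0.999
WINDOW_MINUTES = 30 * 24 * 60


def error_budget_remaining(bad_minutes: float) -> float:
    """Fraction of the error budget still unspent for the current window."""
    budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES  # ~43.2 minutes at 99.9%
    return max(0.0, 1 - bad_minutes / budget_minutes)


if __name__ == "__main__":
    # Example: 12 SLO-violating minutes consumed so far this window.
    print(f"Error budget remaining: {error_budget_remaining(12):.1%}")
```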
CI/CD & Platform Automation
- Design and maintain CI/CD pipelines for services and infrastructure.
- Build internal automation tooling for:
  - Progressive and canary deployments
  - Auto-scaling and capacity planning
  - Faster incident diagnosis and recovery
- Enable self-service DevOps workflows for engineering teams.
MLOps & AI Platform Reliability
- Own reliability and performance of STT, TTS and LLM inference pipelines.
- Design provider routing, failover and SLA enforcement mechanisms (see the sketch after this list).
- Deploy, version and roll back AI models and inference services.
- Monitor inference latency, quality and drift in production systems.
- Operate GPU-backed inference workloads where applicable.
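Here is a minimal, hedged sketch of the provider-failover pattern referenced above for an STT call; the provider names and the call_provider() stub are hypothetical stand-ins for real vendor SDK clients:

```python
import time

PROVIDERS = ["primary-stt", "secondary-stt"]  # hypothetical priority order


def call_provider(name: str, audio: bytes) -> str:
    # Stand-in for a real STT/LLM client call; replace with the vendor SDK.
    return f"[transcript from {name}]"


def transcribe_with_failover(audio: bytes, timeout_s: float = 1.5) -> str:
    """Try providers in priority order, failing over on errors or latency-SLA breaches."""
    last_error = None
    for name in PROVIDERS:
        start = time.monotonic()
        try:
            text = call_provider(name, audio)
            if time.monotonic() - start <= timeout_s:  # crude per-call latency SLA check
                return text
            last_error = TimeoutError(f"{name} exceeded {timeout_s}s")
        except Exception as exc:  # network errors, quota exhaustion, etc.
            last_error = exc
    raise RuntimeError("all providers failed or breached the SLA") from last_error


if __name__ == "__main__":
    print(transcribe_with_failover(b"\x00\x01"))
```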
Security, Compliance & Resilience
- Enforce DevSecOps practices across build and deploy pipelines.
- Implement network policies, encryption, secrets management and access controls.
- Drive disaster recovery, backup strategies and resilience testing.
- Contribute to SOC 2 / ISO compliance and audits.
Collaboration & Engineering Excellence
- Partner with backend, AI and platform teams on architecture and reliability.
- Influence system design through a reliability-first mindset.
- Mentor junior engineers and raise the overall bar for operational excellence.
Qualifications
Must-Have
- 3–6 years of experience in DevOps, SRE or Platform Engineering roles.
- Strong hands-on experience with Kubernetes and Docker in production environments.
- Familiarity with public cloud platforms and managed Kubernetes services (such as GKE).
- Strong understanding of distributed systems and production debugging.
- Hands-on experience with observability systems.
- Proficiency with Infrastructure as Code (Terraform).
- Strong incident ownership and communication skills.
Good-to-Have
- Experience with MLOps or AI inference platforms.
- Familiarity with LLM pipelines, real-time streaming or telephony systems.
- Experience operating GPU workloads.
- Knowledge of AIOps, anomaly detection or intelligent alerting.
- Cloud cost optimization experience.
Why Blue Machines
- Build global-scale AI infrastructure from India.
- Operate real-time Voice AI systems with 14.5M+ minutes in production.
- Work on low-latency, high-reliability platforms.
- Grow from DevOps/SRE into MLOps and AI platform engineering.
- High ownership, deep technical impact and real production scale.