Job Description

About MasterControl

MasterControl Inc. is a leading provider of cloud-based quality and compliance software for life sciences and other regulated industries. Our mission is the same as that of our customers: to bring life-changing products to more people sooner. The MasterControl Platform helps organizations digitize, automate, and connect quality and compliance processes across the regulated product development life cycle. Over 1,000 companies worldwide rely on MasterControl solutions to achieve new levels of operational excellence across product development, clinical trials, regulatory affairs, quality management, supply chain, manufacturing, and postmarket surveillance. For more information, visit our website.


Summary

At MasterControl, we’re building our internal AI Platform to power intelligent, scalable, and compliant AI systems in regulated industries. We are seeking an experienced MLOps Engineer with deep infrastructure expertise to help us automate, monitor, and scale machine learning workloads across diverse environments.

This is not just about deploying models. You’ll help define the backbone of our AI pipeline: managing CI/CD, Kubernetes, observability, versioning, orchestration, inference workloads and performance. You’ll work closely with Machine Learning Researchers/Engineers, Data Engineers, and Platform teams to make our AI Services and Products production-ready, resilient, and fast.


What You’ll Do

  • Design and maintain infrastructure for training, evaluating, and deploying machine learning models at scale.
  • Manage and optimize containerized ML workloads and GPU orchestration on Kubernetes (EKS), including node autoscaling, bin-packing, taints/tolerations, runtime scheduling, and cost-aware scheduling strategies (e.g., spot/preemptible GPUs).
  • Build and optimize CI/CD pipelines for ML code, data versioning, and model artifacts using tools like GitHub Actions, Argo Workflows, and Terraform.
  • Develop and maintain observability for model and pipeline health (e.g., using Prometheus, Grafana, OpenTelemetry).
  • Collaborate with Data Scientists and ML Engineers to productionize notebooks, pipelines, and models.
  • Partner with security and compliance teams to implement best practices for model serving and data access.
  • Support inference backends, including vLLM, Hugging Face, NVIDIA Triton, and other runtime engines, and optimize GPU utilization.
  • Develop tools to simplify model deployment, rollback, and A/B testing for experimentation and reliability.
  • Lead incident response and debugging of performance issues in production AI systems.


What You’ll Bring

  • 5+ years of experience in MLOps, infrastructure, or platform engineering.
  • Experience setting up and scaling training and fine-tuning pipelines for ML models in production environments.
  • Strong expertise in Kubernetes, container orchestration, and cloud-native architecture (AWS preferred), specifically with GPUs.
  • Hands-on experience with training frameworks like PyTorch Lightning, Hugging Face Accelerate, or DeepSpeed.
  • Proficiency in infrastructure-as-code (Terraform, Helm, Kustomize) and cloud platforms (AWS preferred).
  • Familiarity with artifact tracking, experiment management, and model registries (e.g., MLflow, W&B, SageMaker Experiments).
  • Strong Python engineering skills and experience debugging ML workflows at scale.
  • Experience deploying and scaling inference workloads using modern ML frameworks.
  • Deep understanding of CI/CD systems and their role in ML production.
  • Working knowledge of monitoring and alerting systems for ML workloads.
  • A strong sense of ownership and commitment to quality, security, and operational excellence.


Nice to Have

  • Experience with GPU scheduling and autoscaling in Kubernetes.
  • Familiarity with model versioning and drift monitoring tools.
  • Knowledge of low-latency inference optimization (e.g., quantization, FP8, TensorRT).
  • Experience working in compliance or regulated industries.


Why Work Here?

#WhyWorkAnywhereElse?

MasterControl is a place where Exceptional Teams come together to do their best work. In fact, hiring Exceptional Teams is a core value of ours. MasterControl employees are surrounded by intelligent, motivated, and collaborative individuals. We like to call it #TheBestTeamOnThePlanet.

We work hard to develop and challenge our employees' skill sets, recognize their contributions, encourage professional development, and offer a one-of-a-kind culture. This is why we say #WhyWorkAnywhereElse? MasterControl could be your next (and last) career move!

Apply for this Position

Ready to join MasterControl? Click the button below to submit your application.

Submit Application