Job Description
Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!
We are looking for a highly motivated Machine Learning Operations Engineer with 3–4 years of experience building and deploying end-to-end ML products in production environments. The ideal candidate has a strong ML background in binary/multi-class classification, recommendation chatbot applications, and deploying training/inference pipelines, with hands-on experience in CI/CD, monitoring, and Kubernetes deployments.
Key Responsibilities:
· Design, build, and deploy robust ML pipelines for training, fine-tuning, and inference of models (NLP-focused: NER, Classification).
· Develop and maintain CI/CD workflows for ML pipelines using Jenkins or similar tools, ensuring rapid and safe deployment to production.
· Implement model monitoring and alerting systems to track performance degradation and drift in real time.
· Collaborate with cross-functional teams to retrain models on trigger events and integrate feedback loops into the ML lifecycle.
· Deploy ML pipelines to Kubernetes clusters using Helm, and optimize them for scalable, resilient operations.
· Use MLflow, Kubeflow, and related tools for experiment tracking, model versioning, and reproducibility.
· Write clean, efficient, and scalable code in Python using PyTorch and CUDA.
· Tune and optimize the performance of LLM applications in production.
Required Skills:
· Strong programming experience in Python and PyTorch.
· Hands-on experience with CI/CD pipelines using Jenkins.
· Proficient with Kubernetes for deploying and managing ML workloads.
· Experience with model training, fine-tuning, and inference pipeline development.
· Working knowledge of model monitoring and alerting systems (performance drift, latency, accuracy drop).
· Experience with MLflow, Kubeflow, and model versioning best practices.
· Solid understanding of NER, Text Classification, and common NLP tasks.
· Familiarity with CUDA for training models on GPU.
Good to Have:
· Experience with Generative AI systems in production.
· Prior experience building or deploying applications on hardware such as L40S, H100, and H200 GPUs.
· Familiarity with LangChain, LangGraph, LangSmith for building LLM-powered agents and applications.