Job Description
Role: ML Ops Lead Engineer
Job Mode: Remote
Experience: 7+ Years
Notice Period: Immediate / 10 to 15 Days
Experience Required:
7+ years in platform or infrastructure engineering with significant experience in ML Ops, AI, and Cloud (Azure & AWS).
Key Responsibilities:
- Design, deploy, and manage scalable, secure, and high-performing cloud-based infrastructures across Azure and AWS.
- Lead the end-to-end ML Ops lifecycle, including model deployment, monitoring, retraining, and CI/CD integration.
- Collaborate with AI/ML, Data Science, and DevOps teams to automate model lifecycle management and streamline ML workflows.
- Architect and implement governance, compliance, observability, and security frameworks for ML and GenAI systems.
- Drive innovation in Generative AI and Agentic AI ecosystems, integrating services like Azure OpenAI, Bedrock, Anthropic Claude, and the OpenAI API.
- Implement infrastructure-as-code (IaC) practices using Terraform, Bicep, ARM, or CloudFormation.
- Manage networking, IAM, and security configurations across Azure and AWS environments.
- Establish monitoring, alerting, and performance dashboards using Grafana, Prometheus, Azure Monitor, and Log Analytics.
Required Technical Skills:
Cloud Platforms:
- Azure: Azure AI Services, Azure Search, Azure ML, Databricks, AKS, Azure AI Foundry, Azure AI Hub.
- AWS: SageMaker, Bedrock, Lambda, ECS, CDK, CloudFormation.
AI/ML & Generative AI:
- Exposure to Generative and Agentic AI ecosystems (Azure OpenAI, Bedrock, Claude, LlamaCloud, LangChain).
- Understanding of token usage, prompt injection, jailbreak risks, and mitigation methods.
- Experience with Azure AI Evaluation SDK and AI Red Teaming Prompt Security Scans.
- Hands-on experience with Python ML libraries (TensorFlow, PyTorch, Scikit-learn).
DevOps & Automation:
- Strong experience with Azure DevOps / AWS CodePipeline for CI/CD setup and management.
- Familiarity with Docker, Kubernetes, and container orchestration.
- Knowledge of IaC tools (Terraform, ARM/Bicep, CloudFormation).
Database & Storage:
- Azure Blob Storage, Cosmos DB, SQL, Key Vault, Data Lake Storage.
- AWS S3, DynamoDB, RDS, Redshift, Aurora.
- Understanding of OLTP and OLAP systems.
Networking & Security:
- Proficiency in DNS, VPNs, Load Balancing, VNets, IAM, and access control (RBAC, SCP, Azure Policy).
- Familiarity with Microsoft AD and principles of least privilege.
- Hands-on with KMS, Key Vault, and identity governance best practices.
ML Engineering & Workflow Management:
- Experience using Azure Machine Learning Studio, SDK (v2), and CLI (v2) for model monitoring, retraining, and deployment.
- Build and optimize end-to-end ML workflows for production environments.
- Implement drift monitoring, model retraining, and technical & business validation processes.
- Collaborate with data scientists for model deployment and performance optimization.
Additional Skills (Good to Have):
- Experience with code assistant tools (GitHub Copilot, Cursor, Claude Code).
- Familiarity with Azure Bot Framework, APIM, Application Gateway.
- Exposure to M365 Copilot and related ecosystem tools.
- Proficiency with AWS Python SDK (Boto3) and AWS CDK.
Testing & Quality:
- Implement unit and integration testing in CI/CD workflows (preferably using ADO).
- Ensure testing and validation coverage for ML pipelines and infrastructure deployments.
Preferred Qualifications:
- Bachelor's or Master's in Computer Science, Information Technology, or a related field.
- Certification(s) in Azure AI Engineer, AWS Machine Learning Specialty, or DevOps are highly desirable.