Job Description

Role: ML Ops Lead Engineer

Job Mode: Remote

Experience: 7+ Years

Notice Period: Immediate, or 10 to 15 days


Experience Required:

7+ years in platform or infrastructure engineering with significant experience in ML Ops, AI, and Cloud (Azure & AWS).


Key Responsibilities:

  • Design, deploy, and manage scalable, secure, and high-performing cloud-based infrastructures across Azure and AWS.
  • Lead the end-to-end ML Ops lifecycle, including model deployment, monitoring, retraining, and CI/CD integration.
  • Collaborate with AI/ML, Data Science, and DevOps teams to automate model lifecycle management and streamline ML workflows.
  • Architect and implement governance, compliance, observability, and security frameworks for ML and GenAI systems.
  • Drive innovation in Generative AI and Agentic AI ecosystems, integrating services such as Azure OpenAI, Bedrock, Anthropic Claude, and the OpenAI API.
  • Implement infrastructure-as-code (IaC) practices using Terraform, Bicep, ARM, or CloudFormation.
  • Manage networking, IAM, and security configurations across Azure and AWS environments.
  • Establish monitoring, alerting, and performance dashboards using Grafana, Prometheus, Azure Monitor, and Log Analytics.


Required Technical Skills:


Cloud Platforms:

  • Azure: Azure AI Services, Azure Search, Azure ML, Databricks, AKS, Azure AI Foundry, Azure AI Hub.
  • AWS: SageMaker, Bedrock, Lambda, ECS, CDK, CloudFormation.

AI/ML & Generative AI:

  • Exposure to Generative and Agentic AI ecosystems (Azure OpenAI, Bedrock, Claude, LlamaCloud, LangChain).
  • Understanding of token usage, prompt injection, jailbreak risks, and mitigation methods.
  • Experience with the Azure AI Evaluation SDK and AI Red Teaming prompt security scans.
  • Hands-on experience with Python ML libraries (TensorFlow, PyTorch, Scikit-learn).

DevOps & Automation:

  • Strong experience with Azure DevOps / AWS CodePipeline for CI/CD setup and management.
  • Familiarity with Docker, Kubernetes, and container orchestration.
  • Knowledge of IaC tools (Terraform, ARM/Bicep, CloudFormation).

Database & Storage:

  • Azure Blob Storage, Cosmos DB, SQL, Key Vault, Data Lake Storage.
  • AWS S3, DynamoDB, RDS, Redshift, Aurora.
  • Understanding of OLTP and OLAP systems.

Networking & Security:

  • Proficiency in DNS, VPNs, load balancing, VNets, IAM, and access control (RBAC, SCP, Azure Policy).
  • Familiarity with Microsoft Active Directory (AD) and the principle of least privilege.
  • Hands-on experience with KMS, Key Vault, and identity governance best practices.

ML Engineering & Workflow Management:

  • Experience using Azure Machine Learning Studio, SDK (v2), and CLI (v2) for model monitoring, retraining, and deployment.
  • Build and optimize end-to-end ML workflows for production environments.
  • Implement drift monitoring, model retraining, and technical and business validation processes.
  • Collaborate with data scientists for model deployment and performance optimization.

Additional Skills (Good to Have):

  • Experience with code assistant tools (GitHub Copilot, Cursor, Claude Code).
  • Familiarity with Azure Bot Framework, APIM, and Application Gateway.
  • Exposure to M365 Copilot and related ecosystem tools.
  • Proficiency with the AWS Python SDK (Boto3) and AWS CDK.

Testing & Quality:

  • Implement unit and integration testing in CI/CD workflows (preferably using Azure DevOps).
  • Ensure testing and validation coverage for ML pipelines and infrastructure deployments.

Preferred Qualifications:

  • Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
  • Certification(s) in Azure AI Engineer, AWS Machine Learning Specialty, or DevOps are highly desirable.
