Job Description

GSPANN is hiring a Senior AI/ML Operations Engineer. The role focuses on building AIOps/MLOps systems and automating ML pipelines.

Role and Responsibilities

  • Architect and drive the implementation of scalable Artificial Intelligence for IT Operations (AIOps) and Machine Learning Operations (MLOps) frameworks.
  • Mentor junior engineers and data scientists by sharing best practices in model deployment and operational excellence.
  • Align technical strategies with business objectives through close collaboration with product managers, Site Reliability Engineers (SREs), and other key stakeholders.
  • Establish and uphold engineering standards, including Service-Level Agreements (SLAs), Service-Level Indicators (SLIs), and Service-Level Objectives (SLOs) for machine learning and AIOps services.
  • Design and manage Machine Learning (ML) CI/CD (Continuous Integration/Continuous Deployment) pipelines for model training, testing, deployment, and monitoring using tools such as Kubeflow, MLflow, and Apache Airflow.
  • Implement robust monitoring systems to track model performance metrics like drift, latency, and accuracy, and automate retraining workflows where necessary.
  • Lead model governance efforts by ensuring reproducibility, traceability, and compliance with frameworks such as FAIR (Findable, Accessible, Interoperable, Reusable), and maintaining audit logs.
  • Build AI/ML-powered solutions for proactive infrastructure monitoring, predictive alerting, and intelligent incident resolution.
  • Enhance anomaly detection and root cause analysis by integrating and optimizing observability tools such as Prometheus, Grafana, ELK (Elasticsearch, Logstash, Kibana), Dynatrace, Splunk, and Datadog.
  • Automate response workflows using predefined playbooks, runbooks, and self-healing systems.
  • Apply statistical techniques and machine learning models to analyze logs, metrics, and distributed traces at scale.

Skills and Experience

  • Bachelor’s or Master’s degree in Computer Science, Data Engineering, Artificial Intelligence, Machine Learning, or a related field.
  • Certifications in AWS/GCP DevOps, Kubernetes, or MLOps are desirable.
  • 6+ years of hands-on experience in DevOps, MLOps, or AIOps, including at least 2 years in a leadership or senior engineering capacity.
  • Expert-level coding skills in Python and Bash, with working knowledge of Go or Java.
  • Experience using Docker for containerization and Kubernetes for orchestration across major cloud providers such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure.
  • Experience with CI/CD tools and infrastructure-as-code technologies such as Terraform, Ansible, and Helm.
  • In-depth knowledge of ML lifecycle management, performance monitoring, and pipeline orchestration.
  • Experience maintaining large-scale observability and telemetry platforms.
  • Experience with streaming data technologies, including Apache Kafka, Apache Spark, and Apache Flink.
  • Experience managing service mesh architectures such as Istio or Linkerd to ensure secure and efficient service communication.
  • Understanding of data privacy and regulatory standards, including the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA).