Job Description

REQUIREMENTS:

  • Experience: 7.5+ years
  • 10-12 years in infrastructure, platform, DevOps, or MLOps roles
  • Strong experience with cloud platforms (AWS/GCP/Azure) and Kubernetes
  • Hands-on experience deploying and operating LLMs (OpenAI, Anthropic, open-source models)
  • Proficiency with GPU infrastructure, model serving frameworks, and vector databases
  • Strong programming skills in Python; experience with Bash/Go is a plus
  • Experience with monitoring, logging, and performance tuning for distributed systems
PREFERRED QUALIFICATIONS:

  • Experience with LLM fine-tuning, RAG pipelines, and prompt/version management
  • Familiarity with tools like Terraform, Helm, Argo, Ray, or similar
  • Exposure to cost optimization strategies for large-scale AI systems
RESPONSIBILITIES:

  • Design and manage scalable infrastructure for training, fine-tuning, serving, and monitoring LLMs
  • Build and maintain LLMOps pipelines (deployment, versioning, rollback, monitoring, evaluation)
  • Optimize inference performance (latency, throughput, cost) across GPU/accelerator stacks
  • Implement CI/CD, IaC, and automation for AI/ML workloads
  • Ensure observability, reliability, and governance of LLM systems in production
  • Collaborate with ML, platform, and product teams to operationalize AI solutions
  • Manage security, compliance, and access control for model and data pipelines
QUALIFICATIONS:

    Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
