Job Description
REQUIREMENTS:
Experience: 7.5+ years
- 10-12 years in infrastructure, platform, DevOps, or MLOps roles
- Strong experience with cloud platforms (AWS/GCP/Azure) and Kubernetes
- Hands-on experience deploying and operating LLMs (OpenAI, Anthropic, open-source models)
- Proficiency with GPU infrastructure, model serving frameworks, and vector databases
- Strong programming skills in Python; experience with Bash/Go is a plus
- Experience with monitoring, logging, and performance tuning for distributed systems

Preferred Qualifications:
- Experience with LLM fine-tuning, RAG pipelines, and prompt/version management
- Familiarity with tools like Terraform, Helm, Argo, Ray, or similar
- Exposure to cost optimization strategies for large-scale AI systems

Responsibilities:
- Design and manage scalable infrastructure for training, fine-tuning, serving, and monitoring LLMs
- Build and maintain LLMOps pipelines (deployment, versioning, rollback, monitoring, evaluation)
- Optimize inference performance (latency, throughput, cost) across GPU/accelerator stacks
- Implement CI/CD, IaC, and automation for AI/ML workloads
- Ensure observability, reliability, and governance of LLM systems in production
- Collaborate with ML, platform, and product teams to operationalize AI solutions
- Manage security, compliance, and access control for model and data pipelines

Qualifications:
- Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.