Job Description
Requirements:
- Experience: 7.5+ years
- 10-12 years in infrastructure, platform, DevOps, or MLOps roles
- Strong experience with cloud platforms (AWS/GCP/Azure) and Kubernetes
- Hands-on experience deploying and operating LLMs (OpenAI, Anthropic, open-source models)
- Proficiency with GPU infrastructure, model serving frameworks, and vector databases
- Strong programming skills in Python; experience with Bash/Go is a plus
- Experience with monitoring, logging, and performance tuning for distributed systems
Preferred Qualifications:
- Experience with LLM fine-tuning, RAG pipelines, and prompt/version management
- Familiarity with tools like Terraform, Helm, Argo, Ray, or similar
- Exposure to cost optimization strategies for large-scale AI systems
Responsibilities:
- Design and manage scalable infrastructure for training, fine-tuning, serving, and monitoring LLMs
- Build and maintain LLMOps pipelines (deployment, versioning, rollback, monitoring, evaluation)
- Optimize inference performance (latency, throughput, cost) across GPU/accelerator stacks
- Implement CI/CD, IaC, and automation for AI/ML workloads
- Ensure observability, reliability, and governance of LLM systems in production
- Collaborate with ML, platform, and product teams to operationalize AI solutions
- Manage security, compliance, and access control for model and data pipelines
Education:
Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.