Job Description
REQUIREMENTS:
Experience: 7.5+ years

- 10-12 years in infrastructure, platform, DevOps, or MLOps roles
- Strong experience with cloud platforms (AWS/GCP/Azure) and Kubernetes
- Hands-on experience deploying and operating LLMs (OpenAI, Anthropic, open-source models)
- Proficiency with GPU infrastructure, model serving frameworks, and vector databases
- Strong programming skills in Python; experience with Bash/Go is a plus
- Experience with monitoring, logging, and performance tuning for distributed systems

Preferred Qualifications:

- Experience with LLM fine-tuning, RAG pipelines, and prompt/version management
- Familiarity with tools like Terraform, Helm, Argo, Ray, or similar
- Exposure to cost optimization strategies for large-scale AI systems

Responsibilities:
- Design and manage scalable infrastructure for training, fine-tuning, serving, and monitoring LLMs
- Build and maintain LLMOps pipelines (deployment, versioning, rollback, monitoring, evaluation)
- Optimize inference performance (latency, throughput, cost) across GPU/accelerator stacks
- Implement CI/CD, IaC, and automation for AI/ML workloads
- Ensure observability, reliability, and governance of LLM systems in production
- Collaborate with ML, platform, and product teams to operationalize AI solutions
- Manage security, compliance, and access control for model and data pipelines

Qualifications
Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
Apply for this Position
Ready to join? Click the button below to submit your application.