Job Description
IndiaAI is building India's next-gen foundational LLMs. We're looking for a hands-on Senior ML Engineer experienced in large-scale pre-training, distributed GPU systems, and data creation pipelines. You will work with Megatron-LM, NVIDIA NeMo, DeepSpeed, PyTorch Distributed, and SLURM to train 7B–70B+ models on multi-node GPU clusters.
What You'll Do
Build & optimize LLM pre-training pipelines (7B–70B+).
Implement distributed training using PyTorch Distributed (FSDP), DeepSpeed (ZeRO), Megatron-LM, and NVIDIA NeMo.
Manage multi-node GPU jobs via SLURM and optimize NCCL communication.
Lead large-scale data creation, cleaning, deduplication, tokenization & sharding for multilingual datasets (with a focus on Indian languages).
Build high-throughput dataloaders, monitoring dashboards & training workflows.
Collaborate with infra teams to optimize GPU utilization, networking, and storage systems.
What You Bring
5+ years in ML Engineering / DL Systems.
Prior experi...