Job Description
We are hiring an engineer who has personally built and optimized distributed training systems for large AI models and has deep, real-world experience optimizing GPU workloads specifically on Google Cloud.
This is not a research role or a general ML engineering role, and it is not cloud-agnostic.
Core Responsibilities
Distributed Training (Foundation-Scale)
- Build and operate multi-node, multi-GPU distributed training systems (16–128+ GPUs).
- Implement and tune:
  - PyTorch Distributed (DDP, FSDP, TorchElastic)
  - DeepSpeed (ZeRO-2 / ZeRO-3, CPU/NVMe offload)
  - Hybrid parallelism (data, tensor, pipeline)
- Create reusable distributed training frameworks and templates for large models.
- Handle checkpoint sharding, failure recovery, and elastic scaling.
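As a minimal sketch of the baseline this role builds on, the snippet below wraps a model in PyTorch `DistributedDataParallel` and writes a rank-0 checkpoint. It uses the `gloo` backend in a single process so it runs anywhere; a real multi-node job would use `nccl`, launch via `torchrun`, and shard checkpoints rather than saving a single file.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process sketch: in a real job, torchrun sets RANK, WORLD_SIZE,
# MASTER_ADDR, and MASTER_PORT for every worker.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)  # "nccl" on GPU nodes

model = torch.nn.Linear(16, 4)
ddp_model = DDP(model)  # gradients are all-reduced across ranks in backward()
opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

x = torch.randn(8, 16)
loss = ddp_model(x).pow(2).mean()
loss.backward()  # triggers the cross-rank gradient all-reduce
opt.step()

# Only rank 0 writes the checkpoint; other ranks would skip the write.
if dist.get_rank() == 0:
    torch.save(ddp_model.module.state_dict(), "/tmp/ckpt.pt")
dist.destroy_process_group()
```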
GPU Optimization (Google Cloud Only)
- Optimize GPU utilization and cost on Google Cloud GPUs:
  - A100, H100, L4
- Achieve high utilization through:
  - Mixed precision (FP16 / BF16)
  - Gradient checkpointing
  - Memory optimization and recomputation
- Tune NCCL communication (All-Reduce, All-Gather) for multi-node GCP clusters.
- Reduce GPU idle time and cost per training run.
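To illustrate two of the techniques above, here is a minimal PyTorch sketch combining BF16 autocast with activation (gradient) checkpointing. It uses `device_type="cpu"` so it runs without a GPU; on A100/H100 the device type would be `"cuda"` and the model would be moved to the device first.

```python
import torch
from torch.utils.checkpoint import checkpoint

model = torch.nn.Sequential(
    torch.nn.Linear(32, 32), torch.nn.ReLU(), torch.nn.Linear(32, 32)
)
x = torch.randn(4, 32, requires_grad=True)

# Mixed precision: matmuls run in bfloat16 under autocast, while
# numerically sensitive ops stay in float32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # Gradient checkpointing: activations inside `model` are not stored
    # during forward; they are recomputed during backward, trading extra
    # FLOPs for a smaller activation-memory footprint.
    y = checkpoint(model, x, use_reentrant=False)
    loss = y.float().mean()

loss.backward()
```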
Google Cloud Execution
- Run and optimize training jobs using:
  - Vertex AI custom training
  - GKE with GPU node pools
  - Compute Engine GPU VMs
- Optimize GPU scheduling, scaling, and placement.
- Use preemptible (Spot) GPU capacity safely for large training jobs.
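For context, a multi-node A100 custom training job on Vertex AI is typically submitted along these lines; the project, region, image URI, and machine shapes below are illustrative placeholders, not a prescribed configuration.

```shell
# Illustrative only: replace region, display name, replica counts,
# and the container image URI with real values.
gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name=llm-pretrain-sketch \
  --worker-pool-spec=machine-type=a2-highgpu-8g,replica-count=4,accelerator-type=NVIDIA_TESLA_A100,accelerator-count=8,container-image-uri=us-docker.pkg.dev/my-project/train/llm:latest
```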
Performance Profiling
- Profile and debug GPU workloads using:
  - NVIDIA Nsight Systems / Compute
  - DCGM
- Identify compute, memory, and communication bottlenecks.
- Produce performance benchmarks and optimization reports.
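A typical profiling pass with these tools looks roughly like the following; the script name and DCGM field IDs are illustrative and would be adapted per job.

```shell
# Trace CUDA kernels and NVTX ranges for one training run (Nsight Systems).
nsys profile -t cuda,nvtx -o train_profile python train.py
nsys stats train_profile.nsys-rep

# Sample per-GPU telemetry with DCGM while the job runs
# (field IDs illustrative, e.g. GPU utilization and framebuffer usage).
dcgmi dmon -e 203,252 -d 1000
```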
Required Experience (Recruiter Screening Criteria)
Must-Have Experience (Non-Negotiable)
- 8+ years in ML systems, distributed systems, or HPC
- Hands-on experience scaling multi-node GPU training (16+ GPUs)
- Deep expertise in:
  - PyTorch Distributed
  - DeepSpeed
  - NCCL
- Direct production experience on Google Cloud GPUs
- Proven record of GPU performance and cost optimization
Strongly Preferred
- Experience training foundation models / LLM-scale models
- Experience with Vertex AI + GKE
- Experience optimizing GPU workloads at enterprise scale
Apply for this Position
Ready to join? Click the button below to submit your application.
Submit Application