Job Description
We are hiring an engineer who has personally built and optimized distributed training systems for large AI models, and who has deep, hands-on experience tuning GPU workloads specifically on Google Cloud.
This is not a research role, not a general ML engineering role, and not a cloud-agnostic one.
Core Responsibilities
Distributed Training (Foundation-Scale)
- Build and operate multi-node, multi-GPU distributed training systems (16–128+ GPUs).
- Implement and tune (see the sketch after this list):
  - PyTorch Distributed (DDP, FSDP, TorchElastic)
  - DeepSpeed (ZeRO-2 / ZeRO-3, CPU/NVMe offload)
  - Hybrid parallelism (data, tensor, pipeline)
- Create reusable distributed training frameworks and templates for large models.
- Handle checkpoint sharding, failure recovery, and elastic scaling.
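To give a concrete sense of the expected baseline, here is a minimal multi-GPU FSDP training sketch. Everything in it (model, shapes, hyperparameters) is an illustrative placeholder, not a prescribed stack:

```python
# Minimal FSDP training loop, intended to run under torchrun.
# All model and hyperparameter choices below are illustrative placeholders.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real job would construct the large model here.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # similar in spirit to DeepSpeed ZeRO-3.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):  # placeholder loop; real training streams a dataset
        batch = torch.randn(8, 1024, device="cuda")
        loss = model(batch).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if dist.get_rank() == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A job like this would typically be launched per node with `torchrun --nnodes=2 --nproc_per_node=8 train.py`; torchrun's `--max-restarts` option provides the elastic, fault-tolerant restarts mentioned above.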
GPU Optimization (Google Cloud Only)
- Optimize GPU utilization and cost on Google Cloud GPUs: …
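One basic ingredient of utilization and cost work, on any cloud, is measuring whether the GPUs you pay for are actually busy. A minimal sampling sketch, assuming the `pynvml` NVML bindings are installed:

```python
# Samples SM and memory utilization across all visible GPUs.
# pynvml is an assumed dependency here; any NVML binding would do.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

for _ in range(12):  # roughly one minute of 5-second samples
    for i, h in enumerate(handles):
        util = pynvml.nvmlDeviceGetUtilizationRates(h)
        mem = pynvml.nvmlDeviceGetMemoryInfo(h)
        print(f"gpu{i}: sm={util.gpu}% vram={mem.used / mem.total:.0%}")
    time.sleep(5)

pynvml.nvmlShutdown()
```

Sustained low SM utilization on an expensive GPU VM is the usual first signal that a job is input-bound or over-provisioned, and therefore a cost target.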