Job Description

We are hiring an engineer who has personally built and optimized distributed training systems for large AI models, with deep, hands-on experience tuning GPU workloads on Google Cloud.

This is not a research role, not a general ML engineering role, and not a cloud-agnostic position.

Core Responsibilities

Distributed Training (Foundation-Scale)

  • Build and operate multi-node, multi-GPU distributed training systems (16–128+ GPUs).
  • Implement and tune:
      • PyTorch Distributed (DDP, FSDP, TorchElastic)
      • DeepSpeed (ZeRO-2 / ZeRO-3, CPU/NVMe offload)
      • Hybrid parallelism (data, tensor, pipeline)
  • Create reusable distributed training frameworks and templates for large models.
  • Handle checkpoint sharding, failure recovery, and elastic scaling.
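To make the checkpoint-sharding responsibility above concrete, here is a minimal sketch of splitting a model state dict into per-rank shards and reassembling it after recovery. The helper names (`shard_state_dict`, `merge_shards`) and the round-robin scheme are illustrative assumptions; a production system would build on `torch.distributed.checkpoint` or DeepSpeed's checkpoint engine rather than hand-rolled code.

```python
# Illustrative sketch: round-robin sharding of a checkpoint across ranks.
# Helper names are hypothetical; real systems use torch.distributed.checkpoint.

def shard_state_dict(state_dict, world_size):
    """Split a flat parameter dict into one shard per rank (round-robin by key)."""
    shards = [{} for _ in range(world_size)]
    for i, (name, tensor) in enumerate(sorted(state_dict.items())):
        shards[i % world_size][name] = tensor
    return shards

def merge_shards(shards):
    """Reassemble the full state dict from per-rank shards, e.g. after failure recovery."""
    merged = {}
    for shard in shards:
        merged.update(shard)
    return merged

full = {"layer0.weight": [1.0], "layer0.bias": [0.1], "layer1.weight": [2.0]}
shards = shard_state_dict(full, world_size=2)
assert merge_shards(shards) == full  # the shard/merge round trip is lossless
```

The key design point the sketch captures is that sharding must be deterministic (hence the sorted keys), so any rank can recompute which shard owns which parameter during elastic rescaling.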

GPU Optimization (Google Cloud Only)

  • Optimize GPU utilization and cost on Google Cloud GPUs:
      • <...
