Job Description

We are hiring an engineer who has personally built and optimized distributed training systems for large AI models and has deep, real-world experience optimizing GPU workloads specifically on Google Cloud.

This is not a research role or a general ML engineering role, and it is not cloud-agnostic.


Core Responsibilities

Distributed Training (Foundation-Scale)

  • Build and operate multi-node, multi-GPU distributed training systems (16–128+ GPUs).
  • Implement and tune (see the sketch below):
      • PyTorch Distributed (DDP, FSDP, TorchElastic)
      • DeepSpeed (ZeRO-2 / ZeRO-3, CPU/NVMe offload)
      • Hybrid parallelism (data, tensor, pipeline)
  • Create reusable distributed training frameworks and templates for large models.
  • Handle checkpoint sharding, failure recovery, and elastic scaling.
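
For a flavor of the work, here is a minimal multi-node FSDP sketch; the model, hyperparameters, and launch settings are placeholders, not a prescribed setup:

```python
# Minimal multi-node FSDP sketch; launch with, e.g.:
#   torchrun --nnodes=2 --nproc-per-node=8 train.py
# The model and training data below are placeholders.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")      # NCCL for GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(                 # stand-in for a large model
        torch.nn.Linear(4096, 4096), torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()
    model = FSDP(model)                          # shards params, grads, optimizer state

    # Optimizer must be created after FSDP wrapping.
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step in range(10):                       # synthetic batches
        x = torch.randn(8, 4096, device="cuda")
        loss = model(x).square().mean()
        loss.backward()
        optim.step()
        optim.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```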

GPU Optimization (Google Cloud Only)

  • Optimize GPU utilization and cost on Google Cloud GPUs (A100, H100, L4).
  • Achieve high utilization through (see the sketch below):
      • Mixed precision (FP16 / BF16)
      • Gradient checkpointing
      • Memory optimization and recomputation
  • Tune NCCL communication (All-Reduce, All-Gather) for multi-node GCP clusters.
  • Reduce GPU idle time and cost per training run.
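
For example, BF16 mixed precision combined with gradient checkpointing in PyTorch might look like the following sketch (the module and tensor shapes are illustrative):

```python
# BF16 mixed precision plus activation (gradient) checkpointing in PyTorch.
# The checkpointed block and all shapes are illustrative.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()

x = torch.randn(8, 4096, device="cuda", requires_grad=True)

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    # Recompute the block's activations during backward instead of storing
    # them, trading extra compute for a smaller memory footprint.
    y = checkpoint(block, x, use_reentrant=False)
    loss = y.float().mean()

loss.backward()
```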

Google Cloud Execution

  • Run and optimize training jobs using (see the sketch below):
      • Vertex AI custom training
      • GKE with GPU node pools
      • Compute Engine GPU VMs
  • Optimize GPU scheduling, scaling, and placement.
  • Use preemptible GPUs safely for large training jobs.
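
As one example, submitting a multi-node GPU job through the Vertex AI Python SDK could look like this sketch; the project, bucket, image URI, and machine settings are placeholders:

```python
# Submitting a multi-node GPU custom training job on Vertex AI.
# Project, region, bucket, image URI, and machine settings are placeholders.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

job = aiplatform.CustomContainerTrainingJob(
    display_name="fsdp-training",
    container_uri="us-docker.pkg.dev/my-project/train/image:latest",
)

job.run(
    machine_type="a2-highgpu-8g",        # 8x A100 per replica
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=8,
    replica_count=2,                     # multi-node training
)
```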

Performance Profiling

  • Profile and debug GPU workloads using (see the sketch below):
      • NVIDIA Nsight Systems / Compute
      • DCGM
  • Identify compute, memory, and communication bottlenecks.
  • Produce performance benchmarks and optimization reports.
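
Alongside Nsight and DCGM, an in-process profiling pass with torch.profiler might look like this (the workload is a placeholder):

```python
# Profiling a GPU workload with torch.profiler; the workload is a placeholder.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as prof:
    for _ in range(10):
        model(x).sum().backward()

# Surface the most expensive kernels by GPU time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```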


Required Experience (Recruiter Screening Criteria)

Must-Have Experience (Non-Negotiable)

  • 8+ years in ML systems, distributed systems, or HPC
  • Hands-on experience scaling multi-node GPU training (16+ GPUs)
  • Deep expertise in:
      • PyTorch Distributed
      • DeepSpeed
      • NCCL
  • Direct production experience on Google Cloud GPUs
  • Proven record of GPU performance and cost optimization

Strongly Preferred

  • Experience training foundation models / LLM-scale models
  • Experience with Vertex AI + GKE
  • Experience optimizing GPU workloads at enterprise scale

Apply for this Position

Ready to join? Submit your application today.
