Job Description
We are hiring an engineer who has personally built and optimized distributed training systems for large AI models and who has deep, hands-on experience optimizing GPU workloads specifically on Google Cloud.
This is not a research role, not a general ML engineer role, and not a cloud-agnostic role.
Core Responsibilities
Distributed Training (Foundation-Scale)
Build and operate multi-node, multi-GPU distributed training systems (16–128+ GPUs).
Implement and tune (a representative sketch follows below):
PyTorch Distributed (DDP, FSDP, TorchElastic)
DeepSpeed (ZeRO-2 / ZeRO-3, CPU/NVMe offload)
Hybrid parallelism (data, tensor, pipeline)
Create reusable distributed training frameworks and templates for large models.
Handle checkpoint sharding, failure recovery, and elastic scaling.
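For illustration only, here is a minimal sketch of the kind of multi-node, multi-GPU training loop this role owns, assuming a PyTorch environment launched with torchrun; the model, data, and hyperparameters are hypothetical placeholders, not a prescribed implementation.

    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    def main():
        # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR/PORT per worker.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Placeholder model; a real job would build an LLM-scale architecture here.
        model = torch.nn.Sequential(
            torch.nn.Linear(4096, 4096),
            torch.nn.GELU(),
            torch.nn.Linear(4096, 4096),
        ).cuda(local_rank)

        # FSDP shards parameters, gradients, and optimizer state across ranks,
        # which is what keeps 16-128+ GPU jobs within per-device memory limits.
        model = FSDP(model)
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

        for step in range(10):  # placeholder loop; real jobs iterate a sharded dataset
            batch = torch.randn(8, 4096, device=local_rank)
            loss = model(batch).pow(2).mean()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

A job like this would typically be launched with something like torchrun --nnodes=2 --nproc_per_node=8 train.py; sharded checkpointing and elastic restart logic would wrap this loop in production.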
GPU Optimization (Google Cloud Only)
Optimize GPU utilization and cost on Google Cloud GPUs:
A100, H100, L4
Achieve high utilization through (see the sketch below):
Mixed precision (FP16 / BF16)
Gradient checkpointing
Memory optimization and recomputation
Tune NCCL communication (All-Reduce, All-Gather) for multi-node GCP clusters.
Reduce GPU idle time and cost per training run.
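As one hedged illustration of the techniques above (not a prescribed implementation): a BF16 autocast step with activation checkpointing, plus NCCL environment variables that are commonly tuned on multi-node clusters. The variable values and network interface name are assumptions, not verified GCP settings.

    import os
    import torch
    from torch.utils.checkpoint import checkpoint

    # NCCL knobs often inspected or tuned for multi-node all-reduce/all-gather;
    # exact values are workload- and fabric-dependent and shown only as placeholders.
    os.environ.setdefault("NCCL_DEBUG", "WARN")
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # assumption: primary NIC name

    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    def block(x):
        # Gradient checkpointing: recompute this block's activations during the
        # backward pass instead of storing them, trading FLOPs for memory.
        return model(x)

    x = torch.randn(8, 4096, device="cuda", requires_grad=True)

    # BF16 mixed precision: matmuls run in bfloat16 while parameters stay FP32.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        out = checkpoint(block, x, use_reentrant=False)
        loss = out.float().pow(2).mean()

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()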
Google Cloud Execution
Run and optimize training jobs using (see the example below):
Vertex AI custom training
GKE with GPU node pools
Compute Engine GPU VMs
Optimize GPU scheduling, scaling, and placement.
Use preemptible GPUs safely for large training jobs.
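Below is a minimal, hedged sketch of submitting such a job as a Vertex AI custom training job from the Python SDK; the project, region, bucket, and container image URI are hypothetical placeholders, and an equivalent job could instead be scheduled on GKE GPU node pools or plain Compute Engine GPU VMs.

    from google.cloud import aiplatform

    # All identifiers below are placeholders, not real resources.
    aiplatform.init(
        project="example-project",
        location="us-central1",
        staging_bucket="gs://example-staging-bucket",
    )

    job = aiplatform.CustomContainerTrainingJob(
        display_name="llm-pretrain-fsdp",
        container_uri="us-docker.pkg.dev/example-project/train/llm-trainer:latest",
    )

    # One pool of A100 machines; multi-node jobs raise replica_count and rely on
    # the NCCL/TorchElastic setup baked into the container image.
    job.run(
        replica_count=2,
        machine_type="a2-highgpu-8g",
        accelerator_type="NVIDIA_TESLA_A100",
        accelerator_count=8,
    )

Preemptible/Spot capacity, checkpoint-resume on preemption, and placement or scheduling policies would layer on top of this baseline and are deliberately not shown.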
Performance Profiling
Profile and debug GPU workloads using (see the annotation sketch below):
NVIDIA Nsight Systems / Compute
DCGM
Identify compute, memory, and communication bottlenecks.
Produce performance benchmarks and optimization reports.
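As one possible workflow (an assumption, not a mandated toolchain): NVTX ranges added around training phases so that Nsight Systems timelines separate forward, backward, and optimizer time when the script is run under nsys profile; DCGM would supply fleet-level utilization metrics alongside it.

    import torch
    from torch.cuda import nvtx

    model = torch.nn.Linear(4096, 4096).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    def training_step(batch):
        # Each named range appears as a labeled span on the Nsight Systems timeline
        # when the job is launched under `nsys profile python train.py`.
        nvtx.range_push("forward")
        loss = model(batch).pow(2).mean()
        nvtx.range_pop()

        nvtx.range_push("backward")
        loss.backward()
        nvtx.range_pop()

        nvtx.range_push("optimizer")
        optimizer.step()
        optimizer.zero_grad()
        nvtx.range_pop()
        return loss

    for _ in range(5):
        batch = torch.randn(8, 4096, device="cuda")  # placeholder input
        training_step(batch)
    torch.cuda.synchronize()  # flush pending kernels so the profile ends cleanly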
Required Experience (Recruiter Screening Criteria)
Must-Have Experience (Non-Negotiable)
8+ years in ML systems, distributed systems, or HPC
Hands-on experience scaling multi-node GPU training (16+ GPUs)
Deep expertise in:
PyTorch Distributed
DeepSpeed
NCCL
Direct production experience on Google Cloud GPUs
Proven record of GPU performance and cost optimization
Strongly Preferred
Experience training foundation models / LLM-scale models
Experience with Vertex AI + GKE
Experience optimizing GPU workloads at enterprise scale