Job Description

Dear all,
We are looking for a GPU Infrastructure Specialist to manage and optimize GPU-based environments for model hosting and high-performance computing workloads. The ideal candidate will have hands-on experience with the NVIDIA, AMD, and SambaNova GPU ecosystems, and a strong background in resource management, performance tuning, and observability within large-scale AI/ML environments.

Responsibilities

Manage, configure, and maintain GPU infrastructure across on-premise and cloud environments.
Handle GPU resource allocation, scheduling, and orchestration for AI/ML workloads.
Oversee driver updates, operator management, and compatibility across multiple GPU vendors (NVIDIA, AMD, SambaNova).
Implement GPU tuning and performance optimization strategies to ensure efficient model inference and training performance.
Monitor GPU utilization, latency, and system health using observability and alerting tools (e.g., Prometheus, Grafana, NVIDIA DCGM).
Collaborate with AI engineers, DevOps, and MLOps teams to ensure seamless model deployment and hosting across GPU clusters.
Develop automation scripts and workflows for GPU provisioning, scaling, and lifecycle management.
Troubleshoot GPU performance issues, memory bottlenecks, and hardware-level anomalies.

Qualifications

Strong experience managing GPU infrastructure (NVIDIA, AMD, SambaNova).
Proficiency in resource scheduling and orchestration (Kubernetes, Slurm, Ray, or similar).
Knowledge of driver and operator management in multi-vendor environments.
Experience with GPU tuning, profiling, and performance benchmarking.
Familiarity with observability and alerting tools (Prometheus, Grafana, ELK Stack, etc.).
Hands-on experience with model hosting platforms (Triton Inference Server, TensorRT, ONNX Runtime, etc.) is a plus.
Working knowledge of Linux systems, Docker/Kubernetes, and CI/CD pipelines.
Strong scripting skills in Python, Bash, or Go.

Preferred Skills

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
Certifications in GPU computing (e.g., NVIDIA Certified Administrator, CUDA, or similar).
Experience with AI/ML model lifecycle management in production environments.
