Job Description

Dear all,

We are looking for a GPU Infrastructure Specialist to manage and optimize GPU-based environments for model hosting and high-performance computing workloads. The ideal candidate will have hands-on experience with NVIDIA, AMD, and SambaNova GPU ecosystems, along with a strong background in resource management, performance tuning, and observability in large-scale AI/ML environments.



Responsibilities


  • Manage, configure, and maintain GPU infrastructure across on-premises and cloud environments.
  • Handle GPU resource allocation, scheduling, and orchestration for AI/ML workloads.
  • Oversee driver updates, operator management, and compatibility across multiple GPU vendors (NVIDIA, AMD, SambaNova).
  • Implement GPU tuning and optimization strategies to ensure efficient model inference and training performance.
  • Monitor GPU utilization, latency, and system health using observability and alerting tools (e.g., Prometheus, Grafana, NVIDIA DCGM).
  • Collaborate with AI engineers, DevOps, and MLOps teams to ensure seamless model deployment and hosting across GPU clusters.
  • Develop automation scripts and workflows for GPU provisioning, scaling, and lifecycle management.
  • Troubleshoot GPU performance issues, memory bottlenecks, and hardware-level anomalies.


Qualifications


  • Strong experience managing GPU infrastructure (NVIDIA, AMD, SambaNova).
  • Proficiency in resource scheduling and orchestration (Kubernetes, Slurm, Ray, or similar).
  • Knowledge of driver and operator management in multi-vendor environments.
  • Experience with GPU tuning, profiling, and performance benchmarking.
  • Familiarity with observability and alerting tools (Prometheus, Grafana, ELK Stack, etc.).
  • Hands-on experience with model hosting platforms (Triton Inference Server, TensorRT, ONNX Runtime, etc.) is a plus.
  • Working knowledge of Linux systems, Docker/Kubernetes, and CI/CD pipelines.
  • Strong scripting skills in Python, Bash, or Go.


Preferred Skills



  • Bachelor’s or Master’s degree in Computer Science, Engineering, or related field.
  • Certifications in GPU computing (e.g., NVIDIA Certified Administrator, CUDA, or similar).
  • Experience with AI/ML model lifecycle management in production environments.

Apply for this Position

Ready to join? Click the button below to submit your application.

Submit Application