Job Description
Role Summary
Firmus is seeking a highly skilled and driven Kubernetes HPC Engineer to join our Software Defined Infrastructure team. In this role, you will build high-performance, fault-tolerant, and reliable infrastructure to support bare-metal provisioning, performance benchmarking, and platform validation.
You will be instrumental in ensuring the stability, performance, and continuous improvement of our complex and mission-critical bare-metal HPC GPU clusters.
Key Responsibilities
- Design and implement bare-metal provisioning workflows using Ironic and Kubernetes CRDs.
- Deploy and manage GPU-enabled AI compute nodes with RDMA, InfiniBand, and RoCE networking.
- Optimise Kubernetes and Slurm platforms for multi-node AI training performance, including NCCL, UCX, GPUDirect, and fabric tuning.
- Implement Kubernetes primitives for GPU scheduling, isolation, and resource management.
- Design, deploy, and fine-tune Slu...