Job Description

Role Summary
Firmus is seeking a highly skilled and driven Kubernetes HPC Engineer to join our Software Defined Infrastructure team. In this role, you will build high-performance, fault-tolerant, and reliable infrastructure to support bare-metal provisioning, performance benchmarking, and platform validation.
You will be instrumental in ensuring the stability, performance, and continuous improvement of our complex and mission-critical bare-metal HPC GPU clusters.
Key Responsibilities
Design and implement bare-metal provisioning workflows using Ironic and Kubernetes CRDs.
Deploy and manage GPU-enabled AI compute nodes with RDMA, InfiniBand, and RoCE networking.
Optimise Kubernetes and Slurm platforms for multi-node AI training performance, including NCCL, UCX, GPUDirect, and fabric tuning.
Implement Kubernetes primitives for GPU scheduling, isolation, and resource management.
Design, deploy, and fine-tune Slurm GPU clusters with topology-aware configuratio...