Job Description

Role Summary

Firmus is seeking a highly skilled and driven Kubernetes HPC Engineer to join our Software Defined Infrastructure team. In this role, you will build high-performance, fault-tolerant, and reliable infrastructure to support bare-metal provisioning, performance benchmarking, and platform validation.

You will be instrumental in ensuring the stability, performance, and continuous improvement of our complex and mission-critical bare-metal HPC GPU clusters.

Key Responsibilities

  • Design and implement bare-metal provisioning workflows using Ironic and Kubernetes CRDs.
  • Deploy and manage GPU-enabled AI compute nodes with RDMA, InfiniBand, and RoCE networking.
  • Optimise Kubernetes and Slurm platforms for multi-node AI training performance, including NCCL, UCX, GPUDirect, and fabric tuning.
  • Implement Kubernetes primitives for GPU scheduling, isolation, and resource management models.
  • Design, deploy, and fine-tune Slu...
