Job Description

Role - Sr. HPC Administrator

Years of Experience - 10 to 18 years

Location - Bangalore


  • Graduate with at least 6 to 8 years of strong experience in handling HPC infrastructure.
  • Strong experience in providing support for Linux HPC clusters.
  • Strong working knowledge of the following:
      • IBM Platform LSF 9 and 10 administration.
      • Red Hat Enterprise Linux administration.
      • Lustre parallel file system.
      • Mellanox InfiniBand connectivity.
      • Cluster manager administration (HPCM or xCAT).
      • SSSD and NIS authentication mechanisms.
      • Bash and Python scripting.
      • Ansible playbooks.
  • Experience with Abaqus and CFD applications (Fluent, StarCCM+, etc.).
  • Strong knowledge of application installations and version management on shared file systems.
  • IT infrastructure technical operations management under the ITIL framework.
  • Security compliance and remediation management (intermediate level).
  • DevOps, ITIL, Agile, and SAFe (certifications are desirable).
  • Installation, configuration, troubleshooting and administration of Linux HPC clusters (compute, storage, and network) and applications in support of CAE environments.
  • Monitor and analyze LSF job queues and resource utilization to optimize workload management.
  • Troubleshoot and resolve any issues with LSF and its components, including master servers, compute nodes, and resource managers.
  • Collaborate with users to understand their HPC requirements and design LSF job workflows to meet their needs.
  • Develop and maintain LSF documentation, including standard operating procedures, installation guides, and troubleshooting procedures.
  • Develop and maintain LSF scripts for automation and task scheduling.
  • Diagnose and troubleshoot complex RHEL OS, application and HPC cluster technical problems.
  • Interact with hardware and software vendors for external support.
  • Develop and maintain technical solution documents (TSDs) and standard operating procedures (SOPs).
  • Keep all HPC infrastructure systems, servers, and devices up to date and in working condition to support business continuity.
  • Design and implement HPC network topology, including Mellanox connectivity.
  • Create and maintain HPC capacity plans and periodic cluster utilization reports.
  • Troubleshoot Abaqus, StarCCM+ and Fluent applications, and resolve any issues in a timely manner.
  • Develop and maintain Python and Bash scripts for automation and task scheduling.
