Job Description

Role: Sr. HPC Administrator

Desired Experience Range: 7-12 years

Notice Period: Immediate to 60 Days only

Location of Requirement: Bangalore


JOB DESCRIPTION

● Strong experience in providing support for Linux HPC clusters.

● Strong working knowledge of the following:

o IBM Platform LSF 9 and 10 administration.

o Red Hat Enterprise Linux administration.

o Lustre parallel file system.

o Mellanox InfiniBand connectivity.

o Cluster manager administration (HPCM or xCAT).

o SSSD and NIS authentication mechanisms.

o Bash and Python scripting.

o Ansible playbooks.

● Experience with Abaqus and CFD applications (Fluent, StarCCM+, etc.).

● Strong knowledge of application installations and version management on shared file systems.

● IT infrastructure technical operations management under the ITIL framework.

● Security compliance and remediation management.

Intermediate Level

● DevOps, ITIL, Agile, SAFe (certifications are desirable).

Responsibilities

● Installation, configuration, troubleshooting, and administration of Linux HPC clusters (compute, storage, and network) and applications in support of CAE environments.

● Monitor and analyze LSF job queues and resource utilization to optimize workload management.

● Troubleshoot and resolve any issues with LSF and its components, including master servers, compute nodes, and resource managers.

● Collaborate with users to understand their HPC requirements and design LSF job workflows to meet their needs.

● Develop and maintain LSF documentation, including standard operating procedures, installation guides, and troubleshooting procedures.

● Develop and maintain LSF scripts for automation and task scheduling.

● Diagnose and troubleshoot complex technical problems with RHEL, applications, and HPC clusters.

● Interact with hardware and software vendors for external support.

● Develop and maintain technical solution documents (TSDs) and standard operating procedures (SOPs).

● Keep all HPC infrastructure systems, servers, and devices up to date and in working condition to support business continuity.

● Design and implement HPC network topology, including Mellanox connectivity.

● Create and maintain HPC capacity plans and periodic cluster utilization reports.

● Troubleshoot Abaqus, StarCCM+, and Fluent applications and resolve issues in a timely manner.

● Develop and maintain scripts for automation and task scheduling using Python and Bash.
