Job Description
Role - Sr. HPC Administrator
Years of Experience - 10 to 18 years
Location - Bangalore
- Graduate with at least 6 to 8 years of strong experience in handling HPC infrastructure.
- Strong experience in providing support for Linux HPC clusters.
- Strong working knowledge of the following:
  - IBM Platform LSF 9 and 10 administration.
  - Red Hat Enterprise Linux administration.
  - Lustre parallel file system.
  - Mellanox InfiniBand connectivity.
  - Cluster manager administration (HPCM or xCAT).
  - SSSD and NIS authentication mechanisms.
  - Bash and Python scripting.
  - Ansible playbooks.
- Experience with Abaqus and CFD applications (Fluent, StarCCM+, etc.).
- Strong knowledge of application installations and version management on shared file systems.
- IT infrastructure technical operations management under the ITIL framework.
- Security compliance and remediation management (intermediate level).
- DevOps, ITIL, Agile, SAFe (certifications are desirable).
- Installation, configuration, troubleshooting and administration of Linux HPC clusters (compute, storage, and network) and applications in support of CAE environments.
- Monitor and analyze LSF job queues and resource utilization to optimize workload management.
- Troubleshoot and resolve any issues with LSF and its components, including master servers, compute nodes, and resource managers.
- Collaborate with users to understand their HPC requirements and design LSF job workflows to meet their needs.
- Develop and maintain LSF documentation, including standard operating procedures, installation guides, and troubleshooting procedures.
- Develop and maintain LSF scripts for automation and task scheduling.
- Diagnose and troubleshoot complex RHEL OS, application and HPC cluster technical problems.
- Interact with hardware and software vendors for external support.
- Develop and maintain technical solution documents (TSDs) and standard operating procedures (SOPs).
- Keep all HPC infrastructure systems, servers, and devices up to date and in working condition to support business continuity.
- Design and implement HPC network topology, including Mellanox connectivity.
- Create and maintain HPC capacity plans and periodic cluster utilization reports.
- Troubleshoot Abaqus, StarCCM+ and Fluent applications, and resolve any issues in a timely manner.
- Develop and maintain Python and Bash scripts for automation and task scheduling.