Job Description
- Monitor and support AI IaaS infrastructure, including GPU servers, storage, and networking.
- Conduct first-level troubleshooting of InfiniBand networks, identifying connectivity or performance issues.
- Maintain and update infrastructure inventory, cabling, and IPAM data using NetBox.
- Support InfiniBand fabric operations and troubleshooting using UFM, and configure and validate Ethernet networks using Verity.
- Perform initial triage of alerts across hardware, networking, and Kubernetes platforms.
- Manage the full lifecycle of bare metal nodes, including provisioning, reprovisioning, validation, and burn-in testing.
- Execute standard operating procedures (SOPs) for incident response and escalation.
- Validate hardware health (GPUs, NICs, disks, memory, etc.) and perform basic diagnostics.
- Escalate complex issues to L2 with clear logs, diagnostics, and impact assessment.
- Assis...
Apply for this Position
Ready to join Mirantis? Click the button below to submit your application.