Job Description
**Summary:**
Meta's AI Training and Inference Infrastructure is growing exponentially to support an ever-increasing number of AI use cases. This creates a dramatic scaling challenge that our engineers deal with daily. We need to build and evolve the network infrastructure that connects myriad training accelerators, such as GPUs. In addition, we need to ensure that the network runs smoothly and meets the stringent performance and availability requirements of RDMA workloads, which expect a lossless fabric interconnect. To improve the performance of these systems, we constantly look for opportunities across the stack: network fabric, host networking, communication libraries, and scheduling infrastructure.
**Required Skills:**
AI/HPC System Performance Engineer, PhD Responsibilities:
1. Serve as an active member of a multidisciplinary team developing solutions for large-scale training systems
2. Responsible for the overall performance of the communication system, including perfo...