Job Description
Responsibilities
- Responsible for ensuring our ML systems run efficiently for large model deployment, training, evaluation, and inference
- Responsible for the stability of offline tasks/services in multi-data center, multi-region, and multi-cloud scenarios
- Responsible for resource management and planning, as well as cost and budgeting, across computing and storage resources
- Responsible for global system disaster recovery, cluster machine governance, stability of business services, and improvements in resource utilisation and operational efficiency
- Build software tools, products and systems to monitor and manage the ML infrastructure and services efficiently
- Be part of the global on-call roster that provides system and business support
Qualifications
Minimum Qualifications
- Bachelor's degree or above, majoring in Computer Science, computer engineer...