Job Description
Site Reliability Engineer - Machine Learning Systems (Singapore)
Job Code: A A
Responsibilities
- Ensure our ML systems operate efficiently for large model deployment, training, evaluation, and inference.
- Maintain the stability of offline tasks and services across multi‑data‑center, multi‑region, and multi‑cloud environments.
- Manage resource planning, cost, and budget, including computing and storage resources.
- Implement global disaster recovery and cluster machine governance, and improve business service stability, resource utilization, and operational efficiency.
- Build software tools, products, and systems to monitor and manage ML infrastructure and services efficiently.
- Participate in the global team's on‑call roster to support systems and business services.
Minimum Qualifications
- Bachelor’s degree or above in Computer Science, Computer Engineering, or related fields. ...
Apply for this Position
Ready to join ByteDance? Click the button below to submit your application.