Job Description

Site Reliability Engineer - Machine Learning Systems (Singapore)

Job Code: A A

Responsibilities

  • Ensure our ML systems operate efficiently for large model deployment, training, evaluation, and inference.
  • Maintain stability of offline tasks/services across multi‑data center, multi‑region, and multi‑cloud scenarios.
  • Manage resource planning, cost, and budget for compute and storage resources.
  • Implement global system disaster recovery and cluster machine governance, and improve business service stability, resource utilization, and operational efficiency.
  • Build software tools, products, and systems to monitor and manage ML infrastructure and services efficiently.
  • Participate in the global on‑call roster that supports systems and business services.

Minimum Qualifications

  • Bachelor’s degree or above in Computer Science, Computer Engineering, or related fields.
  • ...

Apply for this Position

Ready to join ByteDance? Submit your application.