Job Description

Responsibilities

  • Responsible for ensuring our ML systems run efficiently for large model deployment, training, evaluation, and inference
  • Responsible for the stability of offline tasks/services in multi-data center, multi-region, and multi-cloud scenarios
  • Responsible for managing and planning computing and storage resources, including cost and budget
  • Responsible for global system disaster recovery, cluster machine governance, and the stability of business services, as well as improving resource utilisation and operational efficiency
  • Build software tools, products and systems to monitor and manage the ML infrastructure and services efficiently
  • Be part of the global on-call roster that supports systems and business services

Qualifications

  • Minimum Qualifications
    • Bachelor's degree or above, majoring in Computer Science, computer engineer...

Apply for this Position

Ready to join ByteDance? Click the button below to submit your application.