Job Description
The DevOps & ML Ops Engineer will be responsible for developing and maintaining scalable, stable services that deliver machine learning models to end users with guaranteed uptime. The primary focus will be on the infrastructure, deployment, and continuous integration/continuous delivery (CI/CD) processes for our ML services.
Responsibilities
Manage resource allocation and workload scheduling for multiple ML services, ensuring efficient utilization of CPU/GPU resources and creating reliable queues based on service priorities.
Maintain VM environments, manage OS updates, and keep the VM inventory up to date.
Work alongside the Dev and QA teams to detect hot spots in our applications and put preventative measures in place before they become live issues.
Troubleshoot system configurations and provide solutions.
Plan, execute, and test disaster recovery.
Monitor and examine all application, performance, event, and system logs to assist in troubleshooting.
Apply for this Position
Ready to join EPAM Systems? Click the button below to submit your application.