Job Description
Key Responsibilities
● Design, train, and optimize audio and video ML models, including classification, detection, segmentation, generative models, speech processing, and multimodal architectures.
● Develop and maintain data pipelines for large-scale audio/video datasets, ensuring quality, labeling consistency, and efficient ingestion.
● Implement model evaluation frameworks that measure robustness, latency, accuracy, and overall performance across real-world conditions.
● Work with product teams to transform research prototypes into production-ready models with reliable inference performance.
● Optimize models for scalability, low latency, and edge/cloud deployment, including quantization, pruning, and hardware-aware tuning.
● Collaborate with cross-functional teams to define technical requirements and experiment roadmaps.
● Monitor and troubleshoot production models, ensuring reliability and continuous improvement.
● Stay current with trends in deep learning, computer vision, speech processing, and multimodal AI.
Required Qualifications
● Bachelor's or Master's degree in Computer Science, Electrical Engineering, Machine Learning, or a related field (PhD a plus).
● Strong experience with deep learning frameworks such as PyTorch or TensorFlow.
● Proven experience training and deploying audio or video models, such as:
○ Speech recognition, speech enhancement, speaker identification
○ Audio classification, event detection
○ Video classification, action recognition, tracking
○ Video-to-text, lip reading, multimodal fusion models
● Solid understanding of neural network architectures (CNNs, RNNs, Transformers, diffusion models, etc.).
● Proficiency in Python, along with ML tooling for experimentation and production (e.g., NumPy, OpenCV, FFmpeg, PyTorch Lightning).
● Experience working with GPU/TPU environments, distributed training, and model optimization.
● Ability to write clean, maintainable production-quality code.
Preferred Qualifications
● Experience with foundation models or multimodal transformers (e.g., audio-language, video-language).
● Background in signal processing, feature extraction (MFCCs, spectrograms), or codec-level audio/video understanding.
● Experience with MLOps tools (e.g., MLflow, Weights & Biases, Kubeflow, Airflow).
● Knowledge of cloud platforms (AWS, GCP, Azure) and scalable model serving frameworks.
● Experience with real-time audio/video processing for streaming applications.
● Publications, open-source contributions, or competitive ML achievements are a plus.
Experience:
Minimum 2 years