Job Description

About Pocket FM


Pocket FM is on a mission to deliver personalized and immersive audio experiences to listeners worldwide. We are revolutionizing the audio entertainment industry through long-form storytelling, supported by our cutting-edge platform that serves millions of listeners and generates billions of minutes of engagement monthly. We leverage Generative AI in producing content and streamlining operations, developing innovative solutions for cutting-edge challenges in the AI landscape across all modalities—text, audio, and images. With strong backing and rapid user base growth, Pocket FM is an exciting and dynamic place to join.


About the Role


We are seeking an experienced research scientist to drive innovation in long-form content generation and localization. Your work will focus on creating seamless, culturally tailored storytelling experiences, evaluating content quality through user engagement metrics, and transforming research breakthroughs into tangible solutions. You will lead the development of state-of-the-art TTS systems to create highly natural and expressive voices for our immersive audio storytelling platform. Your focus will be on building low-latency, end-to-end neural speech models that can accurately capture emotion and cultural nuances in multiple languages. This role offers the opportunity to contribute to cutting-edge research while also having a direct and measurable impact on the company’s success.

The team is open to the candidate being located anywhere in North America or India, with occasional travel to meet the team in person a few times a year.


Key Responsibilities

  • Model Development: Design, implement, and optimize modern neural TTS systems, including diffusion- and flow-based architectures, neural codec–based speech generation, and LLM-conditioned or hybrid speech synthesis models for expressive, long-form audio.
  • Speech Controllability: Develop methods for fine-grained control over speech attributes like pitch, rhythm, emotion, and speaker style to enhance storytelling quality.
  • Efficiency & Latency: Optimize models for real-time inference and high-scale production, utilizing techniques like knowledge distillation and model quantization.
  • Multilingual Synthesis: Spearhead research into cross-lingual and multilingual TTS to support global content localization.
  • Quality Evaluation: Design and implement robust evaluation frameworks, including MOS (Mean Opinion Score) and objective metrics, to assess the naturalness and intelligibility of generated speech.



Qualifications

  • Domain Expertise: Demonstrated experience in speech synthesis, digital signal processing (DSP), and audio analysis.
  • TTS Tooling: Proficiency with speech-specific frameworks and libraries such as Coqui TTS, ESPnet, or NVIDIA NeMo.
  • Advanced Architectures: Hands-on experience with sequence-to-sequence models, GANs, Variational Autoencoders (VAEs), and Diffusion models for audio.
  • Data Processing: Experience in building high-quality audio datasets, including voice cloning, speaker verification, and handling prosody.
  • Master’s or PhD degree in Computer Science, Machine Learning, or a related field
  • Significant Python and applied research experience in industrial settings
  • Proficiency in frameworks such as PyTorch or TensorFlow
  • Demonstrated experience in deep learning, especially language modeling with transformers and machine translation
  • Prior experience working with vector databases, search indices, or other data stores for search and retrieval use cases
  • Preference for fast-paced, collaborative projects with concrete goals, quantitatively tested through A/B experiments
  • Published research in peer-reviewed journals and conferences on relevant topics


Join us in this exciting opportunity to contribute to groundbreaking research in Generative AI technologies that will impact millions of users globally.
