Text to speech

Why TTS Models Now Look Like LLMs — Samuel Humeau, Mistral

Samuel Humeau from Mistral explains the dominant architecture for modern text-to-speech (TTS) systems, which mirrors that of large language models. He details how neural audio codecs solve the information density problem, the autoregressive transformer backbone used for generation, and the streaming techniques used to achieve low perceived latency in voice agents. The talk uses Mistral's open-weight TTS model as a practical example.

ElevenLabs CEO: Why Voice is the Next AI Interface

Mati Staniszewski, CEO of ElevenLabs, discusses the company's strategy for rapidly shipping research-grade AI. He covers its organizational structure of small, autonomous teams, its global, remote-first hiring philosophy, the transition from a creator-focused product to an enterprise platform, and the lessons learned in navigating complex licensing and scaling a go-to-market team.

Distant conversational speech recognition: Challenges and Opportunities

Dr. Samuele Cornell from Carnegie Mellon University discusses the persistent challenges in distant automatic speech recognition (DASR) for spontaneous, multi-party conversations. He explains why state-of-the-art systems falter in real-world scenarios and presents recent advancements through three key efforts: (1) insights from the CHiME-7/8 DASR challenges, which benchmark robust meeting transcription; (2) progress towards unified end-to-end models that jointly handle diarization and recognition; and (3) novel techniques for generating realistic, large-scale training data using a combination of large language models and multi-speaker text-to-speech systems.