Neural audio codec

Why TTS Models Now Look Like LLMs — Samuel Humeau, Mistral

Samuel Humeau from Mistral explains the dominant architecture for modern text-to-speech (TTS) systems, which mirrors that of large language models. He covers how neural audio codecs solve the information density problem, how an autoregressive transformer backbone handles generation, and which streaming techniques achieve low perceived latency in voice agents. The talk uses Mistral's open-weight TTS model as a practical example.

Mistral: Voxtral TTS, Forge, Leanstral, & Mistral 4 — w/ Pavan Kumar Reddy & Guillaume Lample

Mistral's Pavan (Voxtral lead) and Guillaume (Chief Scientist) discuss the new Voxtral TTS model and its novel architecture, which uses flow matching for efficient, high-quality speech generation. They elaborate on Mistral's strategy of delivering specialized, open-weight models and on the Mistral Forge platform, which lets enterprises fine-tune models on their proprietary data for privacy, cost-effectiveness, and superior performance. The conversation also covers Mistral Small, the future of AI for science, and the company's commitment to open source and foundational research, including formal proving as a proxy for long-horizon reasoning.