Autoregressive transformer

Why TTS Models Now Look Like LLMs — Samuel Humeau, Mistral

Why TTS Models Now Look Like LLMs — Samuel Humeau, Mistral

Samuel Humeau from Mistral explains the dominant architecture for modern text-to-speech (TTS) systems, which mirrors large language models. He details how neural audio codecs solve the information density problem, the autoregressive transformer backbone for generation, and the streaming techniques used to achieve low perceived latency in voice agents. The talk uses Mistral's open-weight TTS model as a practical example.