Speech synthesis

From Transcription to Live Music: Gemini's Audio Stack — Thor Schaeff, Google DeepMind

Thor Schaeff of Google DeepMind demos Gemini's audio stack, starting with a single Gemini API call that returns rich transcription (speaker names, emotions, translation). He then shows speech generation steered by "director's notes" rather than a voice catalog, the real-time, sound-to-sound Gemini 1.5 Flash Live model, and a live on-stage demo in which Gemini Live uses the Lyria 2 model as a tool to generate a full song.

MLX Genmedia — Prince Canuma, Arcee

A tour of MLX, the on-device AI framework for Apple Silicon. The talk covers real-world applications, from real-time vision and multimodal omni models to sub-100 ms speech synthesis and video generation, all running locally. It highlights techniques such as Turbo Quant for 1M context, showcases community projects in robotics and native apps, and argues for a future where powerful AI runs without the cloud.

Mistral: Voxtral TTS, Forge, Leanstral, & Mistral 4 — w/ Pavan Kumar Reddy & Guillaume Lample

Mistral's Pavan Kumar Reddy (Voxtral lead) and Guillaume Lample (Chief Scientist) discuss the new Voxtral TTS model and its novel architecture, which uses flow matching for efficient, high-quality speech generation. They explain Mistral's strategy of delivering specialized, open-weight models, and describe the Mistral Forge platform, which lets enterprises fine-tune on their proprietary data for privacy, lower cost, and better performance. The conversation also covers Mistral Small, the future of AI for science, and the company's commitment to open source and foundational research, including formal proving as a proxy for long-horizon reasoning.