Audio generation

From Transcription to Live Music: Gemini's Audio Stack — Thor Schaeff, Google DeepMind

Thor Schaeff of Google DeepMind demos Gemini's audio stack, starting with a single API call to Gemini for rich transcription (speaker names, emotions, translation). He then shows speech generation directed by "director's notes" instead of a voice catalog, the real-time, sound-to-sound Gemini 1.5 Flash Live model, and a live demo in which Gemini Live uses the Lyria 2 model as a tool to generate a full song on stage.

Mistral: Voxtral TTS, Forge, Leanstral, & Mistral 4 — w/ Pavan Kumar Reddy & Guillaume Lample

Mistral's Pavan Kumar Reddy (Voxtral lead) and Guillaume Lample (Chief Scientist) discuss the new Voxtral TTS model and its architecture, which uses flow matching for efficient, high-quality speech generation. They explain Mistral's strategy of delivering specialized, open-weight models, and the Mistral Forge platform, which lets enterprises fine-tune on their proprietary data for privacy, lower cost, and better performance. The conversation also covers Mistral Small, the future of AI for science, and the company's commitment to open source and foundational research, including formal proving as a proxy for long-horizon reasoning.

Introducing Sora 2

A detailed overview of OpenAI's announcement of Sora 2, a flagship video and audio generation model, and the new Sora app, which introduces novel features like "Cameo" for personalized content creation and a new social experience.

Make some noise: Teaching the language of audio to an LLM using sound tokens

Shivam Mehta from KTH presents a method for teaching Large Language Models (LLMs) to understand and generate audio by treating it as a discrete language. The approach involves a two-step process: first, creating an ultra-low bitrate (0.293 kbps) audio representation using a causal variational autoencoder, and second, fine-tuning a Llama 7B model with these audio tokens using LoRA.