Multimodality

⚡️ Google's Open AI Strategy — Omar Sanseviero, Google DeepMind

An in-depth look at Gemma 4's novel transformer architecture with per-layer embeddings, enabling efficient parameter offloading for on-device inference. The discussion also covers its native multimodality, the state of fine-tuning, text-based diffusion models, and the growing intersection of research and engineering.

Let's go Bananas with GenMedia — Guillaume Vernade, Google DeepMind

Guillaume Vernade from Google DeepMind demonstrates a full generative media pipeline, using Gemini to read a public domain book and act as a master prompt engineer for other models. Imagen generates character portraits, Veo animates scenes into video, Lyria composes a unique soundtrack for each chapter, and a clever TTS trick creates a multi-character audiobook.

Computer use in Codex

Ari Weinstein discusses how Codex's 'computer use' feature allows the AI agent to operate local Mac applications in the background by combining multimodal vision with accessibility data, enabling non-intrusive, parallel task execution.

Build & deploy AI-powered apps — Paige Bailey, Google DeepMind

A developer-focused, demo-heavy session on rapid AI prototyping using the Google DeepMind stack. It covers how to leverage the full capabilities of AI Studio, from video analysis and code execution with Gemini 3.1 Flash to building full-stack applications with databases and exploring the frontiers of generative media with Genie 3, Veo 3.1 Lite, and Lyria 3.

Open Models at Google DeepMind — Cassidy Hardin, Google DeepMind

Cassidy Hardin from Google DeepMind introduces Gemma 4, a new family of open-weight models with significant architectural and performance improvements. This summary covers the four new models (31B Dense, 26B MoE, and two "Effective" on-device models), examines architectural changes such as mixed global/local attention and Per-Layer Embeddings (PLE), and details the new native multimodal capabilities for vision and audio.

The Limits of Today’s AI Models

Karan Goel, CEO of Cartesia, discusses the fundamental limitations of Transformer architectures, arguing they behave more like retrieval systems than learning systems. He explains how State Space Models (SSMs) enable compression and abstraction, and why Cartesia is tackling multimodal intelligence by first solving for voice AI, aiming to develop a transferable 'recipe' for end-to-end representation learning.