Feature

Sovereign Escape Velocity: Ownership w Open Models — Gus Martins & Ian Ballantyne, Google DeepMind

Google DeepMind's Ian Ballantyne and Gus Martins introduce Gemma 4, a family of open models delivering state-of-the-art performance with remarkable size efficiency. They discuss how models like the 31B variant outperform competitors 2 to 20 times their size while running on a single GPU, the shift to an Apache 2.0 license to foster sovereignty and adoption, and the new economics of running powerful agentic workloads on hardware ranging from a Pixel phone to a single enterprise GPU.

The Art & Science of Benchmarking Agents — Vincent Chen, Snorkel AI

Vincent Chen of Snorkel AI discusses the crucial gap between rapidly advancing AI capabilities and our ability to measure them. He presents a framework for building effective benchmarks that covers task quality, distributional diversity, model headroom, and robust evaluation methodologies, alongside the "art" of benchmarking: having a clear thesis, inspiring research roadmaps, and prioritizing researcher UX. He concludes by outlining three critical axes along which future benchmarks should develop to better reflect real-world AI applications: environment complexity, autonomy horizon, and output complexity.

The Rise of the Full-Stack Builder and Hyper-Leveraged Generalist with Microsoft CEO Satya Nadella

Microsoft CEO Satya Nadella discusses the future of AI at Microsoft Build, emphasizing an ecosystem approach where every company can create its own "frontier intelligence." He highlights the critical role of private evaluations as a new form of intellectual property, the strategic use of multi-modal harnesses for enterprise, and how autonomous AI agents are reshaping software development and business models. Nadella also shares insights on the societal impact of AI, from data center investments to the potential for AI-driven transformation in education.

Benchmarking semantic code retrieval on Claude Code — Kuba Rogut, Turbopuffer

This detailed benchmark analysis compares raw Claude Code's code retrieval performance with windowed grep and Turbopuffer's semantic search in LLM agents. With semantic search, file precision improved from 65% to 87% and wasted reads fell from 1 in 3 to 1 in 8. The study also highlights the importance of the agent's understanding of when to use retrieval tools.
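The two reported figures look like complements of one another: if 65% of the files read were relevant, about 35% (roughly 1 in 3) were wasted, and 87% precision leaves about 13% (roughly 1 in 8). The sketch below works through that arithmetic. It assumes "file precision" means the share of files the agent read that were actually relevant to the task, which the summary does not define; the `RetrievalRun` type, the function names, and the file names are hypothetical and are not taken from the benchmark itself.

```python
from dataclasses import dataclass


@dataclass
class RetrievalRun:
    """Hypothetical record of one agent run: what it read vs. what mattered."""
    files_read: set[str]
    relevant_files: set[str]


def file_precision(run: RetrievalRun) -> float:
    """Fraction of files read that were relevant (assumed definition)."""
    if not run.files_read:
        return 0.0
    return len(run.files_read & run.relevant_files) / len(run.files_read)


def wasted_read_rate(run: RetrievalRun) -> float:
    """Fraction of files read that were not relevant."""
    if not run.files_read:
        return 0.0
    return len(run.files_read - run.relevant_files) / len(run.files_read)


def one_in_n(rate: float) -> int:
    """Express a rate as a rounded '1 in N' figure."""
    return round(1 / rate)


# Illustrative run: 8 files read, 7 of them relevant.
run = RetrievalRun(
    files_read={f"src/file_{i}.py" for i in range(8)},
    relevant_files={f"src/file_{i}.py" for i in range(7)},
)
precision = file_precision(run)
wasted = wasted_read_rate(run)

# The reported precisions converted to "1 in N" wasted reads.
baseline_wasted = one_in_n(1 - 0.65)
semantic_wasted = one_in_n(1 - 0.87)
```

Under this assumed definition, the reported precision and wasted-read numbers describe the same improvement from two angles rather than two independent gains.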

Task Fidelity Scaling Laws — Kobie Crawford, Snorkel

An experiment by Snorkel AI shows that in agentic AI training, task quality is paramount. Using the same model and compute, fine-tuning on high-quality tasks yielded a 6% performance improvement, a 5x greater uplift than the 1% gain from low-quality tasks. The key difference lies in the nature of the tasks: high-quality tasks are genuinely harder, with more tool calls and cleaner failure modes that provide a meaningful learning signal. Low-quality tasks, by contrast, often fail because of ambiguity and environmental noise, which hinders effective model improvement.

GitHub’s Agent Era: 14x Commits, 200M Developers, Copilot’s Next Act — Kyle Daigle

GitHub COO Kyle Daigle discusses the new era of AI agents from the inside. He covers how he uses AI for leadership, the shift from "mega-skills" to "micro-skills," and how GitHub is navigating a 14x growth in commits. The conversation goes deep on the evolution of Copilot, the future of PRs in an agent-driven world, the challenges of scaling, and Microsoft's vision for an ambient AI operating system.