AI benchmarks

Fable 5: The Full Story from Capabilities to Drama (Ep. 1002 with Jon Krohn)

Anthropic's highly anticipated Claude Fable 5 model, a public version of its advanced "Mythos class" AI with state-of-the-art capabilities in software, vision, and long-context tasks, was released and then swiftly pulled offline by the U.S. government after just three days. The removal, initiated as an export control action over national security concerns stemming from a disputed "jailbreak" claim, highlights the growing tension between frontier AI development, AI safety, and regulatory oversight.

The Art & Science of Benchmarking Agents — Vincent Chen, Snorkel AI

Vincent Chen of Snorkel AI discusses the crucial gap between rapidly advancing AI capabilities and our ability to measure them. He presents a framework for building effective benchmarks, encompassing task quality, distributional diversity, model headroom, and robust evaluation methodologies, alongside the "art" of having a clear thesis, inspiring research roadmaps, and prioritizing researcher UX. He concludes by outlining three critical axes for future benchmarks: environment complexity, autonomy horizon, and output complexity, to better reflect real-world AI applications.

What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench

Despite benchmarks showing relentless progress, many users remain dissatisfied with LLM responses in real-world scenarios. This summary explores two key analyses, a custom 'nonsense question' benchmark and trends from Chatbot Arena's 'dislike both' data, to reveal the persistent gaps in model reasoning, reliability, and domain-specific understanding.

How Intelligent Is AI, Really?

Greg Kamradt of the ARC Prize Foundation explains how the ARC-AGI benchmark is shifting the focus of AI evaluation from memorization to true intelligence, defined as the ability to generalize and learn new skills efficiently. He discusses the history of ARC-AGI, how it revealed the limits of early LLMs and highlighted the recent "reasoning breakthrough," and details the upcoming interactive ARC-AGI v3, which will measure AI performance against a human baseline with zero instructions.

The 100-person lab that became Anthropic and Google's secret weapon | Edwin Chen (Surge AI)

Edwin Chen, founder and CEO of Surge AI, discusses his contrarian approach to building a bootstrapped, billion-dollar company, the critical role of high-quality data and 'taste' in training AI, the flaws in current benchmarks, and why reinforcement learning environments are the next frontier for creating models that truly advance humanity.

912: In Case You Missed It in July 2025 — with Jon Krohn (@JonKrohnLearns)

A review of five key interviews covering the importance of data-centric AI (DMLR) in specialized fields like law, the challenges of AI benchmarking, strategies for domain-specific model selection using red teaming, the power of AI in predicting human behavior, and the shift towards building causal AI models.