Benchmarking

Task Fidelity Scaling Laws — Kobie Crawford, Snorkel

An experiment by Snorkel AI reveals that in agentic AI training, the quality of tasks is paramount. Using the same model and compute, fine-tuning on high-quality tasks yielded a 6% performance improvement, a 5x greater uplift compared to the 1% gain from low-quality tasks. The key difference lies in the nature of the tasks: high-quality tasks are genuinely harder, featuring more tool calls and cleaner failure modes that provide a meaningful learning signal. In contrast, low-quality tasks often fail due to ambiguity and environmental noise, hindering effective model improvement.

Agentic Evaluations at Scale, For Everybody — Nicholas Kang & Michael Aaron, Google DeepMind

Nicholas Kang and Michael Aaron from Google DeepMind's Kaggle team describe the broken state of AI evaluations, which are scattered, non-transparent, and created by a homogeneous group. Their proposed solutions are a community-driven benchmarks platform, a PvP Game Arena for non-saturating Elo ratings, standardized agent exams, and hackathons that crowdsource novel evals to address the limitations of current benchmarking practices.

OpenClaw's Memory Sucks and the fix is simple — Dhravya Shah, Supermemory

Dhravya Shah, founder of Supermemory, details the evolution of his company from a simple RAG-based consumer app to a sophisticated, open-source context infrastructure for AI, and introduces a novel hooks-based memory solution for OpenClaw.

Are AI Benchmarks Telling The Full Story? [SPONSORED]

AI models are often benchmarked like Formula 1 cars, excelling on technical exams but failing the test of daily human experience. Researchers Andrew Gordon and Nora Petrova from Prolific critique the 'leaderboard illusion' of current ranking systems and introduce their HUMAINE leaderboard, a new framework that uses census-based sampling and the TrueSkill algorithm to measure how helpful, safe, and relatable models are to real people, not just tech enthusiasts.

Why AI Needs Culture (Not Just Data) - Prolific [Sponsored]

Sara Saab and Enzo Blindow from Prolific discuss the critical, and growing, need for high-quality human evaluation in the age of non-deterministic AI. They explore the limitations of current benchmarks, the dangers of agentic misalignment as revealed by Anthropic's research, and how Prolific is building a "science of evals" by treating human feedback as a robust infrastructure layer.

Evals in Action: From Frontier Research to Production Applications

An overview of OpenAI's approach to AI evaluation, covering the GDPval benchmark for frontier models and the practical tools available for developers to evaluate their own custom agents and applications.