Snorkel AI

The Art & Science of Benchmarking Agents — Vincent Chen, Snorkel AI

Vincent Chen of Snorkel AI discusses the gap between rapidly advancing AI capabilities and our ability to measure them. He presents a framework for building effective benchmarks that covers the "science" (task quality, distributional diversity, model headroom, and robust evaluation methodology) and the "art" (a clear thesis, an inspiring research roadmap, and a good researcher UX). He concludes with three axes along which future benchmarks should grow to better reflect real-world AI applications: environment complexity, autonomy horizon, and output complexity.

Task Fidelity Scaling Laws — Kobie Crawford, Snorkel

An experiment by Snorkel AI indicates that task quality is paramount in agentic AI training. With the same model and compute, fine-tuning on high-quality tasks yielded a 6% performance improvement, a 5x greater uplift than the 1% gain from low-quality tasks. The difference lies in the nature of the tasks: high-quality tasks are genuinely harder, involve more tool calls, and fail in cleaner ways that give the model a meaningful learning signal. Low-quality tasks, by contrast, often fail because of ambiguity and environmental noise, which hinders model improvement.