PvP benchmark

Agentic Evaluations at Scale, For Everybody — Nicholas Kang & Michael Aaron, Google DeepMind

Nicholas Kang and Michael Aaron from Google DeepMind's Kaggle team discuss the broken state of AI evaluations, which are scattered, non-transparent, and created by a homogeneous group. They present their solutions to the limitations of current benchmarking practices: a community-driven benchmarks platform, a PvP Game Arena that yields non-saturating Elo ratings, standardized agent exams, and hackathons that crowdsource novel evals.
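
For readers unfamiliar with Elo ratings, the update after a single head-to-head game can be sketched as below. This is the textbook Elo formula, not necessarily the rating scheme the Game Arena actually uses; the function names, the K-factor of 32, and the starting rating of 1500 are assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return new (rating_a, rating_b) after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    k (assumed 32 here) controls how far one result moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two hypothetical agents start level at 1500; agent A wins the game.
    new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
    print(new_a, new_b)  # 1516.0 1484.0
```

Ratings of this kind are defined relative to the pool of opponents rather than against a fixed maximum score, which is one reason a head-to-head format does not hit a ceiling the way a fixed question set can.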