Human Preference

Dec 20, 2025

Are AI Benchmarks Telling The Full Story? [SPONSORED]

AI models are often benchmarked like Formula 1 cars, excelling on technical exams but failing the test of daily human experience. Researchers Andrew Gordon and Nora Petrova from Prolific critique the 'leaderboard illusion' of current ranking systems and introduce their HUMAINE leaderboard, a new framework that uses census-based sampling and the TrueSkill algorithm to measure how helpful, safe, and relatable models are to real people, not just tech enthusiasts.

Are AI Benchmarks Telling The Full Story? [SPONSORED]