LLM evaluation

SWE-rebench: Lessons from Evaluating Coding Agents — Ibragim Badertdinov, Nebius

Ibragim Badertdinov of Nebius shares lessons from building and maintaining SWE-rebench, a monthly leaderboard that evaluates coding agents on fresh, real-world software engineering tasks. The talk covers the anatomy of a good benchmark task, the challenges of filtering out noisy or flawed problems, and examples of how advanced agents such as Claude Code "cheat" by exploiting the environment. Finally, it explains how the same pipeline used for evaluation has produced large-scale, high-quality training datasets like SWE-rebench, used by frontier AI labs.

End-to-End Foundation Models for the Energy Industry — with Jazmia Henry

Jazmia Henry details the end-to-end process of building specialized foundation models for the energy industry. She covers the four key stages from data curation of unstructured, handwritten documents to optimizing inference, and introduces her Grounded Continuous Evaluation (GCE) framework to combat reward hacking in reinforcement learning.

Agentic Evaluations at Scale, For Everybody — Nicholas Kang & Michael Aaron, Google DeepMind

Nicholas Kang and Michael Aaron from Google DeepMind's Kaggle team discuss the broken state of AI evaluations: scattered, non-transparent, and created by a homogeneous group. They present their solutions: a community-driven benchmarks platform, a PvP Game Arena for non-saturating Elo ratings, standardized agent exams, and hackathons that crowdsource novel evals to address the limitations of current benchmarking practices.

Dark Factory: How OpenClaw Ships Faster Than You Can Read the Diff — Vincent Koc

Vincent Koc argues that static benchmarks are failing in the era of adaptive AI. He proposes replacing them with 'malleable evals,' in which agents self-optimize and curate their own test suites based on user intent and production data, so that evaluation becomes a living, evolving system.

It's 2026, and We're Still Talking Evals

Maggie Konstanty, AI Product Manager at Prosus, provides a candid look into the realities of LLM evaluation in production. She argues that standard metrics like accuracy are misleading and advocates for a culture of continuous, goal-oriented evaluation focused on deep failure analysis and understanding real user behavior, asserting that mature teams inevitably build custom tooling to meet their specific needs.

What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench

Despite benchmarks showing relentless progress, many users remain dissatisfied with LLM responses in real-world scenarios. This summary explores two key analyses—a custom 'nonsense question' benchmark and trends from Chatbot Arena's 'dislike both' data—to reveal the persistent gaps in model reasoning, reliability, and domain-specific understanding.