Benchmarking

Why AI Needs Culture (Not Just Data) - Prolific [Sponsored]

Sara Saab and Enzo Blindow from Prolific discuss the critical and growing need for high-quality human evaluation in the age of non-deterministic AI. They explore the limitations of current benchmarks, the dangers of agentic misalignment revealed by Anthropic's research, and how Prolific is building a "science of evals" by treating human feedback as a robust infrastructure layer.

Evals in Action: From Frontier Research to Production Applications

An overview of OpenAI's approach to AI evaluation, covering the GDPval benchmark for frontier models and the practical tools available for developers to evaluate their own custom agents and applications.
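
The developer-tooling side is easiest to picture as a small eval loop: run your agent over a fixed set of tasks and grade each output, either with exact-match checks or an LLM judge. Below is a minimal sketch using the OpenAI Python SDK; the `run_agent` function, the dataset, and the model name are illustrative placeholders, not anything shown in the talk.

```python
# Minimal eval loop for a custom agent: run fixed tasks, grade with an LLM judge.
# `run_agent`, the eval set, and the model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

EVAL_SET = [
    {"task": "Summarize: The cat sat on the mat.", "reference": "A cat sat on a mat."},
    {"task": "What is 2 + 2?", "reference": "4"},
]

def run_agent(task: str) -> str:
    # Stand-in for your own agent or application under test.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content

def judge(task: str, output: str, reference: str) -> bool:
    # LLM-as-judge grader: asks a model whether the output matches the reference.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Task: {task}\nReference answer: {reference}\nModel answer: {output}\n"
                "Does the model answer match the reference? Reply PASS or FAIL."
            ),
        }],
    )
    return "PASS" in resp.choices[0].message.content.upper()

scores = [judge(ex["task"], run_agent(ex["task"]), ex["reference"]) for ex in EVAL_SET]
print(f"pass rate: {sum(scores) / len(scores):.0%}")
```

The same loop scales from a handful of smoke-test cases to a versioned eval suite; the key design choice is keeping the task set fixed so scores are comparable across agent revisions.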

Using LongMemEval to Improve Agent Memory

Sam Bhagwat of Mastra details their process for optimizing AI agent memory against the LongMemEval benchmark. He breaks memory down into subtasks such as temporal reasoning and knowledge updates, and shares how targeted improvements (tailored templates, targeted data updates, and structured message formatting) led to state-of-the-art performance, emphasizing the importance of iterative evaluation.
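
One of those techniques, structured message formatting, amounts to serializing conversation history with explicit roles and timestamps so the model has something concrete to anchor temporal reasoning and knowledge updates on. The Python sketch below is only an illustration of that idea under assumed data shapes, not Mastra's actual implementation (which is TypeScript).

```python
# Rough sketch of structured message formatting for agent memory.
# Not Mastra's implementation; it only illustrates attaching explicit timestamps
# and roles so temporal-reasoning questions have something to anchor on.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MemoryMessage:
    role: str          # "user" or "assistant"
    content: str
    timestamp: datetime

def format_memory(messages: list[MemoryMessage]) -> str:
    """Serialize history into a structured block injected into the agent's context."""
    lines = ["<conversation_history>"]
    for m in sorted(messages, key=lambda m: m.timestamp):
        lines.append(f"[{m.timestamp.isoformat()}] {m.role}: {m.content}")
    lines.append("</conversation_history>")
    return "\n".join(lines)

history = [
    MemoryMessage("user", "I moved to Berlin last month.", datetime(2024, 3, 2)),
    MemoryMessage("user", "Actually, the move got delayed to May.", datetime(2024, 4, 10)),
]
print(format_memory(history))
# The later message supersedes the earlier one; explicit timestamps let the model
# resolve such knowledge updates instead of guessing from message order alone.
```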

Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith

Traditional benchmarks and leaderboards are insufficient for production AI. This summary details a practical, multi-layered evaluation strategy that moves from foundational system performance through factual accuracy to safety and bias, using the open-source tools GuideLLM, lm-eval-harness, and Promptfoo.
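
For the factual-accuracy layer, lm-eval-harness exposes a Python entry point alongside its CLI. A minimal sketch, assuming a locally downloadable Hugging Face model; the model and task names are placeholders, not ones taken from the workshop.

```python
# Minimal lm-eval-harness run from Python; model and task names are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                        # Hugging Face backend
    model_args="pretrained=mistralai/Mistral-7B-Instruct-v0.2",
    tasks=["truthfulqa_mc2", "mmlu"],                  # factual-accuracy style tasks
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```

GuideLLM, by contrast, targets the foundational system-performance layer (throughput and latency of the serving stack), which is why the strategy places it first, before accuracy and safety checks.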