Error Analysis

Apr 10, 2026

Judge the Judge: Building LLM Evaluators That Actually Work with GEPA — Mahmoud Mabrouk, Agenta AI

This workshop by Mahmoud Mabrouk, CEO of Agenta AI, covers how to build calibrated LLM-as-a-judge evaluations that reliably align with human judgment. It shows how miscalibrated judges create false confidence and presents a practical workflow: designing use-case-specific metrics, annotating data in detail, and optimizing judge prompts with the GEPA algorithm. The talk emphasizes the importance of iterative debugging, model selection, and custom reflection templates for achieving trustworthy, effective LLM evaluations.
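
As a rough illustration of what "aligning with human judgment" can mean in practice, the sketch below compares a judge's verdicts against human annotations using raw agreement and Cohen's kappa. It is not taken from the workshop: the function name `agreement_and_kappa` and the pass/fail labels are assumptions made for the example, and the talk may measure calibration differently.

```python
from collections import Counter
from typing import Sequence, Tuple


def agreement_and_kappa(human: Sequence[str], judge: Sequence[str]) -> Tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two label sequences.

    Hypothetical helper for illustration; labels can be any hashable values.
    """
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(human)
    # Fraction of items where the judge and the human gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return observed, float("nan")

    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Made-up annotations, for illustration only.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "fail", "fail", "fail", "pass", "pass"]

observed, kappa = agreement_and_kappa(human_labels, judge_labels)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

Kappa corrects raw agreement for chance, which matters when one label dominates: a judge that always answers "pass" can score high agreement on a mostly-passing dataset while carrying no signal.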

Aug 13, 2025

EDD: The Science of Improving AI Agents // Shahul Elavakkattil Shereef // Agents in Production 2025

This talk introduces Eval-Driven Development (EDD) as a scientific alternative to 'vibe-based' iteration for improving AI agents. It covers quantitative evaluation (choosing strong end-to-end metrics, aligning LLM judges) and qualitative evaluation (using error and attribution analysis to debug failures), providing a structured framework for consistent agent improvement.