LLM Evaluation

Aug 20, 2025

Evaluation-Driven Development with MLflow 3.0

Yuki Watanabe from Databricks introduces Evaluation-Driven Development (EDD) as a critical methodology for building production-ready AI agents. This talk explores the five pillars of EDD and demonstrates how MLflow 3.0's new features—including one-line tracing, automated evaluation, human-in-the-loop feedback, and monitoring—provide a comprehensive toolkit to ensure agent quality and reliability.

Aug 19, 2025

Why Language Models Need a Lesson in Education

Stephanie Kirmer, a staff machine learning engineer at DataGrail, draws on her experience as a former professor to address the challenge of evaluating LLMs in production. She proposes using LLM-based evaluators guided by rigorous, human-calibrated rubrics to bring objectivity and scalability to the subjective task of assessing text generation quality.

Aug 13, 2025

EDD: The Science of Improving AI Agents // Shahul Elavakkattil Shereef // Agents in Production 2025

This talk introduces Eval-Driven Development (EDD) as a scientific alternative to 'vibe-based' iteration for improving AI agents. It covers quantitative evaluation (choosing strong end-to-end metrics, aligning LLM judges) and qualitative evaluation (using error and attribution analysis to debug failures), providing a structured framework for consistent agent improvement.

Aug 08, 2025

912: In Case You Missed It in July 2025 — with Jon Krohn (@JonKrohnLearns)

A review of five key interviews covering the importance of data-centric AI (DMLR) in specialized fields like law, the challenges of AI benchmarking, strategies for domain-specific model selection using red teaming, the power of AI in predicting human behavior, and the shift towards building causal AI models.

Jul 31, 2025

From Self-driving to Autonomous Voice Agents — Brooke Hopkins, Coval

Brooke Hopkins, founder of Coval, draws on her experience at Waymo to discuss how evaluation methodologies from the autonomous vehicle industry can be adapted to build reliable, scalable, and trustworthy voice and conversational AI systems.

Jul 29, 2025

[Full Workshop] Building Metrics that actually work — David Karam, Pi Labs (fmr Google Search)

This workshop, led by former Google product directors, introduces a methodology for building reliable and tunable evaluation metrics for LLM applications. It details how to create granular 'scoring systems' that break down complex evaluations into simple, objective signals, and then use these systems for model comparison, prompt optimization, and online reinforcement learning.
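
The idea of a scoring system built from simple, objective signals can be pictured with a minimal Python sketch. This is not taken from the workshop: the `Signal` class, the `score` function, the example checks, and the weights are all hypothetical, chosen only to show how a complex judgment might be split into small pass/fail checks and recombined with tunable weights.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Signal:
    """One simple, objective pass/fail check on a response (hypothetical)."""
    name: str
    check: Callable[[str], bool]
    weight: float = 1.0  # tunable: raise it to make this signal count more


def score(response: str, signals: list[Signal]) -> float:
    """Return the weighted fraction of signals the response passes (0.0 to 1.0)."""
    total = sum(s.weight for s in signals)
    if total <= 0:
        raise ValueError("signals must have a positive total weight")
    passed = sum(s.weight for s in signals if s.check(response))
    return passed / total


# Illustrative signals for a customer-support reply (assumed, not from the talk).
SIGNALS = [
    Signal("non_empty", lambda r: bool(r.strip())),
    Signal("under_50_words", lambda r: len(r.split()) <= 50),
    Signal("mentions_refund", lambda r: "refund" in r.lower(), weight=2.0),
]

if __name__ == "__main__":
    for reply in ["You will receive a refund within 5 days.", "Thanks for reaching out."]:
        print(f"{score(reply, SIGNALS):.2f}  {reply}")
```

Averaging `score` over the same prompts for two models or two prompt variants gives a simple basis for comparison, and adjusting the weights is one way such a system could be tuned.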
