LLM evaluation

The 100-person lab that became Anthropic and Google's secret weapon | Edwin Chen (Surge AI)

Edwin Chen, founder and CEO of Surge AI, discusses his contrarian approach to building a bootstrapped, billion-dollar company, the critical role of high-quality data and 'taste' in training AI, the flaws in current benchmarks, and why reinforcement learning environments are the next frontier for creating models that truly advance humanity.

Big updates to MLflow 3.0

Databricks’ Eric Peter and Corey Zumar introduce MLflow 3.0, focusing on its new "Agentic Insights" capabilities. They demonstrate how MLflow is evolving from providing tools for manual quality assurance in Generative AI to using intelligent agents to automatically find, diagnose, and prioritize issues, significantly speeding up the development lifecycle.

Evaluating the Cultural Relevance of AI Models and Products: Insights from the YUX Team

Drawing from their work fine-tuning an ASR model in Wolof and building a stereotype detection dataset, researchers from YUX share a practical toolbox for evaluating the cultural relevance of AI models and products. The session covers methods for data collection, model benchmarking, user testing, and introduces LOOKA, a platform for scalable human evaluation in the African context.
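To ground the benchmarking step, the snippet below sketches a word error rate (WER) check of the kind used to compare ASR outputs against reference transcripts; the `jiwer` library and the toy transcripts are assumptions for illustration, not materials from the session.

```python
# Minimal sketch: scoring an ASR model with word error rate (WER).
# The jiwer library and the placeholder transcripts are illustrative assumptions.
from jiwer import wer

# Placeholder reference transcripts (ground truth) and model hypotheses.
references = [
    "salaam aleekum nanga def",
    "maa ngi fi rekk jerejef",
]
hypotheses = [
    "salaam aleekum nanga def",
    "maa ngi fi jerejef",
]

# jiwer.wer accepts lists of reference/hypothesis strings and returns an aggregate WER.
error_rate = wer(references, hypotheses)
print(f"Word error rate: {error_rate:.2%}")
```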

Evaluation-Driven Development with MLflow 3.0

Yuki Watanabe from Databricks introduces Evaluation-Driven Development (EDD) as a critical methodology for building production-ready AI agents. This talk explores the five pillars of EDD and demonstrates how MLflow 3.0's new features—including one-line tracing, automated evaluation, human-in-the-loop feedback, and monitoring—provide a comprehensive toolkit to ensure agent quality and reliability.
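As a concrete illustration of the tracing pillar, the sketch below shows roughly what one-line tracing looks like with MLflow's autologging; the toy agent, model name, and prompt are placeholders, and the evaluation, feedback, and monitoring features demonstrated in the talk are not covered here.

```python
# Minimal sketch of MLflow tracing around a toy agent; the agent logic and
# prompt are placeholders, not code from the talk.
import mlflow
import openai

# One line enables automatic tracing of OpenAI calls made inside the agent.
mlflow.openai.autolog()

@mlflow.trace  # Also capture the agent function itself as a span.
def answer_question(question: str) -> str:
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(answer_question("What is evaluation-driven development?"))
```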

Why Language Models Need a Lesson in Education

Stephanie Kirmer, a staff machine learning engineer at DataGrail, draws on her experience as a former professor to address the challenge of evaluating LLMs in production. She proposes a robust methodology using LLM-based evaluators guided by rigorous, human-calibrated rubrics to bring objectivity and scalability to the subjective task of assessing text generation quality.
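To make the proposal concrete, a rubric-guided LLM judge can be sketched in a few lines; the rubric wording, 1-5 scale, model name, and JSON output format below are illustrative assumptions, not the rubric presented in the talk.

```python
# Illustrative sketch of a rubric-guided LLM judge; the rubric text, model name,
# and JSON scoring format are assumptions, not taken from the talk.
import json
from openai import OpenAI

RUBRIC = """Score the answer from 1 to 5:
5 = fully correct, directly addresses the question, no unsupported claims
3 = partially correct or partially relevant
1 = incorrect, off-topic, or fabricated
Return JSON: {"score": <int>, "reason": "<one sentence>"}"""

client = OpenAI()

def judge(question: str, answer: str) -> dict:
    """Ask an LLM to grade an answer against a fixed, human-calibrated rubric."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

# Example usage
print(judge("What is MLflow?", "MLflow is an open-source platform for the ML lifecycle."))
```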

EDD: The Science of Improving AI Agents // Shahul Elavakkattil Shereef // Agents in Production 2025

This talk introduces Eval-Driven Development (EDD) as a scientific alternative to 'vibe-based' iteration for improving AI agents. It covers quantitative evaluation (choosing strong end-to-end metrics, aligning LLM judges) and qualitative evaluation (using error and attribution analysis to debug failures), providing a structured framework for consistent agent improvement.
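One common way to check judge alignment, in the spirit of the quantitative track described above, is to measure agreement between the judge's verdicts and a small set of human labels; the sketch below uses made-up pass/fail labels and scikit-learn's Cohen's kappa as one possible agreement metric.

```python
# Sketch of judge-human agreement measurement; the labels are fabricated examples
# and Cohen's kappa is one of several reasonable agreement metrics.
from sklearn.metrics import cohen_kappa_score

# Binary pass/fail verdicts from humans and from the LLM judge on the same 10 outputs.
human_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge_labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]

kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"Judge-human agreement (Cohen's kappa): {kappa:.2f}")
# A low kappa suggests the judge prompt or rubric needs realignment before its
# scores can be trusted as an end-to-end metric.
```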