LLM evaluation

Full Workshop: Build Your Own Deep Research Agents - Louis-François Bouchard, Paul Iusztin, Samridhi

This hands-on workshop details the construction of a sophisticated, dual-part AI system for producing high-quality technical content. It begins with an MCP-powered deep research agent that autonomously plans, searches the web, and analyzes sources like YouTube to synthesize a grounded research artifact. The second part is a constrained, deterministic writing workflow that transforms this research into polished, non-sloppy content using an innovative "Evaluator-Optimizer" pattern for iterative refinement. The session emphasizes crucial AI engineering principles, such as choosing between agentic and workflow-based architectures, and concludes with a deep dive into implementing practical observability and evaluation pipelines to ensure the system is both measurable and improvable.
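
As a rough illustration of the "Evaluator-Optimizer" pattern mentioned above, the sketch below alternates between an evaluator that flags problems and an optimizer that revises the draft from that feedback, stopping once the draft passes or a round limit is reached. It is not the workshop's implementation: `STYLE_RULES`, `evaluate`, `optimize`, and `refine` are hypothetical names, and both roles are rule-based stand-ins for what would presumably be model calls.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical style guide (an assumption): phrase to avoid -> preferred replacement.
STYLE_RULES: Dict[str, str] = {"delve into": "examine", "utilize": "use"}


@dataclass
class Review:
    passed: bool
    violations: List[str]  # phrases from STYLE_RULES found in the draft


def evaluate(draft: str) -> Review:
    """Evaluator stand-in: a real system would likely prompt a model here."""
    violations = [
        phrase for phrase in STYLE_RULES
        if re.search(re.escape(phrase), draft, flags=re.IGNORECASE)
    ]
    return Review(passed=not violations, violations=violations)


def optimize(draft: str, review: Review) -> str:
    """Optimizer stand-in: revises the draft using only the evaluator's feedback."""
    for phrase in review.violations:
        draft = re.sub(re.escape(phrase), STYLE_RULES[phrase], draft, flags=re.IGNORECASE)
    return draft


def refine(draft: str, max_rounds: int = 3) -> Tuple[str, int]:
    """Alternate evaluation and revision; return the draft and the revisions made."""
    for revisions in range(max_rounds):
        review = evaluate(draft)
        if review.passed:
            return draft, revisions
        draft = optimize(draft, review)
    return draft, max_rounds


final_draft, revisions = refine("Agents delve into sources and utilize citations.")
print(final_draft, revisions)
```

The round limit keeps the loop bounded even if the evaluator never passes the draft, which is one reason this shape suits a constrained workflow rather than an open-ended agent.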

Shipping AI That Works: An Evaluation Framework for PMs – Aman Khan, Arize

This talk provides a practical framework for product managers to move beyond simple "vibe checks" to implement rigorous, data-driven evaluation for LLM-powered products. Using a live demo of a multi-agent AI trip planner, the speaker breaks down essential methodologies, including human feedback, code-based checks, and LLM-as-a-judge systems, and demonstrates how to iterate on both prompts and the evals themselves to ensure consistent quality and build user trust.
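
To make two of the evaluation methods named above concrete, here is a minimal sketch of a code-based check and an LLM-as-a-judge wrapper for a trip-planner response. Everything in it is an assumption for illustration: the itinerary shape, the `within_budget` and `judge_label` helpers, the prompt wording, and the `friendly`/`robotic` labels are hypothetical, and the judge is passed in as a plain callable so that no particular model API is implied.

```python
from typing import Callable, Dict, List

# Hypothetical judge prompt; the labels are illustrative, not from the talk.
JUDGE_PROMPT = (
    "You are reviewing a trip itinerary written for a traveler.\n"
    "Answer with exactly one word, friendly or robotic.\n\n"
    "Itinerary:\n{response}"
)

ALLOWED_LABELS = {"friendly", "robotic"}


def within_budget(itinerary: List[Dict[str, float]], budget: float) -> bool:
    """Code-based check: deterministic, cheap, and needs no model call."""
    return sum(item["cost"] for item in itinerary) <= budget


def judge_label(response: str, judge: Callable[[str], str]) -> str:
    """LLM-as-a-judge: send the prompt to `judge` and normalize its answer."""
    raw = judge(JUDGE_PROMPT.format(response=response)).strip().lower()
    # Guard against free-form answers so downstream metrics stay well defined.
    return raw if raw in ALLOWED_LABELS else "unparseable"


itinerary = [{"cost": 120.0}, {"cost": 80.0}]
print(within_budget(itinerary, budget=250.0))
print(judge_label("Day 1: stroll the old town...", judge=lambda prompt: " Friendly\n"))
```

Comparing the judge's labels against a small set of human-labeled examples is one way to iterate on the eval itself, since a judge prompt can be wrong in the same way a product prompt can.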

How Intelligent Is AI, Really?

Greg Kamradt of the ARC Prize Foundation explains how the ARC-AGI benchmark is shifting the focus of AI evaluation from memorization to true intelligence, defined as the ability to generalize and learn new skills efficiently. He discusses the history of ARC-AGI, how it revealed the limits of early LLMs and highlighted the recent "reasoning breakthrough," and details the upcoming interactive ARC-AGI v3, which will measure AI performance against a human baseline with zero instructions.

The 100-person lab that became Anthropic and Google's secret weapon | Edwin Chen (Surge AI)

Edwin Chen, founder and CEO of Surge AI, discusses his contrarian approach to building a bootstrapped, billion-dollar company, the critical role of high-quality data and 'taste' in training AI, the flaws in current benchmarks, and why reinforcement learning environments are the next frontier for creating models that truly advance humanity.

Big updates to MLflow 3.0

Databricks’ Eric Peter and Corey Zumar introduce MLflow 3.0, focusing on its new "Agentic Insights" capabilities. They demonstrate how MLflow is evolving from providing tools for manual quality assurance in Generative AI to using intelligent agents to automatically find, diagnose, and prioritize issues, significantly speeding up the development lifecycle.

Evaluating the Cultural Relevance of AI Models and Products: Insights from the YUX Team

Drawing from their work fine-tuning an ASR model in Wolof and building a stereotype detection dataset, researchers from YUX share a practical toolbox for evaluating the cultural relevance of AI models and products. The session covers methods for data collection, model benchmarking, user testing, and introduces LOOKA, a platform for scalable human evaluation in the African context.