Agent Evaluation

Dec 26, 2025

Shipping AI That Works: An Evaluation Framework for PMs – Aman Khan, Arize

This talk gives product managers a practical framework for moving beyond simple "vibe checks" to rigorous, data-driven evaluation of LLM-powered products. Using a live demo of a multi-agent AI trip planner, the speaker breaks down the core evaluation methods, including human feedback, code-based checks, and LLM-as-a-judge systems, and demonstrates how to iterate on both the prompts and the evals themselves to ensure consistent quality and build user trust.
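To make the three evaluation styles concrete, here is a minimal Python sketch for a trip-planner output. It is an illustration under stated assumptions, not code from the talk: the `TripPlan` fields, the specific checks, the "friendly"/"robotic" tone labels, and the `call_model` hook are all hypothetical names chosen for this example. The judge takes any function that sends a prompt to a model and returns text, so no particular LLM provider is assumed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TripPlan:
    """Hypothetical output of a trip-planner agent (fields are assumptions)."""
    destination: str
    budget_usd: float
    itinerary: str
    estimated_cost_usd: float


def code_based_check(plan: TripPlan) -> dict[str, bool]:
    """Deterministic checks that need no model call."""
    return {
        "within_budget": plan.estimated_cost_usd <= plan.budget_usd,
        "mentions_destination": plan.destination.lower() in plan.itinerary.lower(),
        "non_empty": bool(plan.itinerary.strip()),
    }


# Hypothetical judge prompt; the tone labels are illustrative only.
JUDGE_PROMPT = (
    "You are evaluating a trip itinerary.\n"
    "Destination: {destination}\n"
    "Itinerary: {itinerary}\n"
    "Answer with exactly one word, 'friendly' or 'robotic', describing the tone."
)


def llm_judge(plan: TripPlan, call_model: Callable[[str], str]) -> str:
    """LLM-as-a-judge: ask a model for a label and normalize its answer.

    call_model is an assumed hook: any function that takes a prompt
    string and returns the model's text response.
    """
    prompt = JUDGE_PROMPT.format(
        destination=plan.destination, itinerary=plan.itinerary
    )
    label = call_model(prompt).strip().lower()
    # Anything outside the allowed labels is flagged rather than guessed at.
    return label if label in {"friendly", "robotic"} else "unparseable"


def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of examples where the judge matches the human label.

    Human feedback serves as ground truth for evaluating the eval itself.
    """
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

One way to use this: run `code_based_check` on every output because it is cheap, run `llm_judge` for qualities such as tone that code cannot measure, and periodically compare judge labels against a small human-labeled set with `agreement_rate`. A low agreement rate suggests the judge prompt needs iteration, not just the agent prompt.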
