LLM evaluation

912: In Case You Missed It in July 2025 — with Jon Krohn (@JonKrohnLearns)

A review of five key interviews covering the importance of data-centric AI (DMLR) in specialized fields like law, the challenges of AI benchmarking, strategies for domain-specific model selection using red teaming, the power of AI in predicting human behavior, and the shift towards building causal AI models.

From Self-driving to Autonomous Voice Agents — Brooke Hopkins, Coval

Brooke Hopkins, founder of Coval, discusses how evaluation methodologies from the autonomous vehicle industry, particularly from her experience at Waymo, can be adapted to build reliable, scalable, and trustworthy voice and conversational AI systems.
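
As a rough illustration of the scenario-style testing that carries over from simulation-heavy autonomous-vehicle workflows, the sketch below scripts user turns against an agent and checks a pass criterion per scenario. The Scenario dataclass, the stub agent, and the pass-rate aggregation are assumptions made for illustration only; this is not Coval's actual API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical scenario-based harness: each scenario replays a scripted user
# against the agent and checks a success criterion, loosely mirroring how AV
# teams replay driving scenarios in simulation. Not Coval's actual API.

@dataclass
class Scenario:
    name: str
    user_turns: list[str]                      # scripted caller utterances
    success: Callable[[list[str]], bool]       # predicate over agent replies

def run_scenario(agent: Callable[[str], str], scenario: Scenario) -> bool:
    replies = [agent(turn) for turn in scenario.user_turns]
    return scenario.success(replies)

def evaluate(agent: Callable[[str], str], scenarios: list[Scenario]) -> float:
    passed = sum(run_scenario(agent, s) for s in scenarios)
    return passed / len(scenarios)             # pass rate across the suite

if __name__ == "__main__":
    booking = Scenario(
        name="book_appointment",
        user_turns=["Hi, I need a haircut on Friday.", "Morning works."],
        success=lambda replies: any("confirmed" in r.lower() for r in replies),
    )
    # A stub agent stands in for the real voice system under test.
    stub_agent = lambda utterance: "Your appointment is confirmed for Friday 10am."
    print(f"pass rate: {evaluate(stub_agent, [booking]):.0%}")
```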

[Full Workshop] Building Metrics that actually work — David Karam, Pi Labs (fmr Google Search)

This workshop, led by former Google product directors, introduces a methodology for building reliable and tunable evaluation metrics for LLM applications. It details how to create granular 'scoring systems' that break down complex evaluations into simple, objective signals, and then use these systems for model comparison, prompt optimization, and online reinforcement learning.
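
As a rough sketch of that decomposition idea, and assuming nothing about Pi Labs' actual implementation, the toy scorer below splits a fuzzy judgment ("is this a good summary?") into a few simple, objective signals and combines them with weights so that outputs, prompts, or model checkpoints can be ranked on the same scale. The signal functions, weights, and required terms are invented for illustration.

```python
import re

# Toy scoring system: each signal returns a score in [0, 1] for one simple,
# objective property, and the overall score is a weighted combination.
# Signals and weights here are illustrative, not Pi Labs' actual metrics.

def within_length(text: str, max_words: int = 60) -> float:
    return 1.0 if len(text.split()) <= max_words else 0.0

def no_first_person(text: str) -> float:
    return 0.0 if re.search(r"\b(I|we|my|our)\b", text) else 1.0

def mentions_required_terms(text: str, terms: tuple[str, ...] = ("revenue", "Q3")) -> float:
    hits = sum(term.lower() in text.lower() for term in terms)
    return hits / len(terms)

SIGNALS = [
    (within_length, 0.3),
    (no_first_person, 0.2),
    (mentions_required_terms, 0.5),
]

def score(text: str) -> float:
    """Weighted combination of the simple signals."""
    return sum(weight * signal(text) for signal, weight in SIGNALS)

if __name__ == "__main__":
    candidate_a = "Q3 revenue grew 12% year over year, driven by enterprise deals."
    candidate_b = "I think the quarter went pretty well overall for the company."
    # Higher score wins; the same scorer can compare models or prompt variants.
    print(f"A: {score(candidate_a):.2f}  B: {score(candidate_b):.2f}")
```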

No Priors Ep. 124 | With SurgeAI Founder and CEO Edwin Chen

Edwin Chen, CEO of Surge AI, discusses the critical role of high-quality human data in training frontier models, the flaws in current evaluation benchmarks like LMSys and IFEval, the future of complex RL environments, and why he bootstrapped Surge to over $1 billion in revenue.