Evaluation

Feb 22, 2026

How AI covered a human’s paternity leave // Quinten Rosseel

A practitioner's guide to deploying a text-to-SQL agent in a real-world business environment. The talk covers the critical lessons learned in moving from concept to production, focusing on the importance of the communication channel (Slack), why a semantic layer matters more than benchmark scores, and a pragmatic approach to system architecture, testing, and evaluation.

Jan 12, 2026

Multi-Agent Systems for the Misinformation Lifecycle

A detailed overview of a modular, five-agent system designed to combat digital misinformation across its entire lifecycle. Based on an ICWSM research paper, this practitioner's guide details the roles of the Classifier, Indexer, Extractor, Corrector, and Verifier agents. The system emphasizes scalability, explainability, and high precision, moving beyond the limitations of single-LLM solutions. The talk covers the complete blueprint, from agent coordination and MLOps to holistic evaluation and optimization strategies for production environments.

Dec 10, 2025

How We Built a Leading Reasoning Model (Olmo 3)

A comprehensive overview of the process behind building Olmo 3 Think, covering the full stack from pre-training architecture and data selection to the detailed post-training recipe involving SFT and DPO, with a deep dive into the infrastructure for scaling Reinforcement Learning (RL). The summary also includes critical reflections on the challenges and nuances of evaluating modern reasoning models.

Nov 18, 2025

I’m Teaching AI Self-Improvement Techniques

Aman Khan from Arize discusses the challenges of building reliable AI agents and introduces a novel technique called "metaprompting". This method uses continuous, natural language feedback to optimize an agent's system prompt, effectively training its "memory" or context, leading to significant performance gains even for smaller models.

Oct 28, 2025

From Idea to $650M Exit: Lessons in Building AI Startups

In a talk at AI Startup School, Casetext co-founder Jake Heller breaks down how he built and sold his AI legal assistant, CoCounsel, for $650 million. He provides a practical framework for founders on identifying valuable AI business ideas, building reliable products that go beyond simple demos, and creating a go-to-market strategy centered on trust and product quality.

Oct 10, 2025

Evals Aren't Useful? Really?

A deep dive into the critical importance of robust evaluation for building reliable AI agents. The summary covers bootstrapping evaluation sets, advanced testing techniques like multi-turn simulations and red teaming, and the necessity of integrating traditional software engineering and MLOps practices into the agent development lifecycle.