Evaluation

Evals in Action: From Frontier Research to Production Applications

An overview of OpenAI's approach to AI evaluation, covering the GDPval benchmark for frontier models and the practical tools available to developers for evaluating their own custom agents and applications.

Evaluating AI Agents: Why It Matters and How We Do It

Annie Condon and Jeff Groom from Acre Security detail their practical approach to robustly evaluating non-deterministic AI agents. They share their philosophy that evaluations are critical for quality, introduce their "X-ray machine" analogy for observability, and walk through their evaluation stack, including versioning strategies and the use of tools like Logfire for tracing and Confident AI's DeepEval for systematic metric tracking.
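
As a rough sketch of the kind of stack described here (not the speakers' actual code), the snippet below wraps an agent call in a Logfire span and scores the output with a DeepEval metric. The agent function, question, and threshold are made up for illustration, the exact Logfire/DeepEval APIs may differ across versions, and the relevancy metric itself calls an LLM judge, so it needs model credentials configured.

```python
import logfire
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

logfire.configure()  # send traces to the Logfire backend

def run_agent(question: str) -> str:
    # Placeholder for the real (non-deterministic) agent call.
    return "Badge readers sync with the controller every five minutes."

question = "How often do badge readers sync?"
with logfire.span("agent run", question=question):
    answer = run_agent(question)

# Score the run with a DeepEval metric so quality can be tracked over time.
test_case = LLMTestCase(input=question, actual_output=answer)
metric = AnswerRelevancyMetric(threshold=0.7)  # uses an LLM judge internally
metric.measure(test_case)

logfire.info("eval result", metric="answer_relevancy", score=metric.score)
```

Logging the score back alongside the trace is what gives the "X-ray machine" view: each production run carries both its execution trace and its quality signal.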

Production monitoring for AI applications using W&B Weave

Learn how W&B Weave's online evaluations enable real-time monitoring of AI applications in production, allowing teams to track performance, catch failures, and iterate on quality over time using LLM-as-a-judge scores.
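
As a minimal sketch of what this can look like in code (the project name, model, and judge prompt are placeholders, not Weave's or the talk's actual setup), the snippet below traces a production call with `@weave.op` and attaches an LLM-as-a-judge score as a second traced call; wiring the scores into an online evaluation or monitor then happens on the Weave side.

```python
import weave
from openai import OpenAI

weave.init("my-team/support-bot")  # hypothetical Weave project

client = OpenAI()

@weave.op()
def answer_question(question: str) -> str:
    # Production call: Weave records inputs, outputs, latency, and errors.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

@weave.op()
def judge_answer(question: str, answer: str) -> int:
    # LLM-as-a-judge: grade the production answer on a 1-5 scale.
    prompt = (
        "Rate from 1 to 5 how well the answer addresses the question. "
        "Reply with a single digit.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(resp.choices[0].message.content.strip())

answer = answer_question("How do I rotate my API key?")
score = judge_answer("How do I rotate my API key?", answer)
```

Because the judge is itself a traced op, its scores land next to the calls they grade, which is what lets teams watch quality drift over time.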

Beyond the Chatbot: What Actually Works in Enterprise AI

Jay Alammar, Director at Cohere, discusses the practical adoption of Large Language Models in the enterprise. He covers the evolution of Retrieval-Augmented Generation (RAG) from a simple anti-hallucination tool to complex, agentic systems, the critical role of evaluation as intellectual property, and future trends like text diffusion and the increasing capability of smaller models for specialized business tasks.

Context Engineering: Lessons Learned from Scaling CoCounsel

Jake Heller, founder of Casetext, shares a pragmatic framework for turning powerful large language models like GPT-4 into reliable, professional-grade products. He details a rigorous, evaluation-driven approach to prompt and context engineering, emphasizing iterative testing, the critical role of high-quality context, and advanced techniques like reinforcement fine-tuning and strategic model selection.
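
To make the evaluation-driven loop concrete, here is a minimal, hypothetical harness in the same spirit (not CoCounsel's actual tests): a small golden set of questions paired with a phrase a correct answer must contain, run against competing prompt variants, with the pass rate tracked as the prompt and context evolve.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical golden set: each question pairs with a phrase a correct answer must contain.
GOLDEN_SET = [
    {"question": "Under UCC Article 2, how long is the statute of limitations "
                 "for breach of a sales contract?",
     "must_contain": "four years"},
    {"question": "What kinds of events does a force majeure clause typically cover?",
     "must_contain": "beyond the parties' control"},
]

PROMPT_VARIANTS = {
    "v1_terse": "Answer the legal question concisely.",
    "v2_grounded": "Answer the legal question concisely and state the controlling rule.",
}

def run(system_prompt: str, question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

# Score every variant on every case; the pass rate is the regression signal
# watched as prompts, context, and models change.
for name, prompt in PROMPT_VARIANTS.items():
    passed = sum(
        case["must_contain"] in run(prompt, case["question"]).lower()
        for case in GOLDEN_SET
    )
    print(f"{name}: {passed}/{len(GOLDEN_SET)} cases passed")
```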

Iterating on Your AI Evals // Mariana Prazeres // Agents in Production 2025

Moving an AI agent from a promising demo to a reliable product is challenging. This talk presents a startup-friendly, iterative process for building robust evaluation frameworks, emphasizing that you must iterate on the evaluations themselves—the metrics and the data—not just the prompts and models. It outlines a practical "crawl, walk, run" approach, starting with simple heuristics and scaling to an advanced system with automated checks and human-in-the-loop validation.
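
A hedged sketch of the "crawl" stage described above, not code from the talk: cheap, deterministic heuristics on agent output, with anything that fails a check routed to a human-in-the-loop review queue instead of being judged automatically. The specific checks and thresholds are illustrative.

```python
import json

def heuristic_checks(output: str) -> dict:
    """Crawl stage: cheap, deterministic checks that need no LLM judge."""
    return {
        "non_empty": bool(output.strip()),
        "not_too_long": len(output) < 2000,
        "no_apology_loop": output.lower().count("i apologize") < 2,
        "valid_json": _is_json(output),
    }

def _is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def triage(output: str, human_queue: list) -> bool:
    """Walk stage: anything that fails a check goes to human review."""
    checks = heuristic_checks(output)
    if not all(checks.values()):
        human_queue.append(
            {"output": output, "failed": [name for name, ok in checks.items() if not ok]}
        )
        return False
    return True

review_queue: list = []
triage('{"status": "ok", "answer": "Escalated to tier 2 support."}', review_queue)
triage("I apologize. I apologize. I apologize.", review_queue)
print(f"{len(review_queue)} output(s) flagged for human review")
```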