Evaluation

Production monitoring for AI applications using W&B Weave

Learn how W&B Weave's online evaluations enable real-time monitoring of AI applications in production, allowing teams to track performance, catch failures, and iterate on quality over time using LLM-as-a-judge scores.

Beyond the Chatbot: What Actually Works in Enterprise AI

Jay Alammar, Director at Cohere, discusses the practical adoption of Large Language Models in the enterprise. He covers the evolution of Retrieval-Augmented Generation (RAG) from a simple anti-hallucination tool to complex, agentic systems, the critical role of evaluation as intellectual property, and future trends like text diffusion and the increasing capability of smaller models for specialized business tasks.

Context Engineering: Lessons Learned from Scaling CoCounsel

Jake Heller, founder of Casetext, shares a pragmatic framework for turning powerful large language models like GPT-4 into reliable, professional-grade products. He details a rigorous, evaluation-driven approach to prompt and context engineering, emphasizing iterative testing, the critical role of high-quality context, and advanced techniques like reinforcement fine-tuning and strategic model selection.

Iterating on Your AI Evals // Mariana Prazeres // Agents in Production 2025

Moving an AI agent from a promising demo to a reliable product is challenging. This talk presents a startup-friendly, iterative process for building robust evaluation frameworks, emphasizing that you must iterate on the evaluations themselves—the metrics and the data—not just the prompts and models. It outlines a practical "crawl, walk, run" approach, starting with simple heuristics and scaling to an advanced system with automated checks and human-in-the-loop validation.

The Truth About LLM Training

Paul van der Boor and Zulkuf Genc from Prosus discuss the practical realities of deploying AI agents in production. They cover their in-house evaluation framework, strategies for navigating the GPU market, the importance of fine-tuning over building from scratch, and how they use AI to analyze usage patterns in a privacy-preserving manner.

How to look at your data — Jeff Huber (Chroma) + Jason Liu (567)

A detailed summary of a talk by Jeff Huber (Chroma) and Jason Liu on systematically improving AI applications. The talk covers using fast, inexpensive evaluations for retrieval systems (inputs) and applying structured data analysis and clustering to conversational logs (outputs) to derive actionable product insights.