Evaluation

I’m Teaching AI Self-Improvement Techniques

Aman Khan from Arize discusses the challenges of building reliable AI agents and introduces a technique called "metaprompting". This method uses continuous natural-language feedback to optimize an agent's system prompt, effectively training its "memory" or context, and yields significant performance gains even for smaller models.
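
Below is a minimal sketch of the kind of metaprompting loop described here, not Arize's actual implementation: a hypothetical `call_model` helper stands in for whichever LLM SDK you use, and the critique and rewrite prompts are purely illustrative.

```python
# Minimal metaprompting loop sketch (illustrative, not Arize's implementation).
# Assumes a hypothetical call_model(system, user) helper that wraps your LLM API
# and returns the model's text response.

def call_model(system: str, user: str) -> str:
    """Placeholder for an actual LLM call (e.g., via an OpenAI or Anthropic client)."""
    raise NotImplementedError

def metaprompt_step(system_prompt: str, examples: list[dict]) -> str:
    """Run the agent on labeled examples, collect natural-language feedback,
    then ask a model to rewrite the system prompt using that feedback."""
    feedback_notes = []
    for ex in examples:
        output = call_model(system_prompt, ex["input"])
        critique = call_model(
            "You are a strict evaluator. Explain in plain language what, if anything, "
            "the response got wrong relative to the expected answer.",
            f"Input: {ex['input']}\nExpected: {ex['expected']}\nResponse: {output}",
        )
        feedback_notes.append(critique)

    # Fold the accumulated feedback back into the system prompt ("training" the context).
    return call_model(
        "You improve system prompts. Given the current prompt and feedback from example "
        "runs, return a revised prompt that addresses the feedback. Keep it concise.",
        f"Current prompt:\n{system_prompt}\n\nFeedback:\n" + "\n".join(feedback_notes),
    )

# Usage: iterate a few rounds over a small labeled set.
# prompt = "You are a helpful support agent."
# for _ in range(3):
#     prompt = metaprompt_step(prompt, labeled_examples)
```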

From Idea to $650M Exit: Lessons in Building AI Startups

In a talk at AI Startup School, Casetext co-founder Jake Heller breaks down how he built and sold his AI legal assistant, CoCounsel, for $650 million. He provides a practical framework for founders on identifying valuable AI business ideas, building reliable products that go beyond simple demos, and creating a go-to-market strategy centered on trust and product quality.

Evals Aren't Useful? Really?

A deep dive into the critical importance of robust evaluation for building reliable AI agents. The summary covers bootstrapping evaluation sets, advanced testing techniques like multi-turn simulations and red teaming, and the necessity of integrating traditional software engineering and MLOps practices into the agent development lifecycle.
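
To make the multi-turn simulation and LLM-as-a-judge ideas concrete, here is a minimal sketch; `llm` and `agent_respond` are hypothetical stand-ins for an LLM client and the agent under test, and the prompts are placeholders rather than any specific team's setup.

```python
# Sketch of a multi-turn simulation eval with an LLM judge (illustrative only).

def llm(system: str, user: str) -> str:
    """Placeholder for an LLM call used by the simulated user and the judge."""
    raise NotImplementedError

def agent_respond(history: list[dict]) -> str:
    """Placeholder for the agent under test."""
    raise NotImplementedError

def simulate_conversation(persona: str, goal: str, max_turns: int = 5) -> list[dict]:
    """Drive the agent with an LLM-simulated user pursuing a goal (e.g., a refund)."""
    history: list[dict] = []
    for _ in range(max_turns):
        user_msg = llm(
            f"Role-play a user: {persona}. Your goal: {goal}. "
            "Reply with only your next message; say DONE if the goal is met.",
            "\n".join(f"{m['role']}: {m['content']}" for m in history) or "(start)",
        )
        if "DONE" in user_msg:
            break
        history.append({"role": "user", "content": user_msg})
        history.append({"role": "assistant", "content": agent_respond(history)})
    return history

def judge(history: list[dict], rubric: str) -> bool:
    """LLM-as-a-judge: grade the whole transcript against a rubric."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    verdict = llm(
        f"Grade the assistant against this rubric: {rubric}. Answer PASS or FAIL.",
        transcript,
    )
    return "PASS" in verdict.upper()
```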

Evals in Action: From Frontier Research to Production Applications

An overview of OpenAI's approach to AI evaluation, covering the GDPval benchmark for frontier models and the practical tools available for developers to evaluate their own custom agents and applications.

Evaluating AI Agents: Why It Matters and How We Do It

Annie Condon and Jeff Groom from Acre Security detail their practical approach to robustly evaluating non-deterministic AI agents. They share their philosophy that evaluations are critical for quality, introduce their "X-ray machine" analogy for observability, and walk through their evaluation stack, including versioning strategies and the use of tools like Logfire for tracing and Confident AI's DeepEval for systematic metric tracking.
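
A minimal sketch of what such a stack can look like in code, assuming recent versions of the logfire and deepeval packages (exact APIs vary by version); the agent function and test question are hypothetical, and this illustrates the pattern rather than Acre Security's implementation.

```python
# Tracing with Logfire plus LLM-judged metrics with DeepEval (illustrative sketch).
import logfire
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

logfire.configure()  # send traces to Logfire (the "X-ray machine")

def run_agent(question: str) -> str:
    """Hypothetical agent under test; each call is recorded as a span."""
    with logfire.span("agent_run", question=question):
        answer = "..."  # call your agent here
        logfire.info("agent answered", answer=answer)
        return answer

# Systematic metric tracking: build test cases from agent runs and score them
# with an LLM-judged metric, tracked in Confident AI.
test_cases = [
    LLMTestCase(
        input="How do I reset my badge?",
        actual_output=run_agent("How do I reset my badge?"),
    )
]
evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric(threshold=0.7)])
```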

Production monitoring for AI applications using W&B Weave

Learn how W&B Weave's online evaluations enable real-time monitoring of AI applications in production, allowing teams to track performance, catch failures, and iterate on quality over time using LLM-as-a-judge scores.
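
The sketch below shows only the client-side piece of this pattern, assuming the weave Python SDK: `weave.init` and `@weave.op` trace production calls, while a stubbed judge op stands in for the LLM-as-a-judge scoring that Weave's online evaluations apply to logged traces. The project name, functions, and scoring logic are illustrative.

```python
# Tracing production calls with W&B Weave and attaching a judge score (sketch).
import weave

weave.init("my-team/support-agent")  # project name is illustrative

@weave.op()
def answer_question(question: str) -> str:
    """Production entrypoint; every call is traced in Weave."""
    return "..."  # call your model or agent here

@weave.op()
def judge_response(question: str, answer: str) -> float:
    """Stub for an LLM-as-a-judge score in [0, 1]; replace with a real judge call."""
    return 1.0 if answer else 0.0

# Each traced call plus its judge score becomes a data point you can monitor over time.
q = "How do I rotate my API key?"
a = answer_question(q)
judge_response(q, a)
```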