Evaluations

Dec 23, 2025

Continual System Prompt Learning for Code Agents – Aparna Dhinakaran, Arize

In this talk, Aparna Dhinakaran introduces "system prompt learning" as an efficient alternative to traditional reinforcement learning for improving LLM-based coding agents. LLM-as-a-judge evaluations produce plain-English feedback explaining why code failed, and that feedback is used to automatically refine the agent's system prompt and rules. The method, demonstrated on Claude and Cline, significantly improves performance on benchmarks such as SWE-bench with minimal data, and highlights how much the results depend on high-quality evaluation prompts.
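The loop described in this summary can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: `prompt_learning_step` and `stub_judge` are hypothetical names, and the judge here is a hard-coded stand-in for a real LLM-as-a-judge call.

```python
from typing import Callable, List, Tuple

# A failure is a (task description, agent output) pair.
Failure = Tuple[str, str]


def prompt_learning_step(
    system_prompt: str,
    failures: List[Failure],
    judge: Callable[[str, str], str],
) -> str:
    """Append judge-written rules to the system prompt (hypothetical helper).

    The judge returns English feedback for a failure, or "" if it has none.
    """
    rules: List[str] = []
    for task, output in failures:
        feedback = judge(task, output)
        # Skip empty feedback and rules that are already present.
        if feedback and feedback not in system_prompt and feedback not in rules:
            rules.append(feedback)
    if not rules:
        return system_prompt
    bullet_list = "\n".join(f"- {rule}" for rule in rules)
    return f"{system_prompt}\n\nRules learned from failures:\n{bullet_list}"


def stub_judge(task: str, output: str) -> str:
    """Assumed stand-in for an LLM judge; a real one would call a model."""
    if "import" not in output:
        return "Always include the imports the code needs."
    return ""


if __name__ == "__main__":
    failures = [("parse a CSV file", "rows = csv.reader(f)")]
    print(prompt_learning_step("You are a coding agent.", failures, stub_judge))
```

In a real setup the quality of the judge's prompt would determine how useful the appended rules are, which is the dependency the summary points to.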

Aug 23, 2025

Five hard earned lessons about Evals — Ankur Goyal, Braintrust

Building successful AI applications requires a sophisticated engineering approach that goes beyond prompt engineering. This involves creating intentionally engineered evaluations (evals) that reflect user feedback, focusing on "context engineering" to optimize tool definitions and outputs, and maintaining a flexible, model-agnostic architecture to adapt to the rapidly evolving AI landscape.

Aug 03, 2025

Practical tactics to build reliable AI apps — Dmitry Kuchin, Multinear

Moving an AI PoC from 50% to 100% reliability requires a new development paradigm. This talk introduces a practical, evaluations-first approach that reverse-engineers tests from real-world user scenarios and business outcomes in order to build a robust benchmark, prevent regressions, and enable confident optimization.
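One way to picture the regression-prevention part of this approach is a benchmark built from user scenarios and compared across two versions of an app. The sketch below is an assumption-laden illustration, not Multinear's tooling: `run_benchmark`, `regressions`, the scenario fields, and the substring check are all hypothetical.

```python
from typing import Callable, Dict, List

# Each scenario is derived from a real user question and the outcome
# the business expects; the field names are assumptions for this sketch.
Scenario = Dict[str, str]


def run_benchmark(app: Callable[[str], str], scenarios: List[Scenario]) -> Dict[str, bool]:
    """Return scenario name -> whether the expected text appears in the answer."""
    return {
        s["name"]: s["expected"].lower() in app(s["input"]).lower()
        for s in scenarios
    }


def regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Names of scenarios that passed before a change and fail after it."""
    return sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )


def app_v1(question: str) -> str:
    """Stand-in for the app before a change."""
    if "refund" in question.lower():
        return "Refunds are available within 30 days."
    return "I don't know."


def app_v2(question: str) -> str:
    """Stand-in for the app after a change that fixes one case and breaks another."""
    if "open" in question.lower():
        return "We open at 9am."
    return "I don't know."


SCENARIOS: List[Scenario] = [
    {"name": "refund", "input": "How do I get a refund?", "expected": "30 days"},
    {"name": "hours", "input": "When are you open?", "expected": "9am"},
]

if __name__ == "__main__":
    before = run_benchmark(app_v1, SCENARIOS)
    after = run_benchmark(app_v2, SCENARIOS)
    print(regressions(before, after))
```

A substring match is the simplest possible check; a production benchmark would likely need richer scoring, but the before/after comparison stays the same.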