Prompt optimization

Judge the Judge: Building LLM Evaluators That Actually Work with GEPA — Mahmoud Mabrouk, Agenta AI

This workshop by Mahmoud Mabrouk, CEO of Agenta AI, covers how to build calibrated LLM-as-a-judge evaluations that reliably align with human judgment. It shows how miscalibrated judges create false confidence and presents a practical workflow: designing use-case-specific metrics, annotating data in detail, and optimizing judge prompts with the GEPA algorithm. The talk emphasizes iterative debugging, model selection, and custom reflection templates as the keys to trustworthy LLM evaluations.

DSPy: The End of Prompt Engineering - Kevin Madura, AlixPartners

An in-depth guide to DSPy, a framework for programming with language models, not just prompting them. Learn its core concepts—Signatures, Modules, Adapters, and Optimizers—and see real-world examples of building robust, testable, and transferable AI applications for the enterprise.

Continual System Prompt Learning for Code Agents – Aparna Dhinakaran, Arize

In this talk, Aparna Dhinakaran introduces "system prompt learning" as an efficient alternative to traditional reinforcement learning for improving LLM-based coding agents. LLM-as-a-judge evaluations generate English feedback and explanations for code failures, which the agents then use to refine their system prompts and rules automatically. The method, demonstrated on Claude Code and Cline, significantly boosts performance on benchmarks like SWE-bench with minimal data, and it depends critically on high-quality evaluation prompts.