Evaluation

Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic

Explore advanced techniques for building long-running AI agents, moving beyond simple loops. Learn why self-evaluation fails and adversarial evaluators succeed, how to manage context with structured handoffs instead of just compaction, and how to use negotiated 'sprint contracts' and detailed rubrics to build and test complex, full-stack applications autonomously.

Skill Issue: How We Used AI to Make Agents Actually Good at Supabase — Pedro Rodrigues, Supabase

A deep dive into building, testing, and iterating on Agent Skills to improve AI agent performance. This workshop covers the core concepts of progressive disclosure, eval-driven development, and practical application using a real-world Supabase and PostgreSQL security scenario.

Judge the Judge: Building LLM Evaluators That Actually Work with GEPA — Mahmoud Mabrouk, Agenta AI

This workshop, led by Mahmoud Mabrouk, CEO of Agenta AI, covers how to build calibrated LLM-as-a-judge evaluations that reliably align with human judgment. It highlights how miscalibrated judges lead to false confidence and presents a practical workflow: designing use-case-specific metrics, annotating data in detail, and optimizing judge prompts with the GEPA algorithm. The talk emphasizes iterative debugging, model selection, and custom reflection templates for achieving trustworthy LLM evaluations.

How AI covered a human’s paternity leave // Quinten Rosseel

A practitioner's guide to deploying a text-to-SQL agent in a real-world business environment. The talk covers the critical lessons learned in moving from concept to production, focusing on the importance of the communication channel (Slack), the need to prioritize a semantic layer over benchmark scores, and a pragmatic approach to system architecture, testing, and evaluation.

Fully Connected Tokyo: [Hands-on workshop] Automation of document workflows in financial industry

This workshop by Upstage demonstrates how to automate financial document workflows by combining its specialized Document AI product (Document Parse) with Large Language Models (LLMs). The session covers building robust information extraction pipelines, addressing challenges such as varied templates and data formatting, and implementing systematic evaluation using Weights & Biases Weave. It also presents real-world case studies from the insurance industry, showcasing significant improvements in efficiency and data utilization.

Multi-Agent Systems for the Misinformation Lifecycle

A detailed overview of a modular, five-agent system designed to combat the entire lifecycle of digital misinformation. Based on an ICWSM research paper, this practitioner's guide details the roles of the Classifier, Indexer, Extractor, Corrector, and Verifier agents. The system emphasizes scalability, explainability, and high precision, moving beyond the limitations of single-LLM solutions. The talk covers the complete blueprint, from agent coordination and MLOps to holistic evaluation and optimization strategies for production environments.