Reliability

Beyond the Gold Standard: Evaluating and Trusting Agents in the Wild // Sanjana Sharma

A deep dive into the challenges of deploying AI agents in production, arguing that reliability stems not from model intelligence but from a "system-first" approach. The talk introduces a new architecture that separates the LLM's reasoning from a versioned, auditable "Context Layer" containing business logic and expert knowledge, which is continuously updated through a "Living Ground Truth" loop driven by expert feedback.

Reinforcement Learning for Agents — with Amazon AGI Labs’ Antje Barth

Antje Barth from Amazon's AGI Labs discusses Nova Act, a new service for building reliable AI agents. She explores how the team achieves over 90% reliability using reinforcement learning in "web gyms", the shift towards "normcore" agents for practical automation, and the future of AI as a digital co-worker.

Catastrophic agent failure and how to avoid it // Edward Upton // Agents in Production 2025

Edward Upton, a founding engineer at Asteroid, discusses the critical challenge of managing catastrophic failures in agentic browser solutions, particularly in high-stakes domains like healthcare and insurance. He shares real-world examples of agent failures and outlines a practical framework for building more reliable, predictable, and accountable agents by scoping their capabilities, implementing robust human-in-the-loop tooling, and employing independent evaluation systems.

Evals Are Not Unit Tests — Ido Pesok, Vercel v0

Ido Pesok from Vercel explains why LLM-based applications often fail in production despite successful demos, and presents a systematic framework for building reliable AI systems using application-layer evaluations ("evals").

Practical tactics to build reliable AI apps — Dmitry Kuchin, Multinear

Moving an AI PoC from 50% to 100% reliability requires a new development paradigm. This talk introduces a practical, evaluations-first approach, reverse-engineering tests from real-world user scenarios and business outcomes to build a robust benchmark, prevent regressions, and enable confident optimization.

From Self-driving to Autonomous Voice Agents — Brooke Hopkins, Coval

Brooke Hopkins, founder of Coval, discusses how evaluation methodologies from the autonomous vehicle industry, particularly from her experience at Waymo, can be adapted to build reliable, scalable, and trustworthy voice and conversational AI systems.