Reliability

Harnesses in AI: A Deep Dive — Tejas Kumar, IBM

A deep dive into AI harnesses, explaining how to build a programmatic environment around an LLM agent to ensure reliability without prompt engineering. The talk demonstrates building a harness for a browser agent to reliably log in and upvote a post on Hacker News using GPT-3.5 Turbo.

Write Reliable Software with Temporal

Johann Schleier-Smith from Temporal explains Durable Execution, a paradigm for building reliable, long-running applications. He details how Temporal's model of deterministic workflows and stateful activities provides a robust alternative to traditional checkpointing and event-driven architectures, especially for complex, LLM-driven agentic systems.

Beyond the Gold Standard: Evaluating and Trusting Agents in the Wild // Sanjana Sharma

A deep dive into the challenges of deploying AI agents in production, arguing that reliability stems not from model intelligence but from a "system-first" approach. The talk introduces a new architecture that separates the LLM's reasoning from a versioned, auditable "Context Layer" containing business logic and expert knowledge. That layer is continuously updated through a "Living Ground Truth" loop driven by expert feedback.

Reinforcement Learning for Agents — with Amazon AGI Labs’ Antje Barth

Antje Barth from Amazon's AGI Labs discusses Nova Act, a new service for building reliable AI agents. She explores how they achieve over 90% reliability using reinforcement learning in 'web gyms', the shift towards 'normcore' agents for practical automation, and the future of AI as a digital co-worker.

Catastrophic agent failure and how to avoid it // Edward Upton // Agents in Production 2025

Edward Upton, a founding engineer at Asteroid, discusses the critical challenge of managing catastrophic failures in agentic browser solutions, particularly in high-stakes domains like healthcare and insurance. He shares real-world examples of agent failures and outlines a practical framework for building more reliable, predictable, and accountable agents: scoping their capabilities, implementing robust human-in-the-loop tooling, and employing independent evaluation systems.

Evals Are Not Unit Tests — Ido Pesok, Vercel v0

Ido Pesok from Vercel explains why LLM-based applications often fail in production despite successful demos, and presents a systematic framework for building reliable AI systems using application-layer evaluations ("evals").