Testing

Iterating on Your AI Evals // Mariana Prazeres // Agents in Production 2025

Moving an AI agent from a promising demo to a reliable product is challenging. This talk presents a startup-friendly, iterative process for building robust evaluation frameworks, emphasizing that you must iterate on the evaluations themselves—the metrics and the data—not just the prompts and models. It outlines a practical "crawl, walk, run" approach, starting with simple heuristics and scaling to an advanced system with automated checks and human-in-the-loop validation.
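As a rough illustration of the "crawl" stage, the sketch below scores saved agent responses with a few simple heuristics and reports a pass rate. It is not taken from the talk: the `EvalCase` type, the `heuristic_check` and `pass_rate` functions, the 500-character limit, and the sample data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One recorded interaction to evaluate (hypothetical structure)."""
    prompt: str
    response: str
    must_include: tuple = ()  # terms a good answer is expected to mention


def heuristic_check(case: EvalCase, max_chars: int = 500) -> bool:
    """Cheap checks: non-empty, not overly long, mentions the required terms."""
    text = case.response.strip()
    if not text or len(text) > max_chars:
        return False
    lowered = text.lower()
    return all(term.lower() in lowered for term in case.must_include)


def pass_rate(cases: list) -> float:
    """Fraction of cases that pass every heuristic (0.0 for an empty list)."""
    if not cases:
        return 0.0
    return sum(heuristic_check(c) for c in cases) / len(cases)


# Made-up sample data, for illustration only.
CASES = [
    EvalCase("How do I reset my password?",
             "Open Settings and choose Reset password.",
             ("reset", "password")),
    EvalCase("What is your refund policy?", "", ("refund",)),
    EvalCase("Where is my invoice?",
             "Invoices are listed under Billing.",
             ("invoice",)),
]

if __name__ == "__main__":
    print(f"pass rate: {pass_rate(CASES):.2f}")
```

Iterating on the evaluation itself would then mean revising the `must_include` terms, the length limit, and the cases as real failures surface, before moving on to automated checks and human review.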

Reading Code Effectively: An Overlooked Developer Skill • Marit van Dijk & Hannes Lowette

Marit van Dijk and Hannes Lowette discuss why reading code is a critical, yet underdeveloped, skill for software developers. They explore research-backed strategies such as structured code reading clubs, the use of modern IDEs and AI assistants to comprehend complex codebases, and empathy in code reviews. The conversation emphasizes using tests as documentation and writing clear commit messages to improve collaboration and long-term maintainability.

Evals Are Not Unit Tests — Ido Pesok, Vercel v0

Ido Pesok from Vercel explains why LLM-based applications often fail in production despite successful demos, and presents a systematic framework for building reliable AI systems using application-layer evaluations ("evals").

Practical tactics to build reliable AI apps — Dmitry Kuchin, Multinear

Moving an AI proof of concept (PoC) from 50% to 100% reliability requires a new development paradigm. This talk introduces a practical, evaluations-first approach: reverse-engineer tests from real-world user scenarios and business outcomes to build a robust benchmark, prevent regressions, and enable confident optimization.

Beyond the Prototype: Using AI to Write High-Quality Code - Josh Albrecht, Imbue

Josh Albrecht, CTO of Imbue, discusses the engineering challenges in building reliable AI coding agents. He introduces Sculptor, an experimental environment designed to build trust in AI-generated code by focusing on preventing and detecting problems through structured workflows, automated testing, and AI-driven analysis, moving beyond simple code generation to create maintainable software.