Testing

BDD, ADR, PRD, WTF: Capturing Decisions for Humans and AI Alike — Michal Cichra, Safe Intelligence

Michal Cichra from Safe Intelligence explains how to keep AI-driven software development consistent by capturing decisions and enforcing them as rules. He argues for reviving Behavior-Driven Development (BDD) with Cucumber to close the loop that spec-driven development leaves open. The core idea is to enforce architectural and product decisions (ADRs, PRDs) through an automated loop of git hooks and CI, so that both human and AI developers adhere to established standards.
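The enforcement loop described above can be sketched as a small check that a git hook or CI job might run. This is an illustration under stated assumptions, not the speaker's tooling: the `Rule` type, the `ADR-0007` identifier, and the file contents are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    adr: str          # hypothetical ADR identifier the rule is derived from
    description: str  # human-readable statement of the decision
    forbidden: str    # regex that must not appear in source lines


def find_violations(rules, files):
    """Return (path, line_number, adr) for every line matching a forbidden pattern."""
    found = []
    for path, text in files.items():
        for number, line in enumerate(text.splitlines(), start=1):
            for rule in rules:
                if re.search(rule.forbidden, line):
                    found.append((path, number, rule.adr))
    return found


# Hypothetical decision: "use the logging module, not print".
RULES = [Rule("ADR-0007", "Use the logging module, not print", r"\bprint\(")]

# Stand-in for staged file contents; a real hook would read them from git.
FILES = {
    "app.py": "import logging\nprint('debug')\n",
    "ok.py": "import logging\nlogging.info('hi')\n",
}

violations = find_violations(RULES, FILES)
```

A hook built on this would exit with a non-zero status whenever `violations` is non-empty, which blocks the commit for human and AI authors alike.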

Context Is the New Code — Patrick Debois, Tessl

Patrick Debois argues that as AI coding agents become more capable, the context that drives them—prompts, rules, and memory—needs its own engineering discipline, akin to how we manage code. He introduces the Context Development Lifecycle (Generate, Evaluate, Distribute, and Observe) to make context a shared, repeatable, and improvable part of software delivery, creating a flywheel effect where better context leads to better agent output and continuous improvement.

Effect Oriented Programming • Bill Frasure, Bruce Eckel, James Ward & Andrew Harmel-Law • GOTO 2026

Authors Bill Frasure, Bruce Eckel, and James Ward discuss the core concepts of Effect-Oriented Programming. They explain how effects are composable operations that encapsulate side effects and defer execution, allowing developers to manage unpredictability with compiler-checked types. The conversation covers ZIO, the expansion of effect systems into languages like TypeScript and Kotlin, and their unique, constraint-driven writing process.
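The idea of an effect as a composable, deferred operation can be sketched in a few lines. This is a minimal illustration, not ZIO's API: the `Effect` class and the `record` helper are hypothetical, and Python does not provide the compiler-checked guarantees the authors describe.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

A = TypeVar("A")
B = TypeVar("B")


@dataclass(frozen=True)
class Effect(Generic[A]):
    """A description of a computation; nothing runs until run() is called."""
    thunk: Callable[[], A]

    def map(self, f: Callable[[A], B]) -> "Effect[B]":
        # Transform the eventual result without executing anything yet.
        return Effect(lambda: f(self.thunk()))

    def flat_map(self, f: Callable[[A], "Effect[B]"]) -> "Effect[B]":
        # Sequence a second effect that depends on the first one's result.
        return Effect(lambda: f(self.thunk()).run())

    def run(self) -> A:
        # The only place where side effects actually happen.
        return self.thunk()


log = []


def record(message: str) -> Effect[None]:
    """Hypothetical side-effecting operation: append a message to the log."""
    return Effect(lambda: log.append(message))


# Building the program performs no side effects; it only composes descriptions.
program = record("start").flat_map(lambda _: Effect(lambda: 20)).map(lambda n: n + 1)
```

Constructing `program` leaves `log` empty. The message is appended only when `program.run()` is called, which is the deferral the summary refers to.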

Evals Aren't Useful? Really?

A deep dive into why robust evaluation is critical for building reliable AI agents. Topics include bootstrapping evaluation sets, advanced testing techniques such as multi-turn simulations and red teaming, and the need to bring traditional software engineering and MLOps practices into the agent development lifecycle.
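A bootstrapped evaluation set can start very small. The sketch below assumes an agent is any callable from prompt to text; `EvalCase`, `pass_rate`, and `stub_agent` are hypothetical names used for illustration, not part of any framework mentioned in the talk.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def pass_rate(agent, cases):
    """Run every case through the agent and return the fraction that pass."""
    if not cases:
        raise ValueError("evaluation set is empty")
    passed = sum(1 for case in cases if case.check(agent(case.prompt)))
    return passed / len(cases)


def stub_agent(prompt: str) -> str:
    """Stand-in for a real agent call, so the example is deterministic."""
    return "4" if "2 + 2" in prompt else "I don't know"


CASES = [
    EvalCase("What is 2 + 2?", lambda out: "4" in out),
    EvalCase("What is the capital of France?", lambda out: "Paris" in out),
]

score = pass_rate(stub_agent, CASES)
```

Tracking `score` across agent versions in CI is one way to apply ordinary regression-testing practice to agent behavior.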

Evaluating AI Agents: Why It Matters and How We Do It

Annie Condon and Jeff Groom from Acre Security detail their practical approach to robustly evaluating non-deterministic AI agents. They explain why they treat evaluations as critical for quality, introduce their "X-ray machine" analogy for observability, and walk through their evaluation stack, including versioning strategies and the use of Logfire for tracing and Confident AI's DeepEval for systematic metric tracking.

AI traces are worth a thousand logs

An exploration of how a single, structured trace, based on OpenTelemetry standards, offers a superior method for debugging, testing, and understanding AI agent behavior compared to traditional logging. Learn how programmatic access to traces enables robust evaluation and the creation of golden datasets for building more reliable autonomous systems.
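Programmatic access to a trace makes it possible to assert on what an agent actually did. The sketch below uses a simplified `Span` record loosely modeled on OpenTelemetry's span concept (id, parent, name, attributes); it is not the OpenTelemetry SDK, and the `kind` attribute and tool names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ns: int
    attributes: dict = field(default_factory=dict)


def tool_sequence(spans):
    """Return the names of tool spans ordered by start time."""
    tool_spans = [s for s in spans if s.attributes.get("kind") == "tool"]
    return [s.name for s in sorted(tool_spans, key=lambda s: s.start_ns)]


def matches_golden(spans, golden):
    """Compare the observed tool-call order with a golden (expected) order."""
    return tool_sequence(spans) == golden


# A hypothetical trace of one agent run; spans arrive out of order on purpose.
trace = [
    Span("1", None, "agent.run", 0, {"kind": "agent"}),
    Span("3", "1", "send_email", 300, {"kind": "tool"}),
    Span("2", "1", "search_docs", 100, {"kind": "tool"}),
]

GOLDEN = ["search_docs", "send_email"]
```

Here `matches_golden(trace, GOLDEN)` holds because the tool spans are ordered by start time rather than arrival order. Traces from known-good runs can be saved as golden datasets in the same way.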