MLOps

Why AI Agents Shouldn't Replace Your Fraud Models

Varant Zanoyan, original author of the Chronon feature platform, introduces 'agentic experimentation'—a pattern where AI agents improve high-stakes ML systems without making live decisions. He explains how Chronon solves key challenges like infrastructure sprawl, safety, and reproducibility through a semantic API, branch-based isolation, and compute reuse, enabling agents to safely create production-ready pipelines for human review.

Playground in Prod - Optimising Agents in Production Environments — Samuel Colvin, Pydantic

Samuel Colvin, creator of Pydantic, demonstrates a hands-on workflow for continuously optimizing AI agents in production. The session covers using Logfire for running evaluations, GEPA (Genetic Pareto) for autonomously evolving better prompts, and managed variables to deploy these improvements to live services without redeployment.

The Small Model Infrastructure Nobody Built (So We Did) — Filip Makraduli, Superlinked

Filip Makraduli from Superlinked discusses the common infrastructure gaps and profiling mistakes encountered when deploying small embedding and transformer models. He introduces the Superlinked Inference Engine (SIE), an open-source solution designed for dynamic model loading, hot-swapping, and memory-aware eviction to maximize GPU utilization and streamline the path from development to production.

Getting Humans Out of the Way: How to Work with Teams of Agents

Rob Ennals, creator of Broomy, discusses a paradigm shift in working with AI coding agents: moving away from micromanagement towards orchestrating teams of parallel agents. The key is to design robust, automated validation systems and reshape the development environment to empower agents to work autonomously, efficiently, and at scale.

Why building eval platforms is hard — Phil Hetzel, Braintrust

An evaluation platform is more than a simple test runner; it's a complex system for creating shared definitions of quality. This talk explores the evolution of eval platforms from basic spreadsheets to sophisticated, integrated systems, highlighting the hidden data and systems engineering challenges involved in making them credible, scalable, and usable for building trustworthy AI agents.

It's 2026, and We're Still Talking Evals

Maggie Konstanty, AI Product Manager at Prosus, provides a candid look into the realities of LLM evaluation in production. She argues that standard metrics like accuracy are misleading and advocates for a culture of continuous, goal-oriented evaluation focused on deep failure analysis and understanding real user behavior, asserting that mature teams inevitably build custom tooling to meet their specific needs.