Observability

Evaluation-Driven Development with MLflow 3.0

Yuki Watanabe from Databricks introduces Evaluation-Driven Development (EDD) as a critical methodology for building production-ready AI agents. This talk explores the five pillars of EDD and demonstrates how MLflow 3.0's new features—including one-line tracing, automated evaluation, human-in-the-loop feedback, and monitoring—provide a comprehensive toolkit to ensure agent quality and reliability.
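
As a rough illustration of the "one-line tracing" mentioned above, here is a minimal sketch assuming an OpenAI-backed agent and MLflow's OpenAI autologging integration; the experiment name and prompt are placeholders, not from the talk:

    import mlflow
    import openai

    # One line enables automatic tracing of every OpenAI call the agent makes.
    mlflow.openai.autolog()

    mlflow.set_experiment("agent-quality")  # illustrative experiment name

    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Summarize today's support tickets."}],
    )
    print(response.choices[0].message.content)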

Streamline evaluation, monitoring, optimization of AI data flywheel with NVIDIA and Weights & Biases

A walkthrough of the NVIDIA Data Flywheel Blueprint, demonstrating how to use production data and Weights & Biases to systematically fine-tune AI agents. This process enhances model accuracy and efficiency by creating a continuous improvement cycle, moving beyond the limitations of prompt engineering.
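
A minimal sketch of the monitoring half of such a flywheel: logging per-iteration evaluation metrics to Weights & Biases so accuracy and efficiency trends stay visible across fine-tuning cycles. The project name and metrics are illustrative, not taken from the Blueprint:

    import wandb

    # Hypothetical per-iteration results produced by an evaluation harness.
    eval_results = [
        {"iteration": 1, "accuracy": 0.71, "avg_latency_s": 2.4},
        {"iteration": 2, "accuracy": 0.78, "avg_latency_s": 1.9},
    ]

    run = wandb.init(project="data-flywheel", job_type="evaluation")  # illustrative project name
    for result in eval_results:
        # Each fine-tuning iteration logs its metrics so the improvement
        # cycle can be compared run-over-run in the W&B dashboard.
        wandb.log(result, step=result["iteration"])
    run.finish()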

The Hidden Bottlenecks Slowing Down AI Agents

Paul van der Boor and Bruce Martens from Prosus discuss the real bottlenecks in AI agent development, arguing that the primary challenges are not tools, but rather evaluation, data quality, and feedback loops. They detail their 'buy-first' philosophy, the practical reasons they often build in-house, and how new coding agents like Devin and Cursor are changing their development workflows.

MLflow 3.0: The Future of AI Agents

Eric Peter from Databricks outlines the evolution from the traditional MLOps lifecycle to the more complex Agent Ops lifecycle. He details the five essential components of a successful agent development platform and introduces MLflow 3.0, a new release designed to provide a comprehensive, open-standard solution for building, evaluating, and deploying AI agents.
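
As a hedged example of the evaluation piece of that lifecycle, the sketch below scores a static set of agent answers against references with mlflow.evaluate; the data and column names are placeholders, and the talk's own workflow may differ:

    import mlflow
    import pandas as pd

    # Hypothetical agent outputs paired with reference answers.
    eval_data = pd.DataFrame(
        {
            "inputs": ["What is our refund policy?"],
            "predictions": ["Refunds are available within 30 days of purchase."],
            "ground_truth": ["Refunds are accepted within 30 days."],
        }
    )

    with mlflow.start_run():
        # Evaluate the static predictions against the references; results are
        # logged to the run so agent versions can be compared side by side.
        results = mlflow.evaluate(
            data=eval_data,
            predictions="predictions",
            targets="ground_truth",
            model_type="question-answering",
        )
        print(results.metrics)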

LLMOps for eval-driven development at scale

Mercari's engineering team shares their practical, evaluation-centric approach to LLMOps. Learn how they leverage tiered evaluations, strategic tooling for observability, and rapid iteration to productionize LLM features for over 23 million users, emphasizing that good 'evals' are often more critical than model fine-tuning or RAG.
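
As a loose illustration of what a tiered evaluation gate can look like (hypothetical tiers and thresholds, not Mercari's actual setup): cheap checks run on every change, while more expensive judged evaluations run only before release:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class EvalTier:
        name: str
        run: Callable[[], float]  # returns a score in [0, 1]
        threshold: float

    def passes_all(tiers: list[EvalTier]) -> bool:
        for tier in tiers:
            score = tier.run()
            print(f"{tier.name}: {score:.2f} (threshold {tier.threshold})")
            if score < tier.threshold:
                return False  # fail fast: skip the more expensive tiers
        return True

    # Illustrative tiers, cheapest first.
    tiers = [
        EvalTier("format checks", lambda: 0.98, 0.95),        # regex/schema assertions
        EvalTier("offline eval set", lambda: 0.87, 0.80),     # golden-answer comparison
        EvalTier("LLM-as-judge sample", lambda: 0.82, 0.75),  # small judged sample
    ]
    print("release gate passed:", passes_all(tiers))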

From DevOps ‘Heart Attacks’ to AI-Powered Diagnostics With Traversal’s AI Agents

Anish Agarwal and Raj Agrawal, co-founders of Traversal, discuss how their AI agents automate root cause analysis (RCA) for critical system failures. They detail their agent's architecture, which leverages causal inference and large-scale computation to systematically find the root cause in minutes, and argue that the rise of AI-generated code makes AI-powered debugging an essential capability for modern software engineering.