Production AI

Production monitoring for AI applications using W&B Weave

Learn how W&B Weave's online evaluations enable real-time monitoring of AI applications in production, allowing teams to track performance, catch failures, and iterate on quality over time using LLM-as-a-judge scores.

Why Language Models Need a Lesson in Education

Stephanie Kirmer, a staff machine learning engineer at DataGrail, adapts her experience as a former professor to address the challenge of evaluating LLMs in production. She proposes a robust methodology using LLM-based evaluators guided by rigorous, human-calibrated rubrics to bring objectivity and scalability to the subjective task of assessing text generation quality.
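The rubric-guided evaluator approach Kirmer describes can be sketched in a few lines. This is a minimal, self-contained illustration: the rubric criteria, weights, and the stand-in `score_criterion` keyword check are hypothetical (a real evaluator would send the rubric text and the answer to a judge LLM), not code from the talk.

```python
# Sketch of a rubric-guided evaluator: each criterion is scored
# independently, and weighted scores are aggregated. In production,
# score_criterion would be an LLM-as-a-judge call given the rubric
# text; here it is a toy keyword check so the sketch runs standalone.
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    description: str  # the rubric text the judge model would see
    weight: float


def score_criterion(criterion: Criterion, answer: str) -> int:
    # Stand-in for an LLM judge: 1 if the answer mentions the
    # criterion's key term, else 0. A real judge returns a rubric score.
    return int(criterion.name.lower() in answer.lower())


def evaluate(answer: str, rubric: list[Criterion]) -> float:
    """Weighted average of per-criterion scores, in [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * score_criterion(c, answer) for c in rubric) / total_weight


rubric = [
    Criterion("accuracy", "Is the answer factually correct?", 0.5),
    Criterion("citations", "Does the answer cite its sources?", 0.3),
    Criterion("tone", "Is the tone appropriate?", 0.2),
]
```

Human calibration, as the talk emphasizes, means running `evaluate` over a small set of human-labeled answers and checking that judge scores track the human scores before trusting the evaluator at scale.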

12-factor Agents - Patterns of reliable LLM applications // Dexter Horthy

Drawing from conversations with top AI builders, Dex argues that production-grade AI agents are not magical loops but well-architected software. This talk introduces "12-Factor Agents," a methodology centered on "Context Engineering" to build reliable, high-performance LLM-powered applications by applying rigorous software engineering principles.

Practical tactics to build reliable AI apps — Dmitry Kuchin, Multinear

Moving an AI PoC from 50% to 100% reliability requires a new development paradigm. This talk introduces a practical, evaluations-first approach, reverse-engineering tests from real-world user scenarios and business outcomes to build a robust benchmark, prevent regressions, and enable confident optimization.
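The evaluations-first workflow described here, deriving tests from real user scenarios and tracking a pass rate to catch regressions, might look like the following sketch. The scenario data, the `must_contain` check, and the echo app are illustrative stand-ins, not material from the talk.

```python
# Sketch: reverse-engineer a benchmark from real user scenarios, then
# run it on every change so a dropping pass rate flags regressions.
from typing import Callable

# Each scenario captures a real user input and the outcome the business
# expects from the system (illustrative examples only).
scenarios = [
    {"input": "cancel my subscription", "must_contain": "cancel"},
    {"input": "what is your refund policy", "must_contain": "refund"},
]


def run_benchmark(app: Callable[[str], str]) -> float:
    """Return the fraction of scenarios the app handles correctly."""
    passed = sum(
        1 for s in scenarios if s["must_contain"] in app(s["input"]).lower()
    )
    return passed / len(scenarios)


# Trivial stand-in for the AI app under test: echoes the request.
def echo_app(user_input: str) -> str:
    return f"Handling request: {user_input}"


pass_rate = run_benchmark(echo_app)
```

The point of the pattern is the direction of derivation: tests come from observed user scenarios and business outcomes, so the benchmark measures what the PoC actually failed at, and each optimization can be made with confidence that earlier wins are protected.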

Building Agents at Cloud Scale — Antje Barth, AWS

A deep dive into building and scaling production-ready AI agents, detailing a model-driven approach using the open-source 'Strands' SDK and a cloud-native architecture for deploying remote tools with MCP and AWS Lambda.

Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith

Traditional benchmarks and leaderboards are insufficient for production AI. This summary details a practical, multi-layered evaluation strategy, moving from foundational system performance to factual accuracy and finally to safety and bias, using open-source tools like GuideLLM, lm-eval-harness, and Promptfoo.
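The layered strategy in the talk, system performance first, then factual accuracy, then safety, can be sketched as a pipeline of gates. This is an illustrative stand-in with toy checks (a latency budget, exact-match accuracy, a keyword blocklist), not the actual GuideLLM, lm-eval-harness, or Promptfoo APIs.

```python
# Sketch of a multi-layered evaluation gate: each layer must pass
# before the next runs, mirroring performance -> accuracy -> safety.
import re


def check_performance(latency_ms: float, budget_ms: float = 500.0) -> bool:
    # Layer 1: system performance (here, a simple latency budget).
    return latency_ms <= budget_ms


def check_accuracy(answer: str, expected: str) -> bool:
    # Layer 2: factual accuracy (here, exact match; real harnesses
    # offer much richer task-specific metrics).
    return answer.strip().lower() == expected.strip().lower()


def check_safety(answer: str) -> bool:
    # Layer 3: safety (here, a toy blocklist; real tooling covers
    # jailbreaks, bias, and data leakage).
    return not re.search(r"\b(password|ssn)\b", answer, re.IGNORECASE)


def evaluate(latency_ms: float, answer: str, expected: str) -> str:
    """Run the layers in order and report the first failing gate."""
    if not check_performance(latency_ms):
        return "fail: performance"
    if not check_accuracy(answer, expected):
        return "fail: accuracy"
    if not check_safety(answer):
        return "fail: safety"
    return "pass"
```

Ordering the layers this way is deliberate: there is little value in scoring factual accuracy or red-teaming a deployment that cannot meet its latency and throughput targets in the first place.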