LLM-as-a-judge

Build and monitor multi-agent contact centers using Weights & Biases

This post explores the shift from costly legacy contact center software to multi-agent AI systems. It demonstrates how to build, monitor, and evaluate these complex agentic systems using the Weights & Biases AI Developer Platform, with a focus on tracing, quality assessment, and consistent customer support.
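
To make the tracing concrete, here is a minimal sketch of instrumenting a multi-agent flow with Weave. `weave.init` and the `@weave.op` decorator are part of the Weave Python library; the project name and the agent functions (`triage_agent`, `billing_agent`, `handle_ticket`) are hypothetical stand-ins for real agent logic.

```python
import weave

# Initialize Weave; "contact-center" is a hypothetical project name.
weave.init("contact-center")

# Decorating each agent step with @weave.op records its inputs,
# outputs, and latency as spans in a Weave trace.
@weave.op
def triage_agent(message: str) -> str:
    # Hypothetical routing logic: pick which specialist agent handles the request.
    return "billing" if "invoice" in message.lower() else "support"

@weave.op
def billing_agent(message: str) -> str:
    # Placeholder for an LLM call that drafts a billing response.
    return f"Billing reply to: {message}"

@weave.op
def handle_ticket(message: str) -> str:
    # Nested ops show up as child spans under this call's trace,
    # so the full multi-agent path is visible per ticket.
    route = triage_agent(message)
    if route == "billing":
        return billing_agent(message)
    return "Escalating to human support."

print(handle_ticket("Why was my invoice charged twice?"))
```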

Production monitoring for AI applications using W&B Weave

Learn how W&B Weave's online evaluations enable real-time monitoring of AI applications in production, allowing teams to track performance, catch failures, and iterate on quality over time using LLM-as-a-judge scores.
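
A minimal sketch of what such an online LLM-as-a-judge check might look like, assuming the `weave.Scorer` subclass pattern and the `call.apply_scorer` API from recent Weave versions. The project name, the `answer` op, and the string check inside `RelevanceJudge.score` (standing in for an actual judge-model call) are hypothetical.

```python
import asyncio
import weave

weave.init("support-bot")  # hypothetical project name

@weave.op
def answer(question: str) -> str:
    # Placeholder for the production model call being monitored.
    return "You can reset your password from the account settings page."

class RelevanceJudge(weave.Scorer):
    @weave.op
    def score(self, output: str) -> dict:
        # Hypothetical judge: a real implementation would prompt an LLM
        # to grade the response and parse its verdict into a score.
        return {"relevant": "password" in output}

async def main() -> None:
    # .call() returns both the result and the logged call object,
    # so a judge score can be attached to the production trace afterward.
    result, call = answer.call("How do I reset my password?")
    await call.apply_scorer(RelevanceJudge())

asyncio.run(main())
```

Scores attached this way accumulate on live traffic, which is what lets teams watch quality trend over time rather than only at release checkpoints.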

Iterating on Your AI Evals // Mariana Prazeres // Agents in Production 2025

Moving an AI agent from a promising demo to a reliable product is challenging. This talk presents a startup-friendly, iterative process for building robust evaluation frameworks, emphasizing that you must iterate on the evaluations themselves—the metrics and the data—not just the prompts and models. It outlines a practical "crawl, walk, run" approach, starting with simple heuristics and scaling to an advanced system with automated checks and human-in-the-loop validation.
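
As an illustration of the "crawl" stage, here is a minimal sketch of a heuristic-only evaluation using `weave.Evaluation`; the dataset, the `model` op, and the `exact_match` scorer are hypothetical, and scorer parameters are matched by name to dataset columns plus the model's `output`.

```python
import asyncio
import weave

weave.init("evals-crawl-stage")  # hypothetical project name

# Tiny hypothetical dataset; row keys become keyword arguments
# for the model and the scorers.
examples = [
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "Capital of France?", "expected": "Paris"},
]

@weave.op
def model(question: str) -> str:
    # Placeholder for the agent or LLM call under evaluation.
    return {"What is 2 + 2?": "4", "Capital of France?": "Paris"}[question]

@weave.op
def exact_match(expected: str, output: str) -> dict:
    # "Crawl" stage: a cheap heuristic check before investing in
    # LLM-as-a-judge scorers or human review.
    return {"correct": output.strip() == expected}

evaluation = weave.Evaluation(dataset=examples, scorers=[exact_match])
asyncio.run(evaluation.evaluate(model))
```

The same `Evaluation` can later take additional scorers alongside `exact_match`, scaling toward the automated checks and human-in-the-loop validation the talk describes.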