LLM-as-a-Judge | Tokenless

LLM as a Judge

Apr 27, 2026

It's 2026, and We're Still Talking Evals

Maggie Konstanty, AI Product Manager at Prosus, provides a candid look into the realities of LLM evaluation in production. She argues that standard metrics like accuracy are misleading and advocates for a culture of continuous, goal-oriented evaluation focused on deep failure analysis and understanding real user behavior, asserting that mature teams inevitably build custom tooling to meet their specific needs.

Apr 10, 2026

Judge the Judge: Building LLM Evaluators That Actually Work with GEPA — Mahmoud Mabrouk, Agenta AI

This workshop by Mahmoud Mabrouk, CEO of Agenta AI, covers how to build calibrated LLM-as-a-judge evaluations that reliably align with human judgment. It shows how miscalibrated judges create false confidence and presents a practical workflow: designing use-case-specific metrics, annotating data in detail, and optimizing judge prompts with the GEPA algorithm. The talk emphasizes iterative debugging, model selection, and custom reflection templates for achieving trustworthy and effective LLM evaluations.

Dec 26, 2025

Shipping AI That Works: An Evaluation Framework for PMs – Aman Khan, Arize

This talk provides a practical framework for product managers to move beyond simple "vibe checks" to implement rigorous, data-driven evaluation for LLM-powered products. Using a live demo of a multi-agent AI trip planner, the speaker breaks down essential methodologies, including human feedback, code-based checks, and LLM-as-a-judge systems, and demonstrates how to iterate on both prompts and the evals themselves to ensure consistent quality and build user trust.

Dec 23, 2025

Continual System Prompt Learning for Code Agents – Aparna Dhinakaran, Arize

The talk by Aparna Dhinakaran introduces "system prompt learning" as an efficient alternative to traditional reinforcement learning for improving LLM-based coding agents. By using LLM-as-a-judge evaluations to generate English feedback and explanations for code failures, agents can automatically refine their system prompts and rules. This method, demonstrated on Claude and Cline, significantly boosts performance on benchmarks like SWE-bench with minimal data, highlighting the critical role of high-quality evaluation prompts.

Oct 28, 2025

Build and monitor multi-agent contact centers using Weights & Biases

This post explores the shift from costly legacy contact center software to multi-agent AI systems. It demonstrates how to build, monitor, and evaluate these complex agentic systems using the Weights & Biases AI Developer Platform, with a focus on tracing, quality assessment, and ensuring consistent customer support.

Sep 17, 2025

Production monitoring for AI applications using W&B Weave

Learn how W&B Weave's online evaluations enable real-time monitoring of AI applications in production, allowing teams to track performance, catch failures, and iterate on quality over time using LLM-as-a-judge scores.