Model Evaluation

May 27, 2026

Power agents with full context of your experiments and traces using the W&B MCP server

The W&B Model Context Protocol (MCP) server is a hosted endpoint that lets AI agents interact with Weights & Biases data, including runs, traces, evaluations, and reports. It provides discovery tools for building queries, automated analysis for comparing experiments and identifying regressions, and integration with IDEs, coding agents, and chat interfaces like Mistral AI for ML workflows and on-the-go reporting.

Jan 09, 2026

Collaborative AI Agents At OpenAI

Robert from OpenAI discusses the critical role of structured evaluations (evals) and graders for developing advanced collaborative agents. He explores the limitations of 'vibe-based' assessments, introduces a maturity model for evals, and presents a comprehensive rubric for measuring agent performance beyond simple accuracy, connecting these concepts to the power of Reinforcement Fine-Tuning (RFT).

Jan 09, 2026

KDD '25 AI Reasoning Day keynote: Improving AI Reasoning through Intent, Interaction, and Inspection

A deep dive into practical strategies for improving AI reasoning in code and structured tasks. The talk covers capturing richer user intent through examples, enabling collaborative interaction, and using automated inspection for iterative refinement, illustrated with real-world applications from Microsoft.

Nov 24, 2025

Fully Connected 2025 kickoff: The rise (and the challenges) of the agentic era

Robin Bordoli of Weights & Biases explores AI's exponential growth, from past achievements to the current agentic landscape. He discusses the rise of reinforcement learning, the challenge of productionizing reliable agents, and highlights how foundational issues in AI development persist even as model capabilities soar.

Oct 06, 2025

Live from DevDay — the OpenAI Podcast Ep. 7

In a special live episode from OpenAI DevDay, host Andrew Mayne interviews founders from SchoolAI, jam.dev, Abridge, and Cursor. They discuss how they are using AI to transform education, web development, healthcare, and coding, and share insights on their product strategies, technical challenges, and excitement for the next wave of agent-based AI tools.

Sep 29, 2025

Beyond Chatbots: How to build Agentic AI systems with Google Gemini // Philipp Schmid

A deep dive into the evolution from static chatbots to dynamic, agentic AI systems. Philipp Schmid of Google DeepMind explores how to design, build, and evaluate AI agents that leverage structured outputs, function calling, and workflow orchestration with Google Gemini, covering key agentic patterns and the future of AI development.