Model Evaluation

Jan 09, 2026

Collaborative AI Agents At OpenAI

Robert from OpenAI discusses the critical role of structured evaluations (evals) and graders for developing advanced collaborative agents. He explores the limitations of 'vibe-based' assessments, introduces a maturity model for evals, and presents a comprehensive rubric for measuring agent performance beyond simple accuracy, connecting these concepts to the power of Reinforcement Fine-Tuning (RFT).

Jan 09, 2026

KDD '25 AI Reasoning Day keynote: Improving AI Reasoning through Intent, Interaction, and Inspection

A deep dive into practical strategies for improving AI reasoning in code and structured tasks. The talk covers capturing richer user intent through examples, enabling collaborative interaction, and using automated inspection for iterative refinement, illustrated with real-world applications from Microsoft.

Nov 24, 2025

Fully Connected 2025 kickoff: The rise (and the challenges) of the agentic era

Robin Bordoli of Weights & Biases explores AI's exponential growth, from past achievements to the current agentic landscape. He discusses the rise of reinforcement learning, the challenge of productionizing reliable agents, and highlights how foundational issues in AI development persist even as model capabilities soar.

Sep 29, 2025

Beyond Chatbots: How to build Agentic AI systems with Google Gemini // Philipp Schmid

A deep dive into the evolution from static chatbots to dynamic, agentic AI systems. Philipp Schmid of Google DeepMind explores how to design, build, and evaluate AI agents that leverage structured outputs, function calling, and workflow orchestration with Google Gemini, covering key agentic patterns and the future of AI development.

Aug 14, 2025

Traditional vs LLM Recommender Systems: Are They Worth It?

This summary explores Arpita Vats's insights on how Large Language Models (LLMs) are revolutionizing recommender systems. It contrasts the traditional feature-engineering-heavy approach with the contextual understanding of LLMs, which shifts the focus to prompt engineering. Key challenges like inference latency and cost are discussed, along with practical solutions such as lightweight models, knowledge distillation, and hybrid architectures. The conversation also touches on advanced applications like sequential recommendation and the future potential of agentic AI.

Aug 08, 2025

Open AI Researchers Breakdown GPT-5

OpenAI researchers discuss the step-change in capabilities in ChatGPT-5, from coding and reasoning to creative writing. They detail the data-centric training processes, the shift toward asynchronous agentic workflows, and the future of AI development and its impact on the startup ecosystem.