Model evaluation

Collaborative AI Agents At OpenAI

Robert from OpenAI discusses the critical role of structured evaluations (evals) and graders for developing advanced collaborative agents. He explores the limitations of 'vibe-based' assessments, introduces a maturity model for evals, and presents a comprehensive rubric for measuring agent performance beyond simple accuracy, connecting these concepts to the power of Reinforcement Fine-Tuning (RFT).

KDD '25 AI Reasoning Day keynote: Improving AI Reasoning through Intent, Interaction, and Inspection

A deep dive into practical strategies for improving AI reasoning in code and structured tasks. The talk covers capturing richer user intent through examples, enabling collaborative interaction, and using automated inspection for iterative refinement, illustrated with real-world applications from Microsoft.

Fully Connected 2025 kickoff: The rise (and the challenges) of the agentic era

Robin Bordoli of Weights & Biases explores AI's exponential growth, from past achievements to the current agentic landscape. He discusses the rise of reinforcement learning, the challenge of productionizing reliable agents, and highlights how foundational issues in AI development persist even as model capabilities soar.

Beyond Chatbots: How to build Agentic AI systems with Google Gemini // Philipp Schmid

A deep dive into the evolution from static chatbots to dynamic, agentic AI systems. Philipp Schmid of Google DeepMind explores how to design, build, and evaluate AI agents that leverage structured outputs, function calling, and workflow orchestration with Google Gemini, covering key agentic patterns and the future of AI development.

Traditional vs LLM Recommender Systems: Are They Worth It?

Arpita Vats discusses how Large Language Models (LLMs) are revolutionizing recommender systems. The conversation contrasts the traditional feature-engineering-heavy approach with the contextual understanding of LLMs, which shifts the focus to prompt engineering. Key challenges like inference latency and cost are discussed, along with practical solutions such as lightweight models, knowledge distillation, and hybrid architectures. It also touches on advanced applications like sequential recommendation and the future potential of agentic AI.

Open AI Researchers Breakdown GPT-5

OpenAI researchers discuss the step change in capabilities in GPT-5, from coding and reasoning to creative writing. They detail the data-centric training processes, the shift toward asynchronous agentic workflows, and the future of AI development and its impact on the startup ecosystem.