AI evaluation

The Production AI Playbook: Deploying Agents at Enterprise Scale — Sandipan Bhaumik, Databricks

Sandipan Bhaumik presents a five-pillar framework for successfully moving AI systems from demos to production, inspired by a retail bank's failed chatbot PoC. The framework covers defining numerical success (Evaluation), tracing every AI decision (Observability), building robust data pipelines (Data Foundation), managing multiple AI interactions (Multi-agent Orchestration), and ensuring accountability and security (Governance). He illustrates these concepts with a banking chatbot case study, emphasizing continuous evaluation, data quality, and a proactive incident playbook.

Agentic Evaluations at Scale, For Everybody — Nicholas Kang & Michael Aaron, Google DeepMind

Nicholas Kang and Michael Aaron from Google DeepMind's Kaggle team argue that AI evaluations are broken: they are scattered, non-transparent, and created by a homogeneous group. Their proposed solutions are a community-driven benchmarks platform, a PvP Game Arena for non-saturating Elo ratings, standardized agent exams, and hackathons to crowdsource novel evals and address the limitations of current benchmarking practices.

Dark Factory: How OpenClaw Ships Faster Than You Can Read the Diff — Vincent Koc

Vincent Koc argues that static benchmarks are failing in the era of adaptive AI. He proposes a shift from static testing to 'malleable evals,' where agents self-optimize and curate their own test suites based on user intent and production data, treating evaluation as a living, evolving system.

Are AI Benchmarks Telling The Full Story? [SPONSORED]

AI models are often benchmarked like Formula 1 cars, excelling on technical exams but failing the test of daily human experience. Researchers Andrew Gordon and Nora Petrova from Prolific critique the 'leaderboard illusion' of current ranking systems and introduce their HUMAINE leaderboard, a new framework that uses census-based sampling and the TrueSkill algorithm to measure how helpful, safe, and relatable models are to real people, not just tech enthusiasts.

The 100-person lab that became Anthropic and Google's secret weapon | Edwin Chen (Surge AI)

Edwin Chen, founder and CEO of Surge AI, discusses his contrarian, bootstrapped approach to building a billion-dollar company, the critical role of high-quality data and 'taste' in training advanced AI models, the pitfalls of current benchmarks, and why Reinforcement Learning environments are the next frontier in AI.

Ideas: Community building, machine learning, and the future of AI

Co-founders Jenn Wortman Vaughan and Hanna Wallach reflect on 20 years of the Women in Machine Learning (WiML) workshop, discussing its origins, their parallel careers in responsible AI, and the future challenges of evaluating generative AI and fostering critical thought.