MLOps

Why Language Models Need a Lesson in Education

Stephanie Kirmer, a staff machine learning engineer at DataGrail, draws on her experience as a former professor to address the challenge of evaluating LLMs in production. She proposes a robust methodology using LLM-based evaluators guided by rigorous, human-calibrated rubrics to bring objectivity and scalability to the subjective task of assessing text generation quality.

EDD: The Science of Improving AI Agents // Shahul Elavakkattil Shereef // Agents in Production 2025

This talk introduces Eval-Driven Development (EDD) as a scientific alternative to 'vibe-based' iteration for improving AI agents. It covers quantitative evaluation (choosing strong end-to-end metrics, aligning LLM judges) and qualitative evaluation (using error and attribution analysis to debug failures), providing a structured framework for consistent agent improvement.

When Agents Hire Their Own Team: Inside Hypermode’s Concierge // Ryan Fox-Tyler

Ryan Fox-Tyler from Hypermode explains their philosophy of empowering AI agents to design and deploy other agents. He introduces Concierge, an agent that builds other agents, and details the underlying actor-based runtime built for scalability, fault tolerance, and efficient, event-driven execution of thousands of parallel agent instances.

The Truth About LLM Training

Paul van der Boor and Zulkuf Genc from Prosus discuss the practical realities of deploying AI agents in production. They cover their in-house evaluation framework, strategies for navigating the GPU market, the importance of fine-tuning over building from scratch, and how they use AI to analyze usage patterns in a privacy-preserving manner.

The Hidden Bottlenecks Slowing Down AI Agents

Paul van der Boor and Bruce Martens from Prosus discuss the real bottlenecks in AI agent development, arguing that the primary challenges are not tools but rather evaluation, data quality, and feedback loops. They detail their 'buy-first' philosophy, the practical reasons they often build in-house, and how new coding agents like Devin and Cursor are changing their development workflows.

Enterprise AI Adoption Challenges

Paul van der Boor and Sean Kenny from Prosus detail the journey of Toqan, an internal AI platform that evolved from a Slack experiment into a sophisticated agentic system. They share insights on driving enterprise adoption, key metrics for measuring productivity, and their future vision of an "AI Workforce" where employees architect AI agents to automate complex, cross-system tasks.