Iterating on Your AI Evals // Mariana Prazeres // Agents in Production 2025

Moving an AI agent from a promising demo to a reliable product is challenging. This talk presents a startup-friendly, iterative process for building robust evaluation frameworks, emphasizing that you must iterate on the evaluations themselves—the metrics and the data—not just the prompts and models. It outlines a practical "crawl, walk, run" approach, starting with simple heuristics and scaling to an advanced system with automated checks and human-in-the-loop validation.
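
To make the "crawl" stage concrete, here is a minimal sketch of what a first-pass eval harness might look like: cheap, deterministic heuristics run over a small hand-curated dataset. The eval cases, the `heuristic_score` check, and the `agent` callable are all illustrative assumptions, not code from the talk.

```python
# Hypothetical "crawl" stage: deterministic heuristics over a tiny hand-labeled
# eval set. Iterating on EVAL_SET and heuristic_score (the data and the metric,
# not just the prompt) is the point the talk emphasizes.

EVAL_SET = [
    {"question": "What is our refund window?", "must_contain": ["30 days"]},
    {"question": "Do you ship to Canada?", "must_contain": ["yes", "canada"]},
]

def heuristic_score(answer: str, must_contain: list[str]) -> float:
    """Fraction of required phrases present (case-insensitive)."""
    answer = answer.lower()
    hits = sum(phrase.lower() in answer for phrase in must_contain)
    return hits / len(must_contain)

def run_evals(agent) -> float:
    """Run `agent` (any callable: question -> answer) over the eval set."""
    scores = [heuristic_score(agent(case["question"]), case["must_contain"])
              for case in EVAL_SET]
    return sum(scores) / len(scores)
```

At the "walk" and "run" stages described in the talk, these string checks would be augmented or replaced by automated LLM-judge checks and human-in-the-loop review of the labels.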

Building an Agentic Platform — Ben Kus, CTO Box

Ben Kus, CTO of Box, outlines the technical evolution of their AI platform, detailing the transition from a promising but fragile LLM-based metadata extraction system to a robust, scalable agentic architecture. He explains why this shift was necessary to handle enterprise-level complexity and the key lessons learned.
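
A hedged sketch of the architectural difference at issue: a fragile single-shot extraction call versus an agentic loop that validates its own output and retries. The `llm` callable (assumed here to return a parsed dict), the schema check, and the retry policy are illustrative assumptions, not Box's actual design.

```python
# Illustrative contrast, not Box's code: with single-shot extraction, one bad
# completion is the final answer; an agentic loop can catch and correct it.

def extract_once(llm, document: str, schema: dict) -> dict:
    """Fragile single-shot: whatever the model returns is the answer."""
    return llm(f"Extract fields {list(schema)} from:\n{document}")

def extract_agentically(llm, document: str, schema: dict, max_steps: int = 3) -> dict:
    """Agentic loop: validate the output, feed errors back, retry."""
    feedback = ""
    for _ in range(max_steps):
        result = llm(f"Extract fields {list(schema)} from:\n{document}\n{feedback}")
        missing = [field for field in schema if field not in result]
        if not missing:
            return result
        feedback = f"Previous attempt missed fields: {missing}. Try again."
    return result  # best effort after max_steps attempts
```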

Five Hard-Earned Lessons About Evals — Ankur Goyal, Braintrust

Building successful AI applications requires a sophisticated engineering approach that goes beyond prompt engineering. This involves creating intentionally engineered evaluations (evals) that reflect user feedback, focusing on "context engineering" to optimize tool definitions and outputs, and maintaining a flexible, model-agnostic architecture to adapt to the rapidly evolving AI landscape.
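
A hedged sketch of these two ideas together: a tool whose definition and output format are deliberately engineered as context for the model, sitting behind a model-agnostic seam. The `ChatModel` protocol, tool schema, and helper names are assumptions for illustration, not Braintrust's API.

```python
from typing import Protocol

class ChatModel(Protocol):
    """Model-agnostic seam: any provider satisfying this can be swapped in."""
    def complete(self, messages: list[dict], tools: list[dict]) -> str: ...

# "Context engineering": the tool's name, description, and output format are
# prompt surface to iterate on, not an afterthought.
SEARCH_TOOL = {
    "name": "search_docs",
    "description": (
        "Search internal docs. Use for factual questions about our product. "
        "Returns at most 3 snippets; cite the snippet id in your answer."
    ),
    "parameters": {"query": {"type": "string"}},
}

def format_tool_output(snippets: list[str], limit: int = 3) -> str:
    """Engineer the tool's output too: trim and label it for the model."""
    return "\n".join(f"[{i}] {s[:300]}" for i, s in enumerate(snippets[:limit]))
```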

How BlackRock Builds Custom Knowledge Apps at Scale — Vaibhav Page & Infant Vasanth, BlackRock

BlackRock engineers Vaibhav Page and Infant Vasanth introduce a modular, Kubernetes-native AI framework designed to accelerate the development of custom knowledge applications for investment operations, reducing deployment time from months to days.
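
One way to picture the "modular" idea: knowledge apps declared as compositions of reusable, registered components, so teams configure pipelines rather than rebuild them. The registry pattern, component names, and spec shape below are invented for this sketch and are not BlackRock's actual framework.

```python
# Hypothetical sketch of a modular app framework: stages are pulled from a
# registry by name, so a new knowledge app is mostly declarative config
# (the kind of spec a Kubernetes-style control plane could reconcile).

REGISTRY: dict[str, callable] = {}

def component(name: str):
    """Register a reusable building block under a stable name."""
    def wrap(fn):
        REGISTRY[name] = fn
        return fn
    return wrap

@component("pdf_ingest")
def pdf_ingest(cfg): ...        # stub: ingest documents per cfg

@component("vector_retriever")
def vector_retriever(cfg): ...  # stub: build a retriever per cfg

APP_SPEC = {
    "ingest": {"component": "pdf_ingest", "source": "s3://..."},
    "retrieval": {"component": "vector_retriever", "top_k": 5},
}

def build_app(spec: dict) -> dict:
    """Instantiate each stage of the app from the registry."""
    return {stage: REGISTRY[cfg["component"]](cfg) for stage, cfg in spec.items()}
```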

Interpretability: Understanding how AI models think

Members of Anthropic's interpretability team discuss their research into the inner workings of large language models. They explore the analogy of studying AI as a biological system, the surprising discovery of internal "features" or concepts, and why this research is critical for understanding model behavior like hallucinations, sycophancy, and long-term planning, ultimately aiming to ensure AI safety.
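
To make the "features" idea concrete: interpretability work of this kind often represents a concept as a direction in activation space, so a feature's strength on a given input is just a projection. A toy sketch with random stand-in vectors (not real model internals):

```python
import numpy as np

# Toy illustration of "feature as a direction": a concept is a unit vector in
# activation space, and its activation on an input is a dot product.
d_model = 512
rng = np.random.default_rng(0)

activation = rng.normal(size=d_model)        # stand-in residual-stream activation
feature_dir = rng.normal(size=d_model)
feature_dir /= np.linalg.norm(feature_dir)   # unit-norm feature direction

feature_activation = activation @ feature_dir  # how strongly the concept fires
print(f"feature activation: {feature_activation:.3f}")
```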

This Week in AI: GPT-5 Ships, 4o Pulled Back, Grok Imagine Goes Social

a16z partners Olivia and Justine Moore discuss the latest in consumer AI, including Grok Imagine's fast, uniquely social image generation, Google's interactive world model Genie 3, the user backlash to GPT-5's personality changes, ElevenLabs' licensed AI music model, and the emerging fragmentation of "vibecoding" platforms between technical and non-technical users.