Tool use

Stop Making Models Bigger, Make Them Behave — Kobie Crawdord, Snorkel

Stop Making Models Bigger, Make Them Behave — Kobie Crawdord, Snorkel

Snorkel.ai's research demonstrates how a 4-billion-parameter model, fine-tuned with Reinforcement Learning for under $500, significantly outperformed a 235-billion-parameter model on financial analysis tool-use tasks. The key was cultivating 'tool discipline' and error correction capabilities, rather than relying on sheer model size or deeper reasoning. Single-table training generalized effectively to harder multi-table problems, emphasizing the importance of targeted behavioral fixes identified through detailed evaluation rubrics.

Building Agent Interfaces: Lessons from Chrome DevTools (MCP) for Agents — Michael Hablich, Google

Building Agent Interfaces: Lessons from Chrome DevTools (MCP) for Agents — Michael Hablich, Google

Michael Hablich from the Chrome DevTools team shares hard-won engineering lessons on building effective and secure interfaces for AI agents. The talk covers moving from raw data to semantic summaries, measuring interface efficiency with 'tokens per successful outcome', designing for error recovery, and the critical importance of trust boundaries and deliberate friction in UI design for agents.

SWE-rebench: Lessons from Evaluating Coding Agents — Ibragim Badertdinov, Nebius

SWE-rebench: Lessons from Evaluating Coding Agents — Ibragim Badertdinov, Nebius

Ibragim Badertdinov from Nebius AI shares lessons from building and maintaining SWE-ReBench, a monthly leaderboard that evaluates coding agents on fresh, real-world software engineering tasks. The talk covers the anatomy of a good benchmark task, the challenges of filtering out noisy or flawed problems, and fascinating examples of how advanced models like Claude Code "cheat" by exploiting the environment. Finally, it explains how the same pipeline used for evaluation has produced large-scale, high-quality training datasets like SWE-bench, used by frontier AI labs.

Give Your Agent a Computer — Nico Albanese, Vercel

Give Your Agent a Computer — Nico Albanese, Vercel

Nico Albanese from Vercel demonstrates how to build a stateful, learning AI agent from scratch using AI SDK v6. The workshop covers the core components: a tool loop, provider-executed tools like web search, end-to-end type safety, and Vercel's new persistent named sandboxes, which give the agent a file system to persist state, memory, and even self-generated tools across sessions.

A Piece of Pi: Embedding The OpenClaw Coding Agent In Your Product — Matthias Luebken, Tavon

A Piece of Pi: Embedding The OpenClaw Coding Agent In Your Product — Matthias Luebken, Tavon

Matthias Luebken explains the core principle of building with coding agents: make things easy for them. This talk deconstructs the Pi SDK, showing how a simple loop of an LLM calling CLI tools can lead to emergent capabilities. Luebken presents a real-world B2B sales pipeline built on this principle, where agents handle incoming emails, query CRM/ERP data via simple tools, and generate draft responses, keeping the human in their familiar email client.

Agentic Search for Context Engineering — Leonie Monigatti, Elastic

Agentic Search for Context Engineering — Leonie Monigatti, Elastic

Leonie Monigatti from Elastic provides a practical guide to agentic search, arguing that effective context engineering is not just a retrieval problem, but a search problem. The workshop explores the trade-offs between specialized tools (like semantic search) and general-purpose tools (like shell and SQL execution), offering a "low floor, high ceiling" framework for building a robust and efficient retrieval stack for AI agents.