Posts

OpenAI on OpenAI: Stacie Faggioli, Business Finance Officer Applications, OpenAI

OpenAI's finance team shows how it has transformed its operations using AI tools like ChatGPT, ChatGPT for Excel, and custom agents built with Codex. The talk covers the principles behind the change (AI-native design, headcount leverage, and rapid iteration) and specific applications that boost individual productivity and organizational efficiency: investor relations, LBO modeling, marketing analytics, sales insights, and financial reporting automation, plus agent-driven procurement, credit checks, contract review, and vendor risk management.

Building safe Payment Infrastructure for the autonomous economy — Steve Kaliski, Stripe

This talk addresses the challenge of enabling AI agents to spend money autonomously and safely. Steve Kaliski from Stripe presents a framework for separating non-deterministic discovery from deterministic transactions. He introduces three key components of Stripe's solution: Shared Payment Tokens for secure credential sharing with enforced spending limits, the Machine Payments Protocol for paying for API tool calls, and the Agentic Commerce Protocol (ACP) for structured, API-driven e-commerce checkouts. Through code examples, the talk demonstrates how these primitives create a secure and auditable payment infrastructure for the emerging autonomous economy.
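To make the idea of a credential with an enforced spending limit concrete, here is a minimal sketch. It is not Stripe's API: the class `ScopedPaymentToken`, its fields, and the `authorize` method are hypothetical names chosen for illustration. The point is only that the limit check is deterministic and lives outside the agent's non-deterministic reasoning.

```python
from dataclasses import dataclass


@dataclass
class ScopedPaymentToken:
    """Hypothetical spend-limited credential (illustrative, not Stripe's API)."""
    merchant_id: str       # the only merchant this token may pay
    limit_cents: int       # total amount the agent is allowed to spend
    spent_cents: int = 0   # running total of approved charges

    def authorize(self, merchant_id: str, amount_cents: int) -> bool:
        """Approve a charge only if it matches the scope and stays under the limit."""
        if merchant_id != self.merchant_id:
            return False
        if amount_cents <= 0:
            return False
        if self.spent_cents + amount_cents > self.limit_cents:
            return False
        self.spent_cents += amount_cents
        return True


token = ScopedPaymentToken(merchant_id="m_1", limit_cents=5000)
first = token.authorize("m_1", 3000)    # within the limit
second = token.authorize("m_1", 2500)   # would exceed the limit
other = token.authorize("m_2", 100)     # wrong merchant
```

Under these assumptions the first charge is approved and the other two are rejected, whatever the agent decided to attempt.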

Building Agent Interfaces: Lessons from Chrome DevTools (MCP) for Agents — Michael Hablich, Google

Michael Hablich from the Chrome DevTools team shares hard-won engineering lessons on building effective and secure interfaces for AI agents. The talk covers moving from raw data to semantic summaries, measuring interface efficiency with 'tokens per successful outcome', designing for error recovery, and the critical importance of trust boundaries and deliberate friction in UI design for agents.
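The summary does not give the talk's exact definition of "tokens per successful outcome", so the following is one plausible reading, sketched under stated assumptions: total tokens spent across all attempts, including failed ones, divided by the number of successful attempts. The `Run` record and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One agent attempt at a task (hypothetical record for illustration)."""
    tokens_used: int
    succeeded: bool


def tokens_per_successful_outcome(runs: list[Run]) -> float:
    """Total tokens across all runs divided by the number of successful runs.

    Assumption: failed runs still count toward the token total, since their
    cost was paid on the way to the successes.
    """
    successes = sum(1 for r in runs if r.succeeded)
    if successes == 0:
        return float("inf")  # no successes: the cost per success is unbounded
    total_tokens = sum(r.tokens_used for r in runs)
    return total_tokens / successes


runs = [Run(1000, True), Run(3000, False), Run(2000, True)]
metric = tokens_per_successful_outcome(runs)  # 6000 tokens / 2 successes = 3000.0
```

Counting failed attempts in the numerator means an interface that causes many retries scores worse even when each individual call is cheap.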

How a reasoning model cracked an 80-year-old math problem — the OpenAI Podcast Ep. 20

OpenAI's reasoning researchers discuss how a general-purpose AI model disproved an 80-year-old conjecture from mathematician Paul Erdős. They detail the journey from initial IMO/IOI breakthroughs to the verification of the proof, highlighting the model's creative application of advanced number theory. The episode explores the profound implications for the future of mathematics, AI-human collaboration, and the broader scientific landscape, offering advice for researchers seeking to leverage AI for groundbreaking discoveries.

The Art & Science of Benchmarking Agents — Vincent Chen, Snorkel AI

Vincent Chen of Snorkel AI discusses the crucial gap between rapidly advancing AI capabilities and our ability to measure them. He presents a framework for building effective benchmarks, encompassing task quality, distributional diversity, model headroom, and robust evaluation methodologies, alongside the "art" of having a clear thesis, inspiring research roadmaps, and prioritizing researcher UX. He concludes by outlining three critical axes for future benchmarks: environment complexity, autonomy horizon, and output complexity, to better reflect real-world AI applications.

SWE-rebench: Lessons from Evaluating Coding Agents — Ibragim Badertdinov, Nebius

Ibragim Badertdinov from Nebius AI shares lessons from building and maintaining SWE-rebench, a monthly leaderboard that evaluates coding agents on fresh, real-world software engineering tasks. The talk covers the anatomy of a good benchmark task, the challenges of filtering out noisy or flawed problems, and examples of how advanced agents like Claude Code "cheat" by exploiting the environment. Finally, it explains how the same pipeline used for evaluation has produced large-scale, high-quality training datasets, such as the SWE-rebench dataset, used by frontier AI labs.