Cost optimization

Scaling the Next Paradigm of Heterogeneous Intelligence — Adrian Bertagnoli, Callosum

Adrian Bertagnoli from Callosum argues that the era of scaling monolithic models on homogeneous GPU clusters is ending. He introduces "heterogeneous intelligence," a paradigm in which model architectures, chip types, and workflows are optimized together. By routing each subtask to the most efficient model and hardware, this approach improves both cost and performance, as two results show: a 7x cost reduction on recursive reasoning tasks using Cerebras, and state-of-the-art performance on the Video Web Arena benchmark, outperforming leading GPT and Gemini models at a fraction of the cost and time.

Beyond the Basics: Production Serverless Patterns for Extreme Scale • Janak Agarwal • GOTO 2025

This presentation by Janak Agarwal from AWS provides a deep dive into scaling serverless applications for mission-critical, high-traffic workloads. It explores AWS Lambda's rapid scaling capabilities for handling extreme traffic bursts and introduces advanced patterns like Provisioned Concurrency for cost optimization during steady-state operations.

How We Cut LLM Latency 70% With TensorRT in Production

An engineering leader details his team's journey self-hosting LLMs at enterprise scale: cutting latency by 70% with TensorRT-LLM, optimizing GPU costs through counterintuitive scaling, and building a verticalized AI platform for HR tech. The talk also covers practical solutions for cold starts, KV cache optimization, and the cultural adoption of AI coding agents in engineering teams.

The Future of Search: Agents, RAG, and Why Retrieval Still Matters — Simon Eskildsen, Turbopuffer

Simon Hørup Eskildsen, founder of turbopuffer, shares his journey from scaling Shopify's infrastructure to creating a new search engine for the AI era. He discusses how a prohibitively expensive experiment at Readwise inspired him to build a cost-effective vector search solution based on object storage and NVMe. Eskildsen breaks down turbopuffer's architecture, its role in cutting costs for companies like Cursor and Notion, his philosophy on building a 'P99' engineering team, and how agentic workloads are changing the future of retrieval.

Underwriting Assist - A Multi Agent System // Somya Rai | Maria Zhang // Agents in Production 2025

Maria Zhang, CEO of Palona AI, and Somya Rai, Principal AI Engineer at EXL, discuss the architecture, scaling, memory management, and cost optimization of multi-agent systems in their respective domains of restaurants and insurance. They explore practical challenges, such as real-world bottlenecks and regulatory compliance, and share their technical stacks, including LangGraph, Ray, and NVIDIA platforms, for building robust and efficient agentic solutions.

Advancing the Cost-Quality Frontier in Agentic AI // Krista Opsahl-Ong // Agents in Production 2025

Krista Opsahl-Ong from Databricks introduces Agent Bricks, a platform designed to overcome the key challenges of productionizing enterprise AI agents. The talk covers common use cases, the difficult trade-offs between cost and quality, and how Agent Bricks uses automated evaluation and advanced optimization techniques to build cost-effective, high-performance agents.