Latency

The Latency Goldilocks Zone Explained

Rafael Borger and Daniel Wolbert from iFood discuss the engineering and product strategy behind ILO-Agent, their conversational AI for 200 million users. They cover hyper-personalized recommendation systems, the "Latency Goldilocks Zone" where AI responses can be too fast for users to trust, and the architectural challenges of building multi-channel agents for text and voice.

Voice AI: when is the "Her" moment? — Neil Zeghidour, Gradium AI

Neil Zeghidour, CEO of Gradium AI, deconstructs the gap between current voice AI and the "Her" ideal. He argues that while cascaded systems are practical, they are architecturally flawed for natural conversation. The future lies in full-duplex, speech-to-speech models that not only solve latency but also integrate deep paralinguistic understanding and overcome significant cost barriers.

Build Hour: Prompt Caching

Explore prompt caching to significantly reduce latency and costs for your AI applications. This guide breaks down the mechanics of KV caching, best practices for maximizing cache hits using `prompt_cache_key` and the Responses API, and real-world implementation insights from the agentic development platform, Warp.

Inference at Scale: Breaking the Memory Wall

Sid Sheth, CEO of d-matrix, details their memory-centric approach to AI inference hardware, focusing on their Digital In-Memory Compute (DIMC) architecture. He explains how DIMC, an augmented SRAM technology, minimizes data movement to solve the memory bottleneck, delivering significant gains in latency and energy efficiency, particularly for the 'decode' phase of large language models.

Building Voice Agents Just Got Easier

Anoop Dawar from Deepgram discusses the evolution of voice AI, from basic transcription to sophisticated, real-time voice agents. He covers the key technical challenges in production, such as latency and interruption handling, and introduces Deepgram's Flux system. The talk concludes with a look at the future of speech-to-speech models that can understand emotional nuance, moving closer to passing the audio Turing Test.

Your realtime AI is ngmi — Sean DuBois (OpenAI), Kwindla Kramer (Daily)

Sean DuBois (OpenAI, Pion) and Kwindla Hultman Kramer (Daily, Pipecat) argue that to build successful real-time AI applications, developers must start from the network layer up, prioritizing WebRTC over WebSockets to manage latency effectively and enable advanced features like interruption and state management.