Inference

Flipping the Inference Stack — Robert Wachen, Etched

The current AI inference stack, built on general-purpose GPUs, is economically and technically unsustainable for real-time AI at scale. Robert Wachen, co-founder of Etched, argues that the future lies in specialized hardware such as Transformer-specific ASICs, which can unlock currently bottlenecked applications like real-time video, code generation, and large-scale enterprise deployments by solving critical latency and cost-per-user challenges.

Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten

A deep dive into SGLang, an open-source serving framework for large language models. The talk covers its core features and history, performance optimization techniques such as CUDA Graphs and EAGLE-3 speculative decoding, and how to contribute to the project.
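
To make the serving workflow concrete, here is a minimal sketch of launching an SGLang server and querying it through its OpenAI-compatible endpoint. The model name, port, and sampling parameters are illustrative placeholders, not recommendations from the talk.

```python
# Minimal sketch: serving a model with SGLang and querying it via the
# OpenAI-compatible API. Model path and port are placeholders.
#
# 1. Launch the server in a shell:
#    python -m sglang.launch_server \
#        --model-path meta-llama/Llama-3.1-8B-Instruct \
#        --port 30000

from openai import OpenAI

# SGLang exposes an OpenAI-compatible endpoint, so the standard OpenAI
# client works; the api_key value is not checked by a local server.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Explain speculative decoding in one sentence."}
    ],
    temperature=0.7,
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Optimizations such as CUDA Graphs and EAGLE-3 speculative decoding are configured through server launch flags; consult the SGLang documentation for the exact flag names in your version.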

How DeepL Built a Translation Powerhouse with AI with CEO Jarek Kutylowski

Jarek Kutylowski, CEO of DeepL, discusses the company's technical strategy for competing with large language models in the translation space. He covers their focus on specialized model architectures, the critical role of curated data, the engineering challenges of building custom GPU data centers and large-scale inference systems, and the future of AI-driven translation in enterprise workflows.