Latency reduction

You Might Not Need 50 Diffusion Steps — Ziv Ilan, Nvidia

Ziv Ilan of NVIDIA explains how to cut latency in video diffusion models far enough to reach real-time generation. He presents a layered approach: dynamic quantization to reduce memory use and speed up compute, chunk-based caching to skip redundant denoising computations, and, most critically, step distillation, which trains a model to produce high-quality output in significantly fewer steps. The techniques are packaged in the open-source FastGen repository, and their performance gains are additive, enabling real-time video on a single Blackwell B200 GPU.

LLM Compression Explained: Build Faster, Efficient AI Models

Learn why model compression and quantization are essential for optimizing Large Language Model (LLM) performance and significantly reducing inference costs in production. This deep dive covers practical examples, benefits such as lower latency and higher throughput, and strategies for different AI use cases, showing how to deploy scalable AI with minimal accuracy degradation.
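
To make the quantization idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is an illustration only, not code from the video or from any particular library: the function names and the sample weights are hypothetical, and production toolkits typically use per-channel scales, calibration data, and fused low-precision kernels.

```python
def quantize_int8(weights):
    """Map floats to int8 codes in [-127, 127] using one shared scale (assumed scheme)."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.42, -1.27, 0.003, 0.9, -0.55]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# Rounding to the nearest code bounds the per-weight error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored in one byte instead of four (for float32), which is where the memory saving comes from. The accuracy cost is the rounding error, which is at most half of `scale` per weight under this scheme.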