Distributed training

Accelerating Growth Through Optimizing GPU Usage // Sahil Khanna // AI in Production 2025

Sahil Khanna recounts Adobe's journey in building a sophisticated AI Compute Platform to tackle the immense challenges of GPU optimization for training large-scale generative models like Firefly. The talk covers Adobe's custom-built solutions for resource management, developer productivity, and automated fault tolerance.

Quantized LLM Training at Scale with ZeRO++ // Guanhua Wang // AI in Production 2025

Guanhua Wang from Microsoft's DeepSpeed team explains ZeRO++, a system that tackles the communication bottleneck in large-scale LLM training. By quantizing weights and gradients, ZeRO++ reduces communication volume by 4x, leading to training speedups of over 2x, particularly in low-bandwidth and small-batch-size environments.
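To make the communication-volume claim concrete, here is a minimal sketch of block-wise symmetric int8 quantization in NumPy. It is purely illustrative, not DeepSpeed's actual kernels: the 4x figure below compares an int8 payload against the fp32 tensor it replaces, whereas ZeRO++ applies different quantization schemes (and ratios) to weights and gradients in its collectives.

```python
import numpy as np

def quantize_int8(x, block_size=256):
    """Block-wise symmetric int8 quantization (illustrative sketch only)."""
    flat = x.astype(np.float32).ravel()
    pad = (-len(flat)) % block_size          # pad so blocks divide evenly
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                # avoid divide-by-zero on all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_int8(q, scales):
    return (q.astype(np.float32) * scales).ravel()

grad = np.random.default_rng(0).standard_normal(1 << 16).astype(np.float32)
q, s = quantize_int8(grad)
fp32_bytes = grad.nbytes                     # what full-precision comms would send
int8_bytes = q.nbytes + s.nbytes             # quantized payload plus per-block scales
restored = dequantize_int8(q, s)[: grad.size]
max_err = np.abs(restored - grad).max()
```

The per-block scales keep the rounding error proportional to each block's magnitude; their overhead is small (one fp32 value per 256 elements), so the payload shrinks by close to 4x versus fp32.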

Dion: The distributed orthonormal update revolution is here

Kwangjun Ahn from Microsoft Research introduces Dion, a next-generation optimizer that improves upon Muon by using amortized power iteration. Dion enables efficient, scalable training for massive models by orthonormalizing a low-rank subspace, reducing compute and communication overhead in distributed settings.
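The core idea can be sketched as subspace (power) iteration: repeatedly multiply a tracked low-rank basis by the matrix of interest, then re-orthonormalize it with a QR factorization. The NumPy example below is a hypothetical illustration of that primitive under a synthetic matrix with a clear spectral gap, not the Dion optimizer itself (Dion amortizes such steps across optimizer updates on the momentum matrix).

```python
import numpy as np

def subspace_iter_step(M, Q):
    """One power-iteration step toward the top-r left singular subspace of M.

    Q (m x r) has orthonormal columns; multiplying by M M^T pulls it toward
    the dominant subspace, and QR restores orthonormality.
    """
    Z = M @ (M.T @ Q)
    Q_new, _ = np.linalg.qr(Z)
    return Q_new

rng = np.random.default_rng(0)
m, n, r = 64, 32, 4
# Synthetic matrix with 4 dominant singular values, so iteration converges fast.
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.array([10.0, 9.0, 8.0, 7.0] + [0.5] * (n - 4))
M = U @ np.diag(s) @ V.T

Q, _ = np.linalg.qr(rng.standard_normal((m, r)))   # random orthonormal start
for _ in range(10):
    Q = subspace_iter_step(M, Q)

captured = np.linalg.norm(M.T @ Q)   # energy of M inside the tracked subspace
top = np.linalg.norm(s[:r])          # best achievable for any rank-r subspace
```

Because only an m x r basis is updated and orthonormalized (rather than a full m x n matrix), both the compute and the volume that must be communicated between workers scale with the low rank r.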