Distributed training

Quantized LLM Training at Scale with ZeRO++ // Guanhua Wang // AI in Production 2025

Guanhua Wang from Microsoft's DeepSpeed team explains ZeRO++, a system that tackles the communication bottleneck in large-scale LLM training. By quantizing weights and gradients, ZeRO++ reduces communication volume by 4x, leading to training speedups of over 2x, particularly in low-bandwidth and small-batch-size environments.
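To make the quantize-before-communicate idea concrete, here is a minimal sketch of block-wise int8 quantization of a gradient shard, with per-block scales, and the byte counts that show where the bandwidth savings come from. This is an illustration only, not DeepSpeed's kernels or API: the helper names, block size, and bit-width are invented for the example, and ZeRO++ itself combines several such techniques (on weights and gradients, plus hierarchical partitioning) to reach the 4x figure cited in the talk.

```python
import torch
import torch.nn.functional as F

def blockwise_quantize(t: torch.Tensor, block: int = 256):
    """Quantize a flat tensor to int8 with one fp16 scale per block.
    Hypothetical helper for illustration; ZeRO++ uses fused CUDA kernels
    and its own block layout inside DeepSpeed."""
    flat = t.float().flatten()
    pad = (-flat.numel()) % block
    padded = F.pad(flat, (0, pad)).view(-1, block)
    scale = padded.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = (padded / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale.half(), flat.numel()

def blockwise_dequantize(q: torch.Tensor, scale: torch.Tensor, numel: int):
    """Invert the quantization on the receiving side."""
    return (q.float() * scale.float()).flatten()[:numel]

grad = torch.randn(10_000, dtype=torch.float16)    # stand-in for a gradient shard
q, scale, numel = blockwise_quantize(grad)
restored = blockwise_dequantize(q, scale, numel)

sent_fp16 = grad.numel() * grad.element_size()     # what fp16 comms would send
sent_int8 = q.numel() * q.element_size() + scale.numel() * scale.element_size()
print(f"fp16 payload: {sent_fp16} B, int8 payload: {sent_int8} B "
      f"(~{sent_fp16 / sent_int8:.1f}x smaller), "
      f"max abs error: {(grad.float() - restored).abs().max():.4f}")
```

The trade-off this makes visible: each element shrinks from 2 bytes to 1 (plus a small per-block scale), at the cost of a bounded quantization error that the training recipe has to tolerate.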

Dion: The distributed orthonormal update revolution is here

Kwangjun Ahn from Microsoft Research introduces Dion, a next-generation optimizer that improves upon Muon by using amortized power iteration. Rather than orthonormalizing the full update matrix, Dion orthonormalizes the update within a low-rank subspace, cutting both compute and communication overhead and enabling efficient, scalable training of massive models in distributed settings.
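To give "amortized power iteration over a low-rank subspace" some shape, here is a rough sketch of one such step in PyTorch. It is not the algorithm from the Dion paper, which adds error feedback, momentum handling, scaling, and a sharded distributed formulation; the function name, rank, and sizes below are invented for the illustration.

```python
import torch

def dion_style_step(M: torch.Tensor, Q: torch.Tensor):
    """One amortized power-iteration step in the spirit of Dion (rough sketch only).

    M: (m, n) momentum buffer for a weight matrix of the same shape.
    Q: (n, r) low-rank right basis carried over from the previous step, r << min(m, n).
    """
    P = M @ Q                        # project the momentum onto the current subspace
    P, _ = torch.linalg.qr(P)        # orthonormalize the left factor, shape (m, r)
    R = M.T @ P                      # refresh the right factor, shape (n, r)
    Q_next, _ = torch.linalg.qr(R)   # orthonormal basis to reuse at the next step
    update = P @ Q_next.T            # rank-r update whose nonzero singular values are all 1
    return update, Q_next

m, n, r = 1024, 512, 16
M = torch.randn(m, n)
Q, _ = torch.linalg.qr(torch.randn(n, r))      # initial basis
update, Q = dion_style_step(M, Q)
print(torch.linalg.svdvals(update)[:4])        # leading singular values ~= 1.0
```

The point of the sketch: only (m + n) x r sized factors are ever multiplied or orthonormalized, rather than the full m x n matrix as in Muon's Newton-Schulz step, which is where the compute and communication savings in distributed training come from.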