Large-scale training

Efficient Distributed Orthonormal Optimizers for Large-Scale Training

Kwangjun Ahn of Microsoft Research gives a technical overview of orthonormal optimizers such as Muon and Dion2, a new class of algorithms for large-scale AI model training that is emerging as a successor to AdamW. The talk covers their theoretical foundations, empirical benefits, distributed implementation strategies, and practical guidelines for integrating them into modern training pipelines.
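
As a rough illustration of what "orthonormal" means here: for a momentum or gradient matrix M with singular value decomposition M = U S V^T, the orthonormalized update is U V^T, which keeps the singular directions and sets every singular value to 1. The sketch below is not taken from the talk. It assumes NumPy, uses hypothetical function names, and substitutes the classical cubic Newton-Schulz iteration for the tuned polynomial that Muon uses in practice.

```python
import numpy as np

def orthonormalize_svd(M):
    """Exact orthonormal factor U @ Vt of M = U @ diag(s) @ Vt (reference version)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def orthonormalize_newton_schulz(M, steps=30):
    """Approximate U @ Vt using only matrix multiplications.

    Classical cubic Newton-Schulz iteration; assumes M has full rank.
    """
    # Frobenius scaling puts every singular value in (0, 1].
    X = M / (np.linalg.norm(M) + 1e-12)
    for _ in range(steps):
        # Each step maps a singular value s to 1.5*s - 0.5*s**3, which moves it toward 1.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Hypothetical usage on a small random "momentum" matrix.
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))
update = orthonormalize_newton_schulz(M)
```

The iterative version avoids an explicit SVD, which is the main reason matrix-multiplication-only schemes are attractive on accelerators.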

Dion: The distributed orthonormal update revolution is here

Kwangjun Ahn from Microsoft Research introduces Dion, a next-generation optimizer that improves upon Muon by using amortized power iteration. Dion enables efficient, scalable training for massive models by orthonormalizing a low-rank subspace, reducing compute and communication overhead in distributed settings.
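
The idea of amortized power iteration on a low-rank subspace can be sketched as follows. This is a simplified illustration, not the Dion implementation: it assumes NumPy, the function name `amortized_power_step` and the rank-2 setup are hypothetical, and it leaves out momentum handling and any distributed sharding. The point it shows is that a single power-iteration step per optimizer step, warm-started from the previous right basis Q, yields a rank-r update whose left factor P has orthonormal columns.

```python
import numpy as np

def amortized_power_step(B, Q):
    """One warm-started power-iteration step on B (shape m x n).

    Q (shape n x r) is the right basis carried over from the previous call.
    Returns P (m x r, orthonormal columns) and the refreshed Q (n x r, unit-norm columns).
    """
    # Orthonormalize the r columns of B @ Q; only an m x r matrix is factored.
    P, _ = np.linalg.qr(B @ Q)
    # Project back to obtain the new right factor.
    R = B.T @ P
    # Normalize each column so the rank-r update P @ Q.T has singular values near 1.
    col_norms = np.maximum(np.linalg.norm(R, axis=0, keepdims=True), 1e-12)
    return P, R / col_norms

# Hypothetical usage: reuse Q across steps instead of restarting the iteration.
rng = np.random.default_rng(0)
B = rng.standard_normal((8, 6))   # stands in for a momentum/gradient matrix
Q = rng.standard_normal((6, 2))   # rank r = 2 right basis
for _ in range(3):                # in training, B would change slightly every step
    P, Q = amortized_power_step(B, Q)
low_rank_update = P @ Q.T
```

Because only the m x r and n x r factors are orthonormalized and carried between steps, the work and the data that would need to be exchanged scale with the rank r rather than with the full matrix size.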