Distributed Training

May 26, 2026

How Cursor Trained Composer on Fireworks: Distributed Infrastructure for High-Performance RL

Cursor's Federico Cassano and Fireworks' Dmytro Dzhulgakov detail their collaboration on Composer 2, a specialized foundation model for software engineering. They discuss their top-down training strategy, the infrastructure challenges of large-scale distributed reinforcement learning (RL) on sparse models, and how model specialization achieves frontier performance with superior efficiency.

May 01, 2026

Granite 4.1, IBM Bob & building a quantum ecosystem

This episode of Mixture of Experts breaks down IBM's enterprise-focused Granite 4.1 and Project Bob, Google DeepMind's DiLoCo distributed training method, the inference-efficient DeepSeek V4 model, and IBM's plan for achieving quantum advantage through strategic partnerships.

Mar 06, 2026

Efficient Distributed Orthonormal Optimizers for Large-Scale Training

Kwangjun Ahn from Microsoft Research provides a technical overview of orthonormal optimizers (like Muon and Dion2), a new class of algorithms for large-scale AI model training that are emerging as powerful successors to AdamW. The talk covers their theoretical foundations, empirical benefits, distributed implementation strategies, and practical guidelines for integration into modern training pipelines.

Jan 08, 2026

Accelerating Growth Through Optimizing GPU Usage // Sahil Khanna // AI in Production 2025

This talk traces Adobe's journey in building an AI Compute Platform to tackle the challenges of GPU optimization when training large-scale generative models like Firefly. It covers their custom-built solutions for resource management, developer productivity, and automated fault tolerance.

Sep 29, 2025

Quantized LLM Training at Scale with ZeRO++ // Guanhua Wang // AI in Production 2025

Guanhua Wang from Microsoft's DeepSpeed team explains ZeRO++, a system that tackles the communication bottleneck in large-scale LLM training. By quantizing weights and gradients, ZeRO++ reduces communication volume by 4x, leading to training speedups of over 2x, particularly in low-bandwidth and small-batch-size environments.
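
To make the idea of quantizing a tensor before communication concrete, here is a minimal sketch of block-wise symmetric int8 quantization in plain Python. It is not the ZeRO++ implementation: the function names, the block size of 256, the 4-byte (fp32) baseline, and the 8-byte per-block scales are all assumptions chosen for illustration, and the byte ratio it produces follows from those choices rather than reproducing the figures quoted in the talk.

```python
import math
from array import array

def quantize_blocks(values, block_size=256):
    """Symmetric int8 quantization with one float scale per block (hypothetical helper)."""
    codes = array("b")   # signed 8-bit integers, 1 byte each
    scales = array("d")  # one 8-byte scale per block
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        # An all-zero block gets scale 1.0 so the division below is safe.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for v in block:
            codes.append(max(-127, min(127, round(v / scale))))
    return codes, scales

def dequantize_blocks(codes, scales, block_size=256):
    """Reconstruct approximate values from int8 codes and per-block scales."""
    return [codes[i] * scales[i // block_size] for i in range(len(codes))]

def payload_bytes(codes, scales):
    """Bytes that would be sent: the int8 codes plus the per-block scales."""
    return len(codes) * codes.itemsize + len(scales) * scales.itemsize

# Toy "gradient" vector; the fp32 baseline assumes 4 bytes per value.
values = [math.sin(i * 0.37) * (1 + i % 7) for i in range(1024)]
codes, scales = quantize_blocks(values)
restored = dequantize_blocks(codes, scales)
baseline_bytes = 4 * len(values)
max_error = max(abs(a - b) for a, b in zip(values, restored))
print(baseline_bytes, payload_bytes(codes, scales), max_error)
```

The per-block scale is the design choice worth noting: a single outlier only degrades precision inside its own block, at the cost of sending a few extra bytes per block.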

Sep 24, 2025

Dion: The distributed orthonormal update revolution is here

Kwangjun Ahn from Microsoft Research introduces Dion, a next-generation optimizer that improves upon Muon by using amortized power iteration. Dion enables efficient, scalable training for massive models by orthonormalizing a low-rank subspace, reducing compute and communication overhead in distributed settings.
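
As a rough illustration of what "orthonormalizing a low-rank subspace" with power iteration can look like, the sketch below runs one power-iteration step on a small matrix in plain Python and returns a rank-r update with orthonormal left factors. It is a sketch under stated assumptions, not Dion's actual update rule: every function name is hypothetical, Gram-Schmidt stands in for whatever orthonormalization the real optimizer uses, and everything the talk covers beyond this summary (optimizer state, sharding, communication) is left out. The "amortized" part is modeled only by returning `q_next`, which a caller could reuse as the starting basis on the following step instead of restarting the iteration.

```python
import math

def matmul(a, b):
    """Dense matrix product for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def orthonormalize_columns(a):
    """Modified Gram-Schmidt; assumes the columns are linearly independent."""
    basis = []
    for col in transpose(a):
        v = list(col)
        for q in basis:
            proj = sum(x * y for x, y in zip(q, v))
            v = [x - proj * y for x, y in zip(v, q)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    return transpose(basis)

def normalize_columns(a):
    """Scale each column to unit length; assumes no column is all zeros."""
    cols = []
    for col in transpose(a):
        norm = math.sqrt(sum(x * x for x in col))
        cols.append([x / norm for x in col])
    return transpose(cols)

def low_rank_orthonormal_step(m, q):
    """One power-iteration step: m is (rows x cols), q is (cols x r)."""
    p = orthonormalize_columns(matmul(m, q))  # (rows x r), orthonormal columns
    r = matmul(transpose(m), p)               # (cols x r)
    q_next = normalize_columns(r)             # basis to reuse on the next step
    update = matmul(p, transpose(q_next))     # (rows x cols), rank at most r
    return update, q_next

# Rank-1 toy matrix: outer product of u = [2, 1, 2] and v = [1, 2].
m = [[2.0, 4.0], [1.0, 2.0], [2.0, 4.0]]
q0 = [[1.0], [1.0]]
update, q1 = low_rank_orthonormal_step(m, q0)
print(update)
print(q1)
```

With rank r much smaller than the matrix dimensions, only the thin factors `p` and `q_next` need to be computed and exchanged, which is the intuition behind the reduced compute and communication the summary mentions.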