Distributed training

How Cursor Trained Composer on Fireworks: Distributed Infrastructure for High-Performance RL

Cursor's Federico Cassano and Fireworks' Dmytro Dzhulgakov detail their collaboration on Composer 2, a specialized foundation model for software engineering. They discuss their top-down training strategy, the infrastructure challenges of large-scale distributed reinforcement learning (RL) on sparse models, and how model specialization achieves frontier performance with greater efficiency.

Granite 4.1, IBM Bob & building a quantum ecosystem

This episode of Mixture of Experts breaks down IBM's enterprise-focused Granite 4.1 and Project Bob, Google DeepMind's DiLoCo distributed training method, the inference-efficient DeepSeek V4 model, and IBM's strategy for achieving quantum advantage through strategic partnerships.

Efficient Distributed Orthonormal Optimizers for Large-Scale Training

Kwangjun Ahn from Microsoft Research provides a technical overview of orthonormal optimizers (like Muon and Dion2), a new class of algorithms for large-scale AI model training that are emerging as powerful successors to AdamW. The talk covers their theoretical foundations, empirical benefits, distributed implementation strategies, and practical guidelines for integration into modern training pipelines.
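To make "orthonormal update" concrete, here is a minimal NumPy sketch that replaces a gradient matrix with an approximately semi-orthogonal one using the classical cubic Newton-Schulz iteration. It is an illustration under stated assumptions, not the implementation from the talk: the function name `orthonormalize_newton_schulz`, the step count, and the random test matrix are all hypothetical. Muon-style implementations typically apply a tuned polynomial for only a few steps to the momentum matrix rather than running the plain cubic iteration to convergence.

```python
import numpy as np

def orthonormalize_newton_schulz(g, steps=30):
    """Approximate the polar factor of g (the nearest semi-orthogonal matrix)."""
    # Work in the wide orientation so that x @ x.T is the smaller Gram matrix.
    transposed = g.shape[0] > g.shape[1]
    x = g.T if transposed else g
    # Frobenius scaling puts every singular value in (0, 1],
    # which is inside the convergence region of the iteration.
    x = x / (np.linalg.norm(x) + 1e-12)
    for _ in range(steps):
        # Cubic Newton-Schulz step: maps each singular value s to 1.5*s - 0.5*s**3,
        # which pushes it toward 1 while leaving singular vectors unchanged.
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transposed else x

# Hypothetical stand-in for a layer's gradient (or momentum) matrix.
rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 8))
update = orthonormalize_newton_schulz(grad)
```

The resulting `update` keeps the singular vectors of `grad` but sets all singular values to roughly 1, so every direction in the update gets equal weight regardless of the gradient's scale.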

Accelerating Growth Through Optimizing GPU Usage // Sahil Khanna // AI in Production 2025

This talk traces Adobe's journey in building an AI Compute Platform to tackle the challenges of GPU optimization when training large-scale generative models like Firefly. It covers their custom-built solutions for resource management, developer productivity, and automated fault tolerance.

Quantized LLM Training at Scale with ZeRO++ // Guanhua Wang // AI in Production 2025

Guanhua Wang from Microsoft's DeepSpeed team explains ZeRO++, a system that tackles the communication bottleneck in large-scale LLM training. By quantizing weights and gradients, ZeRO++ reduces communication volume by 4x, leading to training speedups of over 2x, particularly in low-bandwidth and small-batch-size environments.
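The basic idea of shrinking a communication payload by quantizing it can be sketched in a few lines of NumPy. This is an illustrative block-wise int8 scheme, not DeepSpeed's implementation: the function names, the block size of 64, and the random weights are assumptions made for the example. It shows only the payload-size effect of sending low-precision values plus per-block scales; the 4x figure quoted above refers to the full system, not to this sketch.

```python
import numpy as np

def quantize_blockwise_int8(x, block_size=64):
    """Symmetric int8 quantization with one float32 scale per block."""
    blocks = x.astype(np.float32).reshape(-1, block_size)
    # One scale per block keeps a single outlier from degrading the whole tensor.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales).astype(np.float32)
    q = np.clip(np.rint(blocks / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize_blockwise_int8(q, scales, shape):
    """Recover an approximate float32 tensor on the receiving side."""
    return (q.astype(np.float32) * scales).reshape(shape)

# Hypothetical fp16 weight shard that would otherwise be sent as-is.
rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 64)).astype(np.float16)

q, scales = quantize_blockwise_int8(weights)
restored = dequantize_blockwise_int8(q, scales, weights.shape)

sent_fp16 = weights.nbytes               # bytes on the wire without quantization
sent_int8 = q.nbytes + scales.nbytes     # int8 payload plus per-block scales
max_error = float(np.abs(restored - weights.astype(np.float32)).max())
```

Comparing `sent_int8` with `sent_fp16` shows the saving, and `max_error` is bounded by half of the largest block scale, which is the price paid in precision.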

Dion: The distributed orthonormal update revolution is here

Kwangjun Ahn from Microsoft Research introduces Dion, a next-generation optimizer that improves upon Muon by using amortized power iteration. Dion enables efficient, scalable training for massive models by orthonormalizing a low-rank subspace, reducing compute and communication overhead in distributed settings.
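A single-device sketch of what "amortized power iteration" on a low-rank subspace might look like follows. It is a simplification built from the description above, not the reference Dion code: the function name `low_rank_orthonormal_step`, the momentum factor `mu`, the rank, and the matrix shapes are assumptions, and all distributed sharding is omitted. The amortization comes from running just one power-iteration step per optimizer step, warm-started from the previous step's basis.

```python
import numpy as np

def low_rank_orthonormal_step(momentum, grad, q_prev, mu=0.95):
    """One simplified update: returns (update, new_momentum, new_q)."""
    buf = momentum + grad                    # fold the new gradient into the buffer
    # One power-iteration step, warm-started from the previous right basis.
    p, _ = np.linalg.qr(buf @ q_prev)        # (m, r) with orthonormal columns
    r = buf.T @ p                            # (n, r) right factor
    # Error feedback (assumed form): retain what the rank-r approximation missed.
    new_momentum = buf - (1.0 - mu) * (p @ r.T)
    # Column-normalize the right factor so it can seed the next step.
    new_q = r / (np.linalg.norm(r, axis=0, keepdims=True) + 1e-12)
    update = p @ new_q.T                     # rank-r update direction
    return update, new_momentum, new_q

# Hypothetical shapes: a 16 x 12 weight matrix and a rank-4 subspace.
rng = np.random.default_rng(0)
m, n, rank = 16, 12, 4
momentum = np.zeros((m, n))
q, _ = np.linalg.qr(rng.standard_normal((n, rank)))

for _ in range(3):
    grad = rng.standard_normal((m, n))       # stand-in for a real gradient
    update, momentum, q = low_rank_orthonormal_step(momentum, grad, q)
```

Only the thin factors of shape `(m, rank)` and `(n, rank)` are needed to form the update, which is why a low-rank scheme of this kind can reduce both compute and communication relative to orthonormalizing the full matrix.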