DPO

Jun 16, 2026

The State of Frontier Post-Training Recipes | Conversation with Finbarr Timbers

This discussion with Finbarr Timbers reviews the evolution of frontier post-training recipes, highlighting the shift from simpler SFT-DPO-RL to complex multi-teacher on-policy distillation (MOPD). It covers the organizational challenges of building models like Olmo, the rise of synthetic data and reasoning-focused RL in DeepSeek, and the complexities of integrating expert teachers, while also exploring open questions on environments, specialized APIs, and career strategies in the rapidly changing AI landscape.

Jan 08, 2026

Post-training best-in-class models in 2025

An expert overview of post-training techniques for language models, covering the entire workflow from data generation and curation to advanced algorithms like Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning (RL), along with practical advice on evaluation and iteration.

Dec 10, 2025

How We Built a Leading Reasoning Model (Olmo 3)

A comprehensive overview of the entire process behind building Olmo 3 Think, covering the full stack from pre-training architecture and data selection to the detailed post-training recipe involving SFT, DPO, and a deep dive into the advanced infrastructure for scaling Reinforcement Learning (RL). The summary also includes critical reflections on the challenges and nuances of evaluating modern reasoning models.