DPO

The State of Frontier Post-Training Recipes | Conversation with Finbarr Timbers

This discussion with Finbarr Timbers reviews how frontier post-training recipes have evolved, highlighting the shift from the simpler SFT-DPO-RL pipeline to more complex multi-teacher on-policy distillation (MOPD). It covers the organizational challenges of building models like Olmo, the rise of synthetic data and reasoning-focused RL in DeepSeek, and the complexities of integrating expert teachers. It also explores open questions on environments, specialized APIs, and career strategies in the rapidly changing AI landscape.

Post-training best-in-class models in 2025

An expert overview of post-training techniques for language models, covering the entire workflow from data generation and curation to advanced algorithms like Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning (RL), along with practical advice on evaluation and iteration.

How We Built a Leading Reasoning Model (Olmo 3)

A comprehensive overview of the process behind building Olmo 3 Think, covering the full stack from pre-training architecture and data selection to the detailed post-training recipe of SFT and DPO, with a deep dive into the advanced infrastructure for scaling Reinforcement Learning (RL). It also includes critical reflections on the challenges and nuances of evaluating modern reasoning models.