Synthetic data

The State of Frontier Post-Training Recipes | Conversation with Finbarr Timbers

This discussion with Finbarr Timbers reviews how frontier post-training recipes have evolved from simpler SFT-DPO-RL pipelines to more complex multi-teacher on-policy distillation (MOPD). It covers the organizational challenges of building models like Olmo, the rise of synthetic data and reasoning-focused RL in DeepSeek, and the difficulty of integrating expert teachers. It also explores open questions on environments, specialized APIs, and career strategies in the rapidly changing AI landscape.

MagenticLite is here: A full-stack agentic experience powered by Small Models

Microsoft Research introduces MagenticLite, an agentic framework powered by two new small, open-weight models: Magentic Orchestrator for planning and coding, and Fara-1.5 for browser automation. The talk details the novel synthetic data generation techniques and training strategies used to achieve state-of-the-art performance in small models, enabling them to compete with much larger ones.

Memory in LLMs: Weights and Activations - Jack Morris, Cornell

This talk explores the limitations of current methods for providing knowledge to LLMs, such as large context windows and Retrieval-Augmented Generation (RAG). Jack Morris argues that the future lies in training knowledge directly into the model's weights. The proposed approach combines two pieces: generating large synthetic datasets from small amounts of source material, and using parameter-efficient fine-tuning (PEFT) techniques like LoRA to avoid catastrophic forgetting. The goal is to create more capable, personalized, and efficient models by fundamentally altering how they store and access information.

1X NEO humanoid robot enters the home

Experts analyze the 1X NEO humanoid robot's real-world viability and data challenges, delve into the complex copyright dispute between Japan's IP holders and OpenAI's Sora 2, and dissect the strategic implications of the new OpenAI and AWS partnership for AI infrastructure and multi-cloud strategies.

The Startup Powering The Data Behind AGI

Edwin Chen, founder and CEO of Surge AI, shares the company's origin story, its rapid, bootstrapped growth, and its research-driven philosophy on data. He critiques traditional data labeling, explains why metrics like inter-annotator agreement fail for complex tasks, and offers a sharp analysis of benchmark hacking. Chen also details the future of data, from multimodal and agentic reasoning in rich RL environments to the need for hyper-specialized expertise for scientific discovery.

How Grounded Synthetic Data is Saving the Publishing Industry // Robert Caulk

Robert Caulk of Emergent Methods discusses how grounded synthetic news data can solve the publisher revenue crisis in the AI era. He details how 'Context Engineering' turns news into token-optimized, objective data for high-stakes AI agent tasks, and covers the company's open-source models for entity extraction and bias mitigation as well as the on-premise infrastructure that protects publisher content.