Evals in Action: From Frontier Research to Production Applications
An overview of OpenAI's approach to AI evaluation, covering the GDPval benchmark for frontier models and the practical tools developers can use to evaluate their own agents and applications.