Interpretability: Understanding how AI models think
Members of Anthropic's interpretability team discuss their research into the inner workings of large language models. They explore the analogy of studying AI as a biological system, the surprising discovery of internal "features" or concepts, and why this research is critical for understanding model behaviors like hallucination, sycophancy, and long-term planning, with the ultimate aim of ensuring AI safety.