Interpretability

Ideas: Community building, machine learning, and the future of AI

Jenn Wortman Vaughan and Hanna Wallach, co-founders of the Women in Machine Learning (WiML) workshop, reflect on their intersecting careers, the founding and evolution of WiML over 20 years, and their influential research in responsible AI, from interpretability and fairness to the current challenges in generative AI.

Inside the AI Black Box

Emmanuel Ameisen of Anthropic's interpretability team explains the inner workings of LLMs, drawing analogies to biology. He covers surprising findings on how models plan and represent concepts across languages, as well as the mechanistic causes of hallucinations, and offers practical advice for developers on evaluation and post-training strategies.

Interpretability: Understanding how AI models think

Members of Anthropic's interpretability team discuss their research into the inner workings of large language models. They explore the analogy of studying AI as a biological system, the surprising discovery of internal "features" or concepts, and why this research is critical for understanding model behaviors such as hallucination, sycophancy, and long-term planning, with the ultimate aim of ensuring AI safety.

Mapping the Mind of a Neural Net: Goodfire’s Eric Ho on the Future of Interpretability

Eric Ho, founder of Goodfire, discusses the critical challenge of AI interpretability. He shares how his team is developing techniques to understand, audit, and edit neural networks at the feature level, including breakthrough results in resolving superposition with sparse autoencoders, successful model editing demonstrations, and real-world applications in genomics with Arc Institute's DNA foundation models. Ho argues that these white-box approaches are essential for building safe, reliable, and intentionally designed AI systems.