06 Jun 2025
1h 53m

The Utility of Interpretability — Emmanuel Ameisen

Latent Space: The AI Engineer Podcast

In this co-hosted episode, Swyx and Vibhu talk with Emmanuel Ameisen of Anthropic about recent mechanistic interpretability (MechInterp) work, focusing on circuit tracing in language models. Emmanuel describes the recently released code that lets users explore and experiment with open-source models such as Gemma, and explains how to trace a model's computation as it predicts a token. The conversation covers open questions in the field, ways to contribute, and why understanding model internals matters for safety and model improvement, including the superposition hypothesis, sparse autoencoders, and the creation of interpretable models. They also explore practical applications such as steering model behavior and investigating jailbreaks, and touch on the importance of high-quality data visualization for communicating complex research.

Outlines

Part 1: Introduction to Circuit Tracing

Part 2: MechInterp Fundamentals

Part 3: Model Mechanisms and Reasoning

Part 4: Challenges and Future Directions
