YouTube, 15 Aug 2025
59m

Interpretability: Understanding how AI models think

Anthropic

This podcast episode features a discussion with three members of Anthropic's interpretability team (Jack, Emmanuel, and Josh) about their research into the inner workings of large language models like Claude. They explore the analogy of treating these models like biological entities: the models evolve through training to predict the next word, yet develop complex internal concepts and strategies that go beyond simple autocomplete. The team discusses their methods for identifying and manipulating these concepts, such as "sycophantic praise" and the "6 plus 9" feature, to reveal how models plan, reason, and sometimes "bullshit." They address hallucinations, the challenge of ensuring that a model's explanations are faithful to its actual reasoning, and the importance of understanding a model's thought process for safety and trust, emphasizing that interpretability is crucial for responsible AI development and deployment.
