YouTube, 08 Sep 2025

We Can Monitor AI’s Thoughts… For Now | Google DeepMind's Neel Nanda

80,000 Hours

In this interview, Neel Nanda discusses mechanistic interpretability (MechInterp), a research project focused on understanding how AI models work internally, and its potential role in ensuring the safe development and deployment of artificial general intelligence (AGI). Nanda reflects on his evolving perspective, moving from idealistic ambition to optimistic pragmatism, acknowledging the complexities and messiness involved in fully understanding AI models. He emphasizes the importance of using the internals of a model to understand it, advocates for a portfolio of safety measures rather than relying on a single "silver bullet," and highlights the value of simple, cheap techniques like probes for monitoring model behavior. Nanda also addresses the limitations of MechInterp, including challenges in identifying deception and the potential for models to evolve beyond human understanding, while underscoring the need for continued investment and a task-focused approach to research.

Outlines

Part 1: Introduction and Definition

Part 2: Successes and Challenges

Part 3: Limitations and Broader View

Part 4: Chain of Thought

Part 5: Objections and Recursive Self-Improvement

Part 6: Sparse Autoencoders (SAEs)

Part 7: Hype, Diagnostics, and Research Philosophy

Part 8: Career Advice
