We Can Monitor AI’s Thoughts… For Now | Google DeepMind's Neel Nanda | 80,000 Hours

In this interview, Neel Nanda discusses mechanistic interpretability (MechInterp), a research project focused on understanding how AI models work internally, and its potential role in ensuring the safe development and deployment of artificial general intelligence (AGI). Nanda reflects on his evolving perspective, moving from idealistic ambition to optimistic pragmatism, acknowledging the complexities and messiness involved in fully understanding AI models. He emphasizes the importance of using the internals of a model to understand it, advocates for a portfolio of safety measures rather than relying on a single "silver bullet," and highlights the value of simple, cheap techniques like probes for monitoring model behavior. Nanda also addresses the limitations of MechInterp, including challenges in identifying deception and the potential for models to evolve beyond human understanding, while underscoring the need for continued investment and a task-focused approach to research.

Outline

Part 1: Introduction and Definition

Part 2: Successes and Challenges

Part 3: Limitations and Broader View

Part 4: Chain of Thought

Part 5: Objections and Recursive Self-Improvement

Part 6: Sparse Autoencoders (SAEs)

Part 7: Hype, Diagnostics, and Research Philosophy

Part 8: Career Advice

Part 1: Introduction and Definition

00:00 Introduction to Mechanistic Interpretability and its Role in AGI Safety

05:13 Defining Mechanistic Interpretability and its Necessity

09:50 Neel Nanda's Evolving Perspective on Mechanistic Interpretability

Part 2: Successes and Challenges

16:00 Successes in Mechanistic Interpretability: Auditing Games and Detecting Harmful Intentions

23:27 Probes, Linear Representations, and Advantages over Neuroscience

30:21 Challenges in Mechanistic Interpretability: Unexpected Concepts and Messiness

37:54 Structural Challenges and the Difficulty of Avoiding Self-Deception in Research

Part 3: Limitations and Broader View

44:40 The Limits of Interpretability in Finding Deceptive AI

50:13 Black Box Interpretability and the Broader View of Understanding AI Systems

55:08 Investigating Self-Preservation Behavior and the Importance of Moral Constraints

Part 4: Chain of Thought

1:02:26 Chain of Thought: Trustworthiness and Utility

1:11:15 The Future of Chain of Thought and Governance Implications

1:17:15 Models Detecting Evaluations and the Difficulty of Evading MechInterp

Part 5: Objections and Recursive Self-Improvement

1:23:07 Objections to MechInterp: Level of Analysis and Granularity

1:27:55 Recursive Self-Improvement and the Future of MechInterp

Part 6: Sparse Autoencoders (SAEs)

1:37:40 Introduction to Sparse Autoencoders (SAEs)

1:47:55 Limitations of Sparse Autoencoders and Deprioritizing SAE Research

1:55:57 Evaluating Sparse Autoencoders on Real-World Tasks

2:04:45 The Future of Sparse Autoencoders and the Importance of Task-Focused Research

Part 7: Hype, Diagnostics, and Research Philosophy

2:13:43 The Hype Around MechInterp and the Importance of Probes

2:22:13 Diagnostics vs. Control and the Role of Understanding

2:27:53 Neel Nanda's Research Philosophy: Simplicity, Downstream Tasks, and Skepticism

2:35:05 Downstream Tasks and the Importance of Objective Measurement

Part 8: Career Advice

2:43:09 Career Advice for Aspiring MechInterp Researchers

2:53:03 Skills and Resources for Getting Started in MechInterp

3:00:54 Staying Up-to-Date and Job Opportunities in MechInterp