Interpretability

The mission of the Interpretability team is to discover and understand how large language models work internally, as a foundation for AI safety and positive outcomes.


Safety through understanding

It's very challenging to reason about the safety of neural networks without understanding them. The Interpretability team’s goal is to be able to explain large language models’ behaviors in detail, and then use that to solve a variety of problems ranging from bias to misuse to autonomous harmful behavior.

Multidisciplinary approach

Some Interpretability researchers have deep backgrounds in machine learning – one member of the team is often described as having started mechanistic interpretability, while another was on the famous scaling laws paper. Other members joined after careers in astronomy, physics, mathematics, biology, data visualization, and more.

Natural Language Autoencoders: Turning Claude’s thoughts into text

Interpretability, May 7, 2026

AI models like Claude talk in words but think in numbers. In this study, we train Claude to translate its thoughts into human-readable text.

Interpretability, Apr 2, 2026

Emotion concepts and their function in a large language model

The assistant axis: situating and stabilizing the character of large language models

Signs of introspection in large language models

Persona vectors: Monitoring and controlling character traits in language models

Publications