Interpretability
The mission of the Interpretability team is to discover and understand how large language models work internally, as a foundation for AI safety and positive outcomes.
Safety through understanding
It is very difficult to reason about the safety of neural networks without understanding them. The Interpretability team’s goal is to be able to explain large language models’ behaviors in detail, and then use that understanding to address a variety of problems, ranging from bias to misuse to autonomous harmful behavior.
Multidisciplinary approach
Some Interpretability researchers have deep backgrounds in machine learning: one member of the team is often described as having started mechanistic interpretability, while another co-authored the famous scaling laws paper. Other members joined after careers in astronomy, physics, mathematics, biology, data visualization, and more.
Emotion concepts and their function in a large language model
All modern language models sometimes act like they have emotions. What’s behind these behaviors? Our Interpretability team investigates.
The assistant axis: situating and stabilizing the character of large language models
Who is the Assistant? We investigate the character that most modern language models inhabit when interacting with users.
Signs of introspection in large language models
Can Claude access and report on its own internal states? This research finds evidence for a limited but functional ability to introspect.
Persona vectors: Monitoring and controlling character traits in language models
AI models represent character traits as patterns of activations within their neural networks. By extracting "persona vectors" for traits like sycophancy or hallucination, we can monitor personality shifts and mitigate undesirable behaviors.
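As a rough illustration of the idea, the sketch below assumes a persona vector can be approximated as the difference between the mean activations of responses that show a trait and responses that do not. The function names (`persona_vector`, `trait_score`, `steer_away`), the synthetic activations, and the difference-of-means choice are all assumptions made for this example, not the method or API used in the research itself.

```python
import numpy as np


def persona_vector(trait_acts, baseline_acts):
    """Hypothetical helper: unit-length difference of mean activations."""
    direction = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def trait_score(activations, vector):
    """Monitoring: project activations onto the persona vector."""
    return activations @ vector


def steer_away(activations, vector, strength=1.0):
    """Mitigation: subtract a multiple of the persona vector."""
    return activations - strength * vector


# Synthetic stand-in for real model activations (assumption: 16-dim hidden
# state, with the trait shifting activations along the first coordinate).
rng = np.random.default_rng(0)
hidden_dim = 16
true_direction = np.zeros(hidden_dim)
true_direction[0] = 1.0

baseline_acts = rng.normal(size=(200, hidden_dim))
trait_acts = rng.normal(size=(200, hidden_dim)) + 3.0 * true_direction

vec = persona_vector(trait_acts, baseline_acts)

# Trait-exhibiting samples should score higher along the extracted vector.
mean_trait = trait_score(trait_acts, vec).mean()
mean_baseline = trait_score(baseline_acts, vec).mean()

# Steering lowers the score by exactly `strength`, because vec has unit norm.
steered = steer_away(trait_acts, vec, strength=2.0)
mean_steered = trait_score(steered, vec).mean()
```

In this toy setup the projection acts as a monitor (a rising score suggests a shift toward the trait), and subtracting the vector acts as a simple mitigation. A real system would read activations from a specific layer of a model rather than from random data.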
Publications
- Natural Language Autoencoders: Turning Claude’s thoughts into text
- Emotion concepts and their function in a large language model
- A “diff” tool for AI: Finding behavioral differences in new models
- The assistant axis: situating and stabilizing the character of large language models
- Signs of introspection in large language models
- Persona vectors: Monitoring and controlling character traits in language models
- Open-sourcing circuit tracing tools
- Tracing the thoughts of a large language model
- Auditing language models for hidden objectives
- Insights on Crosscoder Model Diffing