Alignment
Future AI systems will be even more powerful than today’s, likely in ways that break key assumptions behind current safety techniques. That’s why it’s important to develop sophisticated safeguards to ensure models remain helpful, honest, and harmless. The Alignment team works to understand the challenges ahead and create protocols to train, evaluate, and monitor highly capable models safely.
Evaluation and oversight
Alignment researchers validate that models are harmless and honest even under circumstances very different from those under which they were trained. They also develop methods that allow humans to collaborate with language models to verify claims that humans might not be able to verify on their own.
Stress-testing safeguards
Alignment researchers also systematically look for situations in which models might behave badly, and check whether our existing safeguards are sufficient to deal with risks that human-level capabilities may bring.
Automated Alignment Researchers: Using large language models to scale scalable oversight
Can Claude develop, test, and analyze alignment ideas of its own? We ran an experiment to find out.
The persona selection model
Why do AI assistants like Claude sometimes seem surprisingly human? We advance a theory.
Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks
Last year, we described a new approach to defend against jailbreaks, which we called Constitutional Classifiers. We’ve now developed the next generation.
From shortcuts to sabotage: natural emergent misalignment from reward hacking
We show for the first time that realistic AI training processes can accidentally produce misaligned models.
Publications
- Teaching Claude why
- Donating our open-source alignment tool
- Automated Alignment Researchers: Using large language models to scale scalable oversight
- An update on our model deprecation commitments for Claude Opus 3
- The persona selection model
- How AI assistance impacts the formation of coding skills
- Disempowerment patterns in real-world AI usage
- Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks
- Introducing Bloom: an open source tool for automated behavioral evaluations
- From shortcuts to sabotage: natural emergent misalignment from reward hacking