Alignment

Future AI systems will be even more powerful than today’s, likely in ways that break key assumptions behind current safety techniques. That’s why it’s important to develop sophisticated safeguards to ensure models remain helpful, honest, and harmless. The Alignment team works to understand the challenges ahead and create protocols to train, evaluate, and monitor highly capable models safely.

Research teams: Alignment, Economic Research, Interpretability, Societal Impacts

Evaluation and oversight

Alignment researchers validate that models remain harmless and honest even under circumstances very different from those in which they were trained. They also develop methods that let humans collaborate with language models to verify claims that humans might not be able to verify on their own.

Stress-testing safeguards

Alignment researchers also systematically look for situations in which models might behave badly, and check whether our existing safeguards are sufficient to deal with risks that human-level capabilities may bring.

Teaching Claude why

Alignment, May 8, 2026

New research on how we've reduced agentic misalignment.

Alignment, Apr 14, 2026

Automated Alignment Researchers: Using large language models to scale scalable oversight

The persona selection model

Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks

From shortcuts to sabotage: natural emergent misalignment from reward hacking

Publications