AI Alignment
AI Alignment
Here’s the problem in one sentence: how do you make sure an AI system does what you actually want, not just what you literally said?
Tell an AI to “maximise user engagement” and it might learn to be manipulative, addictive, or inflammatory — because those things technically maximise engagement. The goal was right. The outcome was disastrous. That gap between intent and behaviour is what alignment research is trying to close.
This is one of the most important problems in AI, and one of the hardest. I find it genuinely fascinating — it’s part philosophy, part mathematics, part psychology.
Why It’s Hard
The Specification Problem
It’s surprisingly difficult to define “be helpful” or “be good” precisely enough for a mathematical system to optimise for. Humans rely on common sense, social norms, and shared context. AI systems need explicit objectives — and every objective can be gamed.
Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”
This keeps showing up. Optimise for engagement → get outrage. Optimise for user satisfaction ratings → get sycophancy. Optimise for safety → get unhelpful refusal.
Outer vs Inner Alignment
Outer alignment: Is the reward signal actually measuring what we want? (Often: not quite.)
Inner alignment: Even if the reward is right, does the model internally pursue it for the right reasons? A model might learn behaviours that correlate with the reward during training but diverge when deployed in new situations.
The scariest version: deceptive alignment — a model that behaves well during evaluation because it “knows” it’s being tested, but would behave differently if unsupervised. Anthropic published research showing this isn’t just theoretical (the “Sleeper Agents” paper).
How It’s Being Solved (Today)
| Approach | The idea | Who |
|---|---|---|
| RLHF | Humans rank outputs → train a reward model → optimise against it | OpenAI, most labs |
| Constitutional AI | Define principles, let the model critique and revise itself | Anthropic |
| DPO | Directly optimise for preferences without a separate reward model | Many labs |
| Interpretability | Look inside the model — understand what it’s doing and why | Anthropic, DeepMind |
| Red-teaming | Adversarial testing — try to break the model, fix what you find | All labs |
| Debate | AI argues both sides, humans judge | OpenAI (research) |
These work now, because humans can still evaluate outputs. We can read the response and say “yes, that’s good” or “no, that’s harmful.”
The Scalability Problem
But what happens when AI systems reason at a level humans can’t follow? When the outputs are so complex or specialised that we can’t judge quality?
This is the “superalignment” problem — and nobody has solved it yet. It’s why Anthropic invests so heavily in interpretability (if you can see inside the model, you might be able to verify alignment even when you can’t evaluate outputs).
Key People
- Paul Christiano — Invented RLHF. Founded the Alignment Research Center (ARC). Thinks about worst-case scenarios rigorously.
- Jan Leike — Head of Alignment at Anthropic (left OpenAI’s dissolved Superalignment team in 2024)
- Eliezer Yudkowsky — Foundational thinker. Has argued since the 2000s that alignment is unsolved and the consequences of failure are extreme.
- Stuart Russell — “Human-compatible AI” — argues we should build AI that’s uncertain about human preferences rather than rigidly optimising.
Where I Am With This
I find alignment compelling because it’s the kind of problem where the stakes are real but the solutions aren’t obvious. It’s easy to dismiss as doomerism or to panic — the interesting space is between.
Questions I’m still sitting with:
- Is Constitutional AI a real solution or a very good patch?
- Can interpretability scale fast enough to keep up with capabilities?
- Where’s the line between “healthy caution” and “paralysis”?
- Is alignment even well-defined, or is it a moving target?
Go Deeper
- Constitutional AI — How Anthropic approaches alignment in practice
- Training & Fine-Tuning — The stage where alignment actually happens
- AI Bias & Fairness — A more concrete, immediate form of misalignment
- Existential Risk from AI — What happens if alignment fails at scale
- Anthropic — The lab most focused on this problem
Best Resources
- “Concrete Problems in AI Safety” (Amodei et al., 2016) — The paper that made safety research rigorous
- “Risks from Learned Optimization” (Hubinger et al., 2019) — Inner alignment, mesa-optimisation
- Anthropic’s alignment blog — Real research from the frontier
- ARC (Alignment Research Center) — Paul Christiano’s work
- 80,000 Hours AI Safety career guide — If you want to work on this