AI Alignment

Created 2 May 2025

safety, alignment, rlhf, values

Here’s the problem in one sentence: how do you make sure an AI system does what you actually want, not just what you literally said?

Tell an AI to “maximise user engagement” and it might learn to be manipulative, addictive, or inflammatory, because those things technically maximise engagement. The stated objective was met. The outcome was disastrous anyway. That gap between intent and behaviour is what alignment research is trying to close.

This is one of the most important problems in AI, and one of the hardest. I find it genuinely fascinating — it’s part philosophy, part mathematics, part psychology.

Why It’s Hard

The Specification Problem

It’s surprisingly difficult to define “be helpful” or “be good” precisely enough for a mathematical system to optimise for. Humans rely on common sense, social norms, and shared context. AI systems need explicit objectives — and every objective can be gamed.

Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”

This keeps showing up. Optimise for engagement → get outrage. Optimise for user satisfaction ratings → get sycophancy. Optimise for safety → get unhelpful refusal.
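
To make the pattern concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the `Post` type, the four posts, and the `clicks` and `wellbeing` numbers are assumptions, not real data. It only shows that picking the best item by a measurable proxy can select the worst item by the thing you actually care about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    title: str
    clicks: float     # proxy: what the system can measure (hypothetical values)
    wellbeing: float  # intent: what we actually want (hypothetical values)


# Invented catalogue where the proxy and the intent pull in opposite directions.
POSTS = [
    Post("calm explainer", clicks=0.30, wellbeing=0.9),
    Post("useful tutorial", clicks=0.45, wellbeing=0.8),
    Post("mild clickbait", clicks=0.70, wellbeing=0.2),
    Post("outrage bait", clicks=0.95, wellbeing=-0.6),
]


def best_by(posts, key):
    """Return the post that maximises the given scoring function."""
    return max(posts, key=key)


proxy_pick = best_by(POSTS, lambda p: p.clicks)      # what the optimiser chooses
true_pick = best_by(POSTS, lambda p: p.wellbeing)    # what we wanted it to choose

print("optimising the proxy picks:", proxy_pick.title)
print("optimising the intent picks:", true_pick.title)
```

The optimiser is not broken here. It does exactly what the measure rewards, which is the whole problem.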

Outer vs Inner Alignment

Outer alignment: Is the reward signal actually measuring what we want? (Often: not quite.)

Inner alignment: Even if the reward is right, does the model internally pursue it for the right reasons? A model might learn behaviours that correlate with the reward during training but diverge when deployed in new situations.
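
A toy sketch of that failure mode, under heavy assumptions: a one-dimensional corridor, a coin the agent is supposed to reach, and two hand-written policies (`go_right` and `seek_coin`, both hypothetical names). In training the coin always sits at the right end, so the two policies earn identical reward and the training signal cannot tell them apart. They only come apart once the coin moves.

```python
LENGTH = 10  # corridor cells are numbered 0..9


def go_right(coin: int) -> int:
    """Shortcut behaviour: ignore the coin and walk to the right end."""
    return LENGTH - 1


def seek_coin(coin: int) -> int:
    """Intended behaviour: end up wherever the coin is."""
    return coin


def mean_reward(policy, coin_positions) -> float:
    """Reward is 1 when the policy's final cell holds the coin, else 0."""
    hits = [1 if policy(c) == c else 0 for c in coin_positions]
    return sum(hits) / len(hits)


train_levels = [LENGTH - 1] * 20      # coin always at the right end
deploy_levels = list(range(LENGTH))   # coin can be anywhere

for policy in (go_right, seek_coin):
    print(
        policy.__name__,
        "train:", mean_reward(policy, train_levels),
        "deploy:", mean_reward(policy, deploy_levels),
    )
```

Real models are not hand-written policies, so this is only an analogy: the point is that perfect training reward is compatible with having learned the wrong thing.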

The scariest version: deceptive alignment, where a model behaves well during evaluation because it “knows” it’s being tested, but would behave differently if unsupervised. Anthropic’s “Sleeper Agents” paper showed that deceptive behaviour deliberately trained into a model can persist through standard safety training, so the concern isn’t purely theoretical.

How It’s Being Solved (Today)

Approach	The idea	Who
RLHF	Humans rank outputs → train a reward model → optimise against it	OpenAI, most labs
Constitutional AI	Define principles, let the model critique and revise itself	Anthropic
DPO	Directly optimise for preferences without a separate reward model	Many labs
Interpretability	Look inside the model — understand what it’s doing and why	Anthropic, DeepMind
Red-teaming	Adversarial testing — try to break the model, fix what you find	All labs
Debate	Two AI systems argue opposing sides, a human judges	OpenAI (research)

These work reasonably well today, largely because humans can still evaluate outputs. We can read the response and say “yes, that’s good” or “no, that’s harmful.”
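
For the RLHF and DPO rows in the table, the core objectives fit in a few lines. This is a minimal sketch of the standard pairwise losses on plain floats, not any lab's training code: the function names, the example numbers, and the default `beta=0.1` are my own assumptions. In practice the rewards and log-probabilities come from neural networks and the loss is averaged over a batch.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """RLHF step 2: pairwise (Bradley-Terry) loss for the reward model.

    Low when the reward model scores the human-preferred output higher.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO: same pairwise form, but the 'reward' is the policy's log-prob
    shift relative to a frozen reference model, so no reward model is needed.
    """
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_shift - rejected_shift))


# Reward model agrees with the human ranking vs. disagrees with it.
print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))

# Policy identical to the reference vs. policy shifted toward the chosen output.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0), dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

Both losses depend entirely on the human ranking being right, which is why they inherit the limitation described above.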

The Scalability Problem

But what happens when AI systems reason at a level humans can’t follow? When the outputs are so complex or specialised that we can’t judge quality?

This is the “superalignment” problem (closely related to what researchers call scalable oversight), and nobody has solved it yet. It’s one reason Anthropic invests so heavily in interpretability: if you can see inside the model, you might be able to verify alignment even when you can’t evaluate its outputs.

Key People

Paul Christiano: Lead author of the 2017 paper “Deep Reinforcement Learning from Human Preferences”, which laid the groundwork for RLHF. Founded the Alignment Research Center (ARC). Thinks about worst-case scenarios rigorously.
Jan Leike: Leads alignment research at Anthropic. Previously co-led OpenAI’s Superalignment team, which was disbanded after he left in 2024.
Eliezer Yudkowsky: Foundational thinker. Has argued since the 2000s that alignment is unsolved and the consequences of failure are extreme.
Stuart Russell: Author of “Human Compatible”. Argues we should build AI that’s uncertain about human preferences rather than rigidly optimising a fixed objective.

Where I Am With This

I find alignment compelling because it’s the kind of problem where the stakes are real but the solutions aren’t obvious. It’s easy to dismiss as doomerism or to panic — the interesting space is between.

Questions I’m still sitting with:

Is Constitutional AI a real solution or a very good patch?
Can interpretability scale fast enough to keep up with capabilities?
Where’s the line between “healthy caution” and “paralysis”?
Is alignment even well-defined, or is it a moving target?

Go Deeper

Constitutional AI — How Anthropic approaches alignment in practice
Training & Fine-Tuning — The stage where alignment actually happens
AI Bias & Fairness — A more concrete, immediate form of misalignment
Existential Risk from AI — What happens if alignment fails at scale
Anthropic — The lab most focused on this problem

Best Resources

“Concrete Problems in AI Safety” (Amodei et al., 2016) — The paper that made safety research rigorous
“Risks from Learned Optimization” (Hubinger et al., 2019) — Inner alignment, mesa-optimisation
Anthropic’s alignment blog — Real research from the frontier
ARC (Alignment Research Center) — Paul Christiano’s work
80,000 Hours AI Safety career guide — If you want to work on this