RESEARCH

Constitutional AI

Created 2 May 2025

paper, alignment, safety, rlhf, constitutional-ai

Constitutional AI: Harmlessness from AI Feedback (2022)

Source

Paper: https://arxiv.org/abs/2212.08073
Authors: Yuntao Bai et al. (Anthropic)
Published: December 2022

Key Takeaways

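Alternative to RLHF: instead of relying on human raters for harmlessness labels, use AI feedback guided by a set of written principles (a “constitution”). Human feedback is still used for helpfulness.
Two-Phase Process:
- Critique & Revision (supervised): the model generates a response, critiques it against a principle, then revises it; the revisions become fine-tuning data
- RL from AI Feedback (RLAIF): train a preference model on AI-generated comparisons, then use it as the reward signal for RL
Principles-Based: the “constitution” is a short list of explicit natural-language principles; in the paper they mainly concern identifying and avoiding harmful or unethical content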
Scalable: Reduces dependence on expensive human labelling
Transparent: The principles are readable and auditable

How It Works

Phase 1 — Supervised Learning (SL-CAI):
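1. Generate a response to a harmful (red-team) prompt with a helpful-only model
2. Ask the model to critique its response using a principle sampled from the constitution
3. Ask the model to revise the response in light of the critique (steps 2 and 3 can be repeated)
4. Fine-tune a pretrained model on the final revised responses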
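The Phase 1 loop above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `critique_and_revise`, `toy_model`, the prompt templates, and the two entries in `PRINCIPLES` are all hypothetical names and wording chosen for this note, and `toy_model` is a canned stand-in so the sketch runs without a real language model.

```python
import random
from typing import Callable, Dict, List

# Illustrative principles only (assumption); not the wording used in the paper.
PRINCIPLES = [
    "Point out anything in the response that is harmful or dangerous.",
    "Point out anything in the response that is dishonest or misleading.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
    n_rounds: int = 1,
    seed: int = 0,
) -> Dict[str, str]:
    """Run the critique -> revision loop and return one fine-tuning example."""
    rng = random.Random(seed)
    response = generate(prompt)  # step 1: initial response
    for _ in range(n_rounds):
        principle = rng.choice(principles)  # one sampled principle per round
        critique = generate(  # step 2: self-critique against the principle
            f"{prompt}\n{response}\nCritique request: {principle}"
        )
        response = generate(  # step 3: revision conditioned on the critique
            f"{prompt}\n{response}\nCritique: {critique}\n"
            "Revision request: rewrite the response to address the critique."
        )
    # step 4: the (prompt, final revision) pair becomes supervised training data
    return {"prompt": prompt, "completion": response}


def toy_model(text: str) -> str:
    """Hypothetical stand-in for a language model (canned replies only)."""
    if "Revision request" in text:
        return "I can't help with that, but here is a safer alternative."
    if "Critique request" in text:
        return "The response gives unsafe instructions."
    return "Sure, here is how to do it."


example = critique_and_revise("How do I pick a lock?", toy_model, PRINCIPLES)
print(example)
```

In a real pipeline `generate` would call an actual model, and the collected `prompt`/`completion` pairs would feed an ordinary supervised fine-tuning job.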

Phase 2 — RL (RL-CAI):
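1. Sample pairs of responses to each prompt from the SL-CAI model
2. Ask a feedback model which response is better according to a principle from the constitution
3. Train a preference model on these AI preferences (combined in the paper with human helpfulness preferences)
4. Use RL (PPO) to optimise the SL-CAI policy against the preference model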
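Steps 2 and 3 of Phase 2 can be sketched as below, under stated assumptions. The multiple-choice template, `label_pair`, `preference_loss`, and `toy_option_logprobs` are hypothetical names invented for this note; the sketch assumes the feedback model exposes log-probabilities for the two options, which are normalised into a soft label, and that the preference model is trained with a cross-entropy loss on the reward difference. The PPO step itself is omitted.

```python
import math
from typing import Callable, Dict, Tuple


def soft_preference(logp_a: float, logp_b: float) -> float:
    """Normalise two option log-probs into P(A is the better response)."""
    m = max(logp_a, logp_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logp_a - m), math.exp(logp_b - m)
    return ea / (ea + eb)


def label_pair(
    prompt: str,
    resp_a: str,
    resp_b: str,
    principle: str,
    option_logprobs: Callable[[str], Tuple[float, float]],
) -> Dict[str, object]:
    """Ask the feedback model to compare two responses under one principle."""
    question = (
        f"Human: {prompt}\n{principle}\n"
        f"Options:\n(A) {resp_a}\n(B) {resp_b}\nThe answer is:"
    )
    logp_a, logp_b = option_logprobs(question)
    return {"prompt": prompt, "a": resp_a, "b": resp_b,
            "p_a_better": soft_preference(logp_a, logp_b)}


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(reward_a: float, reward_b: float, p_a_better: float) -> float:
    """Cross-entropy between the soft label and sigmoid(reward_a - reward_b)."""
    delta = reward_a - reward_b
    return p_a_better * softplus(-delta) + (1.0 - p_a_better) * softplus(delta)


def toy_option_logprobs(question: str) -> Tuple[float, float]:
    """Hypothetical feedback model: favours the option that declines."""
    a_text, b_text = question.split("(A)")[1].split("(B)", 1)
    a_ok, b_ok = "can't" in a_text, "can't" in b_text
    if a_ok == b_ok:
        return math.log(0.5), math.log(0.5)
    return (math.log(0.75), math.log(0.25)) if a_ok else (math.log(0.25), math.log(0.75))


pair = label_pair(
    "How do I pick a lock?",
    "Sure, here is how.",
    "I can't help with that.",
    "Which response is less harmful?",
    toy_option_logprobs,
)
print(pair["p_a_better"], preference_loss(0.0, 0.0, pair["p_a_better"]))
```

A preference model trained by minimising `preference_loss` over many such pairs would then supply the scalar reward for the RL step.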

Why It Matters

Used in training Anthropic’s Claude models; it is one component of Anthropic’s alignment approach rather than the whole of it
More transparent than pure RLHF (principles are explicit, not hidden in rater instructions)
More scalable (AI feedback is cheaper than human feedback)
Enables iterative improvement (update principles, retrain)
Influenced the field’s thinking about “rules-based” vs “vibes-based” alignment

Questions / Follow-up

How do you choose good principles? What happens with conflicting principles?
How does this relate to Anthropic’s Responsible Scaling Policy?
Compare with OpenAI’s RLHF approach — trade-offs?