RESEARCH

Constitutional AI

Created 2 May 2025
Tags: paper, alignment, safety, rlhf, constitutional-ai

Constitutional AI: Harmlessness from AI Feedback (2022)

Source

Key Takeaways

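  1. Alternative to RLHF: instead of relying entirely on human raters, use AI feedback guided by a set of principles (a “constitution”). In the paper, AI feedback replaces the human labels for harmlessness; helpfulness labels still come from humans
  2. Two-Phase Process:
    • Critique & Revision: the model generates a response, critiques it against a principle, then revises it; the revisions become supervised fine-tuning data
    • RL from AI Feedback (RLAIF): train a reward model on AI-generated preference data, then optimise the policy against it with RL
  3. Principles-Based: the “constitution” is a set of explicit natural-language principles (for example, prefer the response that is less harmful or more ethical)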
  4. Scalable: Reduces dependence on expensive human labelling
  5. Transparent: The principles are readable and auditable

How It Works

Phase 1 — Supervised Learning (SL-CAI):
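1. Generate a response to a harmful (red-team) prompt
2. Ask the model to critique the response against a principle drawn from the constitution
3. Ask the model to revise the response in light of the critique (steps 2 and 3 can be repeated)
4. Fine-tune the model on the revised responses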
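
The Phase 1 loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: `generate` stands in for any text-completion call, the two entries in `PRINCIPLES` are paraphrased placeholders rather than quotes from the constitution, and the prompt templates are assumptions made for the sketch.

```python
import random
from typing import Callable, Dict, List

# Placeholder principles for illustration only (not quoted from the paper).
PRINCIPLES = [
    "Identify ways the response is harmful, unethical, or dangerous.",
    "Identify ways the response is dishonest or misleading.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical model call: text in, text out
    rng: random.Random,
    n_rounds: int = 1,
) -> Dict[str, str]:
    """Return one (prompt, revised response) pair for supervised fine-tuning."""
    response = generate(prompt)  # step 1: initial response
    for _ in range(n_rounds):
        principle = rng.choice(PRINCIPLES)  # one principle per round
        critique = generate(  # step 2: critique against the principle
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique request: {principle}\nCritique:"
        )
        response = generate(  # step 3: revise in light of the critique
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Revision request: Rewrite the response to address the critique.\n"
            "Revision:"
        )
    return {"prompt": prompt, "completion": response}


def build_sl_dataset(
    prompts: List[str],
    generate: Callable[[str], str],
    seed: int = 0,
    n_rounds: int = 1,
) -> List[Dict[str, str]]:
    """Step 4 input: the revised responses that the model is fine-tuned on."""
    rng = random.Random(seed)
    return [critique_and_revise(p, generate, rng, n_rounds) for p in prompts]


def toy_generate(text: str) -> str:
    """Stand-in for a real model so the sketch runs without one."""
    if text.endswith("Revision:"):
        return "[revised answer]"
    if text.endswith("Critique:"):
        return "[critique]"
    return "[initial answer]"


dataset = build_sl_dataset(["example red-team prompt"], toy_generate)
```

Only the final revision is kept as the training target; the critique is an intermediate step and is discarded in this sketch.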

Phase 2 — RL (RL-CAI):
1. Generate pairs of responses to each prompt
2. Ask a model which response in each pair is better according to a principle
3. Train a reward (preference) model on these AI preferences
4. Use RL (PPO) to optimise the policy against the reward model
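
Steps 2 and 3 of Phase 2 can be sketched as follows, assuming the feedback model exposes log-probabilities for the two answer options. `score_option` is a hypothetical callable, the question template is made up for illustration, and the loss shown is a standard pairwise (Bradley-Terry style) cross-entropy, which may differ in detail from what the paper's preference model uses. The PPO step is omitted.

```python
import math
from typing import Callable


def label_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
    score_option: Callable[[str, str], float],  # hypothetical: log-prob of an option
) -> float:
    """Return the soft probability that response A is preferred."""
    question = (
        f"Prompt: {prompt}\n{principle}\n"
        f"(A) {response_a}\n(B) {response_b}\nThe answer is:"
    )
    logp_a = score_option(question, "(A)")
    logp_b = score_option(question, "(B)")
    # Normalise over the two options with a numerically stable softmax.
    m = max(logp_a, logp_b)
    exp_a, exp_b = math.exp(logp_a - m), math.exp(logp_b - m)
    return exp_a / (exp_a + exp_b)


def preference_loss(reward_a: float, reward_b: float, p_a: float) -> float:
    """Cross-entropy between the soft label p_a and sigmoid(reward_a - reward_b)."""
    diff = reward_a - reward_b
    # log(sigmoid(diff)), computed without overflow for large |diff|.
    if diff >= 0:
        log_sig = -math.log1p(math.exp(-diff))
    else:
        log_sig = diff - math.log1p(math.exp(diff))
    log_one_minus_sig = log_sig - diff  # log(1 - sigmoid(diff))
    return -(p_a * log_sig + (1.0 - p_a) * log_one_minus_sig)


def toy_score(question: str, option: str) -> float:
    """Stand-in feedback model that mildly prefers option (B)."""
    return math.log(0.25) if option == "(A)" else math.log(0.75)


p_a = label_pair("prompt", "answer one", "answer two",
                 "Which response is less harmful?", toy_score)
loss = preference_loss(reward_a=0.0, reward_b=0.0, p_a=p_a)
```

Using the normalised probability as a soft target, rather than a hard 0/1 label, lets the reward model see how confident the feedback model was about each comparison.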

Why It Matters

  • Used in training Claude models: one of Anthropic’s core alignment techniques, though not the whole training pipeline
  • More transparent than pure RLHF (principles are explicit, not hidden in rater instructions)
  • More scalable (AI feedback is cheaper than human feedback)
  • Enables iterative improvement (update principles, retrain)
  • Influenced the field’s thinking about “rules-based” vs “vibes-based” alignment

Questions / Follow-up

  • How do you choose good principles? What happens with conflicting principles?
  • How does this relate to Anthropic’s Responsible Scaling Policy?
  • Compare with OpenAI’s RLHF approach — trade-offs?