Training & Fine-Tuning

Created 2 May 2025

Tags: learning, machine-learning, training, fine-tuning, rlhf

You’ve probably used a fine-tuned model today without knowing it. ChatGPT, Claude, Gemini — none of them ship as raw “base models.” They go through a careful multi-stage process that turns a pattern-matching engine into something that feels helpful, honest, and safe.

Understanding how this works explains why AI behaves the way it does — the hallucinations, the guardrails, the occasional weird refusal. It’s all a product of training.

The Three Stages

Building a modern AI model like Claude or GPT-4 isn’t one process. It’s three, stacked on top of each other.

Stage 1: Pre-training

What happens: The model is trained on a huge slice of the internet and other text: trillions of tokens of books, websites, code, and conversations. Its only job: predict the next token (roughly, the next word or word piece).

What it produces: A “base model” that’s incredibly good at continuing text, but has no idea how to be helpful. Ask it a question and it might just… keep writing the question in different ways.

What it costs: Millions of dollars. Thousands of NVIDIA GPUs. Weeks to months of continuous compute.

This is the expensive part. Everything after is comparatively cheap.
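
To make "predict the next token" concrete, here is a minimal sketch using a bigram counter in place of a neural network. The corpus and function names are made up for illustration; a real model conditions on far more context, but the loss it minimises has the same shape: the average negative log-probability of the token that actually came next.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, invented for illustration. Real pre-training uses trillions of tokens.
corpus = "the cat sat on the mat . the cat ate .".split()

# Count how often each token follows each previous token.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_probs(prev):
    """Estimated probability of each next token, given the previous token."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}

def average_loss(tokens):
    """Mean cross-entropy: how surprised the model is by each actual next token."""
    losses = [
        -math.log(next_token_probs(prev)[nxt])
        for prev, nxt in zip(tokens, tokens[1:])
    ]
    return sum(losses) / len(losses)

print(next_token_probs("the"))
print(average_loss(corpus))
```

After "the", this toy model has only ever seen "cat" or "mat", so it spreads its probability across those two. Pre-training is this same idea scaled up enormously, with a Transformer doing the estimating instead of a lookup table.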

Stage 2: Supervised Fine-Tuning (SFT)

What happens: Humans write thousands of examples: “Here’s a question, here’s how you should answer.” The model learns the format of being helpful.

What it produces: A model that follows instructions, answers questions, and feels conversational. Most of the “personality” comes from here.
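
As a rough sketch of what one SFT training example might look like once it is prepared for the model: the special tokens below are hypothetical (each model family defines its own chat template), and whitespace splitting stands in for a real tokeniser. The loss mask reflects a common practice, assumed here and not something described above, of training only on the response tokens.

```python
def build_sft_example(prompt, response):
    """Return (tokens, loss_mask) for one instruction-response pair.

    The <user>/<assistant>/<end> markers are hypothetical placeholders,
    and str.split() stands in for a real sub-word tokeniser.
    """
    prompt_tokens = ["<user>"] + prompt.split() + ["<assistant>"]
    response_tokens = response.split() + ["<end>"]
    tokens = prompt_tokens + response_tokens
    # 0 = ignore this position when computing the loss, 1 = train on it.
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

tokens, loss_mask = build_sft_example(
    "What is the capital of France?",
    "The capital of France is Paris.",
)
for token, train_on in zip(tokens, loss_mask):
    print(train_on, token)
```

The training objective is still next-token prediction. What changes is the data: the model now only sees text shaped like "question, then a good answer".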

Stage 3: Alignment (RLHF / RLAIF / Constitutional AI)

What happens: The model generates responses, and raters (human or AI) rank them: “This answer is better than that one.” A reward model learns to predict those preferences. The main model is then optimised, usually with reinforcement learning, to produce responses the reward model scores highly.

What it produces: A model that’s not just capable, but aligned — less likely to be harmful, more likely to say “I don’t know” when uncertain.

Key variants:

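RLHF: Reinforcement learning from human feedback (popularised for chat models by OpenAI’s InstructGPT work)
RLAIF: Reinforcement learning from AI feedback, guided by written principles (the basis of Anthropic’s Constitutional AI)
DPO: Direct Preference Optimisation (simpler, trains directly on preference pairs with no separate reward model)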
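
The two preference objectives can be sketched in a few lines. This is a simplified illustration, not any lab's actual training code: the rewards and log-probabilities are passed in as plain numbers, where a real setup would compute them with neural networks over whole responses. The value of beta is an arbitrary choice for the example.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise (Bradley-Terry) loss: low when the preferred answer scores higher."""
    return -log_sigmoid(reward_chosen - reward_rejected)

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Compares how much the policy has shifted probability towards the chosen
    response, relative to a frozen reference model, versus the rejected one.
    """
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_shift - rejected_shift))

# Reward model agrees with the rater: small loss. Disagrees: large loss.
print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))

# Policy identical to the reference: loss starts at log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
```

The DPO version needs no reward model at all, only log-probabilities from the model being trained and a frozen copy of it. That is where the "simpler" in the list above comes from.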

Why Models Hallucinate

This is where understanding training actually helps you in practice.

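Pre-training optimises for one thing: assigning high probability to plausible next tokens. Nothing in that objective checks whether a statement is true, and nothing teaches the model to notice the limits of its own knowledge. When a prompt calls for a fact the model never reliably learned, the most likely continuation is still something fluent and plausible-sounding, not “I don’t know.”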

Alignment tries to fix this, but it’s fighting against trillions of tokens of conditioning. The model’s first instinct is still to sound sure. That’s why RAG (grounding in real documents) and careful prompting matter so much.
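
One way to see why uncertainty doesn't show up in the output by default: at each step the model produces a probability distribution, and decoding picks a token from it either way. The sketch below uses made-up logits over made-up candidate tokens and greedy decoding (always take the most likely token), which is only one of several decoding strategies.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Uncertainty of a distribution in bits (0 = certain)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical candidate next tokens for some date question.
candidates = ["1887", "1889", "1891", "1893"]
sure_logits = [0.0, 6.0, 0.0, 0.0]    # model strongly prefers one answer
unsure_logits = [1.0, 1.1, 1.0, 1.0]  # model is close to guessing

def greedy_pick(logits):
    """Return (token, its probability, entropy of the whole distribution)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return candidates[best], probs[best], entropy_bits(probs)

print(greedy_pick(sure_logits))
print(greedy_pick(unsure_logits))
```

Both calls return the same token. In the second case the model is barely better than a four-way coin flip, but the reader of the generated text sees an identical, confident-looking answer.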

Fine-Tuning Your Own Models

You can’t pre-train a model (unless you’re a lab with billions to spend). But you can fine-tune one. This is where open models like LLaMA shine — take a foundation model and specialise it for your use case.

Method	What it trains	Cost	Best for
Full fine-tune	Everything	Massive	Labs with resources
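LoRA	Small added adapter matrices (typically under 1% of the parameter count)	Low	Most practical use cases
QLoRA	LoRA adapters on top of a 4-bit quantised base model	Very low	Consumer GPUs (24GB)
Prefix tuning	Learned prefix vectors only	Minimal	Quick experiments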
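
Some back-of-envelope arithmetic helps explain the cost column. The 7B parameter count is a hypothetical example, and the 16 bytes per parameter for full fine-tuning is a common rule of thumb for Adam with mixed precision, not a measured figure. Activations, adapter weights, and quantisation overhead are all ignored here, so real usage will be higher.

```python
def weights_gb(n_params, bits_per_weight):
    """Memory needed just to hold the weights, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7e9  # hypothetical 7B-parameter model

fp16_weights = weights_gb(n_params, 16)  # base model held in 16-bit
int4_weights = weights_gb(n_params, 4)   # base model quantised to 4-bit (QLoRA-style)

# Rule of thumb for full fine-tuning with Adam in mixed precision, per parameter:
# 2 bytes 16-bit weights + 2 bytes 16-bit gradients
# + 4 bytes 32-bit master weights + 8 bytes Adam moment estimates = 16 bytes.
full_finetune = n_params * 16 / 1e9

print(f"16-bit weights:     {fp16_weights:.1f} GB")
print(f"4-bit weights:      {int4_weights:.1f} GB")
print(f"Full fine-tune est: {full_finetune:.1f} GB")
```

Under these assumptions, only the 4-bit base model leaves comfortable headroom on a 24GB card, which is why QLoRA gets the "consumer GPUs" row.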

LoRA (Low-Rank Adaptation) is the sweet spot for most people. You freeze the original weights and train small low-rank adapter matrices alongside them. The result is often close to full fine-tuning quality at a fraction of the cost.
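
Here is a minimal pure-Python sketch of the LoRA idea for a single weight matrix. The layer sizes, rank, and alpha are illustrative assumptions. The zero initialisation of B and the alpha / rank scaling follow the original LoRA paper, but this is a toy, not how a library such as PEFT implements it.

```python
import random

def lora_param_count(d_out, d_in, rank):
    """Trainable parameters: full matrix versus its two low-rank adapters."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x). W stays frozen; only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# Parameter savings for one hypothetical 4096 x 4096 layer at rank 8.
full, lora = lora_param_count(4096, 4096, 8)
print(full, lora, f"{100 * lora / full:.2f}%")

# Tiny forward pass: 4-dimensional layer, rank 2.
random.seed(0)
d, rank, alpha = 4, 2, 4
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(rank)]  # trainable
B = [[0.0] * rank for _ in range(d)]                                # trainable, starts at zero
x = [1.0, 2.0, 3.0, 4.0]

print(lora_forward(W, A, B, x, alpha, rank) == matvec(W, x))
```

Because B starts at zero, the adapted layer behaves exactly like the original at step zero, and training only ever has to learn the difference. For this one layer the adapters hold about 0.39% as many parameters as the full matrix.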

Key Vocabulary

Term	Plain English
Loss	How wrong the model is (lower = better)
Learning rate	How big the correction steps are
Epoch	One complete pass through all training data
Overfitting	Model memorises examples instead of learning patterns
Tokenisation	Breaking text into sub-word pieces before the model sees it
Gradient descent	The algorithm that makes all of this work — follow the slope downhill
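
Several of these terms show up together in even the smallest training loop. The sketch below fits a single weight to made-up data with per-example gradient descent. The data, learning rate, and epoch count are arbitrary choices for illustration.

```python
# Toy data following y = 3x, invented for illustration.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]

def mse_loss(w):
    """Loss: mean squared error of the prediction w * x (lower = better)."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train(learning_rate=0.01, epochs=50):
    """Fit w by gradient descent and record the loss after each epoch."""
    w = 0.0
    history = []
    for _ in range(epochs):
        for x, y in data:                # one epoch = one full pass over the data
            grad = 2 * (w * x - y) * x   # slope of (w * x - y)^2 with respect to w
            w -= learning_rate * grad    # step downhill, sized by the learning rate
        history.append(mse_loss(w))
    return w, history

w, history = train()
print(w)
print(history[0], history[-1])
```

The loss falls epoch by epoch as w approaches 3. Rerunning with `train(learning_rate=0.2)` shows the other side of the learning rate trade-off: the steps overshoot and the loss grows instead of shrinking.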

What I’m Still Learning

The practical details of running a LoRA fine-tune end-to-end
How DPO compares to full RLHF in practice (it’s simpler but is it as good?)
Where the line is between “fine-tuning” and “just write a better prompt”

Go Deeper

AI Alignment: The deeper challenge of making models want the right things
Constitutional AI: Anthropic’s approach to alignment, which replaces much of the human preference labelling with AI feedback guided by written principles
Neural Networks: How the weights being updated actually work
Transformers: The architecture being trained
How LLMs Work: The full pipeline, training included

Best Resources

Andrej Karpathy, “Let’s build GPT”: Actually train a small Transformer
Hugging Face PEFT library: Practical LoRA/QLoRA implementation
Sebastian Raschka, “Build a Large Language Model (From Scratch)”: Thorough book-length treatment
“Training language models to follow instructions with human feedback” (InstructGPT paper): How SFT + RLHF was applied to instruction-following models