Training & Fine-Tuning
You’ve probably used a fine-tuned model today without knowing it. ChatGPT, Claude, Gemini — none of them ship as raw “base models.” They go through a careful multi-stage process that turns a pattern-matching engine into something that feels helpful, honest, and safe.
Understanding how this works explains why AI behaves the way it does — the hallucinations, the guardrails, the occasional weird refusal. It’s all a product of training.
The Three Stages
Building a modern AI model like Claude or GPT-4 isn’t one process. It’s three, stacked on top of each other.
Stage 1: Pre-training
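What happens: The model is trained on a huge text corpus, much of it scraped from the public internet: trillions of words of books, websites, code, and conversations. Its only job is to predict the next token (roughly, the next word or word piece).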
What it produces: A “base model” that’s incredibly good at continuing text, but has no idea how to be helpful. Ask it a question and it might just… keep writing the question in different ways.
What it costs: Millions of dollars at minimum, and reportedly far more for frontier models. Thousands of GPUs or similar accelerators. Weeks to months of continuous compute.
This is the expensive part. Everything after is comparatively cheap.
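To make "predict the next token" concrete, here is a minimal sketch of the loss a model is scored on at a single position. The four-token vocabulary and the scores are made up for illustration; a real model does the same thing over tens of thousands of tokens and billions of positions.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy for one position: -log p(correct next token)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical vocabulary and scores for the context "the cat sat on the"
vocab = ["mat", "dog", "moon", "the"]
logits = [3.0, 0.5, 0.2, -1.0]

loss_good = next_token_loss(logits, vocab.index("mat"))  # model favoured this token
loss_bad = next_token_loss(logits, vocab.index("the"))   # model gave this low probability
print(f"loss if the true next token is 'mat': {loss_good:.3f}")
print(f"loss if the true next token is 'the': {loss_bad:.3f}")
```

The loss is low when the model put high probability on the token that actually came next, and high otherwise. Pre-training is essentially this calculation repeated across the whole corpus.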
Stage 2: Supervised Fine-Tuning (SFT)
What happens: Humans write thousands of examples: “Here’s a question, here’s how you should answer.” The model learns the format of being helpful.
What it produces: A model that follows instructions, answers questions, and feels conversational. Most of the “personality” comes from here.
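As a rough illustration of what one SFT training example can look like, the sketch below joins a prompt and a human-written answer and marks which tokens count towards the loss. Two things here are assumptions rather than anything stated above: the `User:` / `Assistant:` template is hypothetical, and training only on the response tokens is a common convention, not a universal rule. Whitespace splitting stands in for a real tokeniser.

```python
def build_sft_example(prompt, response):
    """Join prompt and response; mark which tokens contribute to the loss."""
    # Whitespace splitting stands in for a real tokeniser (assumption).
    prompt_tokens = f"User: {prompt} Assistant:".split()
    response_tokens = response.split()
    tokens = prompt_tokens + response_tokens
    # 0 = ignore this token (prompt), 1 = train on this token (response)
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

tokens, loss_mask = build_sft_example("What is 2+2?", "2+2 equals 4.")
for token, mask in zip(tokens, loss_mask):
    print(mask, token)
```

The objective is still next-token prediction. What changes is the data: the model now only sees text shaped like a question followed by a good answer.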
Stage 3: Alignment (RLHF / RLAIF / Constitutional AI)
What happens: The model generates responses, and raters (human or AI) rank them: “This answer is better than that one.” A reward model learns those preferences. The main model is then optimised, typically with reinforcement learning, to produce responses the reward model scores highly.
What it produces: A model that’s not just capable, but aligned — less likely to be harmful, more likely to say “I don’t know” when uncertain.
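The "reward model learns those preferences" step is often trained with a pairwise loss along these lines. This is a sketch of the standard Bradley-Terry style formulation, not any particular lab's implementation, and the reward values are invented.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# Hypothetical reward-model scores for two answers to the same prompt
agrees_with_rater = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=-1.0)
disagrees_with_rater = pairwise_preference_loss(reward_chosen=-1.0, reward_rejected=2.0)
print(f"reward model agrees with the rater:    {agrees_with_rater:.3f}")
print(f"reward model disagrees with the rater: {disagrees_with_rater:.3f}")
```

The loss shrinks as the reward model scores the preferred answer further above the rejected one, so minimising it teaches the reward model to reproduce the raters' rankings.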
Key variants:
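- RLHF: preferences come from human raters (popularised for chat models by OpenAI’s InstructGPT work)
- RLAIF: preferences come from an AI model guided by written principles (the basis of Anthropic’s Constitutional AI, which is used alongside human feedback)
- DPO: Direct Preference Optimisation, which trains directly on preference pairs (simpler, no separate reward model needed)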
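To show why DPO needs no reward model, here is a sketch of its loss for a single preference pair. It compares how much the model being trained (the "policy") has shifted probability towards the chosen answer, relative to a frozen reference copy. The log-probabilities and `beta=0.1` are illustrative values, not recommendations.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (chosen_shift - rejected_shift)))."""
    chosen_shift = policy_chosen_logp - ref_chosen_logp        # how much more likely the chosen answer became
    rejected_shift = policy_rejected_logp - ref_rejected_logp  # same for the rejected answer
    margin = beta * (chosen_shift - rejected_shift)
    return math.log(1.0 + math.exp(-margin))

# Policy identical to the reference: no preference learned yet
unchanged = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has moved towards the chosen answer and away from the rejected one
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(f"unchanged policy: {unchanged:.3f}")
print(f"improved policy:  {improved:.3f}")
```

Only sequence log-probabilities from two copies of the same model are needed, which is why DPO is usually described as the simpler option.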
Why Models Hallucinate
This is where understanding training actually helps you in practice.
During pre-training, the objective rewards plausible continuations, not true ones. Nothing in next-token prediction checks facts, and written text rarely stops mid-thought to say “I don’t know”, so the model learns that producing something fluent and confident-sounding is almost always the better bet.
Alignment tries to fix this, but it’s fighting against trillions of tokens of conditioning. The model’s first instinct is still to sound sure. That’s why RAG (grounding in real documents) and careful prompting matter so much.
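A minimal sketch of what "grounding in real documents" can look like at the prompt level. The wording of the instruction and the `build_grounded_prompt` helper are hypothetical. Real RAG systems also need a retrieval step, which is assumed to have already produced `documents` here.

```python
def build_grounded_prompt(question, documents):
    """Put retrieved text in the prompt and give the model explicit permission to abstain."""
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, reply \"I don't know\".\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Made-up retrieved passages for illustration
documents = [
    "LoRA freezes the base model and trains small adapter matrices.",
    "QLoRA applies the same idea to a quantised base model.",
]
prompt = build_grounded_prompt("What does LoRA train?", documents)
print(prompt)
```

This does not remove the model's pull towards sounding sure, but it gives it real text to draw on and an explicit way out when the sources fall short.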
Fine-Tuning Your Own Models
You can’t pre-train a frontier-scale model (unless you have a major lab’s compute budget). But you can fine-tune one. This is where open-weight models like Llama shine: take a foundation model and specialise it for your use case.
| Method | What it trains | Cost | Best for |
|---|---|---|---|
| Full fine-tune | Everything | Massive | Labs with resources |
| LoRA | Small added adapter matrices (often under 1% of the parameter count) | Low | Most practical use cases |
| QLoRA | Same adapters, on a 4-bit quantised base model | Very low | Consumer GPUs (e.g. 24 GB) |
| Prefix tuning | Learned prefix vectors only | Minimal | Quick experiments |
LoRA (Low-Rank Adaptation) is the sweet spot for most people. You freeze the original weights and train small low-rank adapter matrices alongside them. The result is often close to full fine-tuning at a fraction of the cost.
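The arithmetic behind "a fraction of the cost" can be sketched in a few lines. For one weight matrix of shape `d_out x d_in`, LoRA trains two thin matrices `A` (`rank x d_in`) and `B` (`d_out x rank`) instead. The layer size, rank, and `alpha` below are illustrative assumptions, and plain Python lists stand in for tensors.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters: full matrix vs. the two LoRA adapter matrices."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)
    return full, adapter

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

full, adapter = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")

# Tiny rank-1 example: with B initialised to zeros, the adapter changes nothing yet
W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weights (2 x 2)
A = [[0.5, -0.5]]             # rank x d_in (1 x 2)
B = [[0.0], [0.0]]            # d_out x rank (2 x 1), zero-initialised
x = [1.0, 1.0]
y = lora_forward(W, A, B, x, alpha=2, rank=1)
print(y)
```

Starting `B` at zero means the fine-tuned model begins as an exact copy of the base model, and training only has to learn the difference.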
Key Vocabulary
| Term | Plain English |
|---|---|
| Loss | How wrong the model is (lower = better) |
| Learning rate | How big the correction steps are |
| Epoch | One complete pass through all training data |
| Overfitting | Model memorises examples instead of learning patterns |
| Tokenisation | Breaking text into sub-word pieces before the model sees it |
| Gradient descent | The algorithm that makes all of this work — follow the slope downhill |
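Loss, learning rate, and gradient descent fit together in one toy loop. The sketch below minimises a made-up one-parameter loss, `(w - 3)^2`, where the best value is obviously `w = 3`. Real training does the same thing with billions of parameters and a loss computed from data.

```python
def loss(w):
    """How wrong the current parameter is (lower = better)."""
    return (w - 3.0) ** 2

def gradient(w):
    """Slope of the loss at w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate, steps):
    """Repeatedly step against the slope; record the loss after each step."""
    history = [loss(w)]
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
        history.append(loss(w))
    return w, history

w_final, history = gradient_descent(w=0.0, learning_rate=0.1, steps=50)
print(f"start loss: {history[0]:.3f}  end loss: {history[-1]:.6f}  w: {w_final:.4f}")

# A learning rate that is too large overshoots and makes things worse
w_bad, history_bad = gradient_descent(w=0.0, learning_rate=1.1, steps=50)
print(f"with learning_rate=1.1 the loss grows: {history_bad[-1] > history_bad[0]}")
```

For this particular toy loss, any learning rate above 1.0 diverges. In real models the safe range is found by experiment.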
What I’m Still Learning
- The practical details of running a LoRA fine-tune end-to-end
- How DPO compares to full RLHF in practice (it’s simpler but is it as good?)
- Where the line is between “fine-tuning” and “just write a better prompt”
Go Deeper
- AI Alignment: The deeper challenge of making models want the right things
- Constitutional AI: Anthropic’s approach to alignment, which replaces much of the human rating of harmful outputs with AI feedback guided by written principles
- Neural Networks: How the weights being updated actually work
- Transformers: The architecture being trained
- How LLMs Work: The full pipeline, training included
Best Resources
- Andrej Karpathy, “Let’s build GPT”: Actually train a small Transformer
- Hugging Face PEFT library: Practical LoRA/QLoRA implementation
- Sebastian Raschka, “Build a Large Language Model (From Scratch)”: Thorough book-length treatment
- “Training language models to follow instructions with human feedback” (InstructGPT paper): How SFT and RLHF were combined to train instruction-following models