Transformers
Nearly every large language model you’ve heard of (GPT, Claude, Gemini, LLaMA) is built on the same architecture. It’s called the Transformer, and understanding it is like understanding the engine inside the car you’re already driving.
Introduced in the 2017 paper Attention Is All You Need, the Transformer dropped the recurrence and convolutions that earlier sequence models relied on in favor of a single powerful idea: attention.
The Core Insight
Older architectures (RNNs, LSTMs) read text one word at a time, left to right, like a typewriter. This made them slow to train, because each step had to wait for the one before it, and forgetful: by the time they reached the end of a long paragraph, much of the beginning had faded.
Transformers read everything at once. Every word can “look at” every other word simultaneously and decide what’s relevant. That’s attention.
Think of it this way: when you read the sentence “The cat sat on the mat because it was tired”, you instantly know “it” means “the cat.” A Transformer learns to make that same connection — by attending from “it” back to “cat.”
How It Works
Self-Attention (the magic bit)
Each word asks three questions:
- Query: “What am I looking for?”
- Key: “What do I have to offer?”
- Value: “What information do I carry?”
Every word’s Query gets compared against every word’s Key (its own included) using a dot product. A softmax turns those scores into weights, so strong matches mean “pay attention here.” The result is a weighted mix of Values: a new representation that’s informed by context.
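The Query/Key/Value comparison can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the toy vectors `Q`, `K`, and `V` are made up, whereas a real model produces them from the token embeddings with learned projection matrices. The division by the square root of the key dimension follows the scaled dot-product attention described in the original paper.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over one sequence.

    Each argument is a list with one vector per token.
    Returns (outputs, weights).
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's Query with every token's Key.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted mix of the Value vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy vectors, invented for illustration (not learned).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = self_attention(Q, K, V)
```

Each row of `attn` sums to 1 and says how much one token draws from every token in the sequence; `out` holds the resulting context-aware vectors.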
Multi-Head Attention
One attention pattern isn’t enough. The model runs several attention “heads” in parallel — one might track grammar, another tracks meaning, another tracks who did what to whom. Together, they build a rich understanding.
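One common way to get several heads is to slice each vector into equal chunks and run attention on each chunk independently, then concatenate the results. The sketch below shows only that split-and-concatenate step, under simplifying assumptions: real implementations also apply learned projections before the split and an output projection after the concatenation, both omitted here. The helper names and the toy input `x` are illustrative.

```python
import math

def attention(q_vecs, k_vecs, v_vecs):
    # Scaled dot-product attention for a single head.
    d_k = len(k_vecs[0])
    out = []
    for q in q_vecs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in k_vecs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, v_vecs))
                    for i in range(len(v_vecs[0]))])
    return out

def split_heads(vectors, num_heads):
    # Slice each vector into num_heads chunks of equal size.
    d_head = len(vectors[0]) // num_heads
    return [[vec[h * d_head:(h + 1) * d_head] for vec in vectors]
            for h in range(num_heads)]

def multi_head_attention(q, k, v, num_heads):
    assert len(q[0]) % num_heads == 0, "vector size must divide evenly across heads"
    heads = zip(split_heads(q, num_heads),
                split_heads(k, num_heads),
                split_heads(v, num_heads))
    head_outputs = [attention(qh, kh, vh) for qh, kh, vh in heads]
    # Concatenate the per-head results back into one vector per token.
    return [[x for head in head_outputs for x in head[t]]
            for t in range(len(q))]

# Toy input: 3 tokens, 4 dimensions, split across 2 heads.
x = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 0.5, -0.5],
     [1.0, 1.0, 0.0, 0.0]]
mha_out = multi_head_attention(x, x, x, num_heads=2)
```

Because each head only sees its own slice, the heads are free to settle on different attention patterns over the same sequence.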
Positional Encoding
Since the Transformer sees all words at once (no left-to-right), it needs another way to know word order. The original paper added sinusoidal position signals to the token embeddings. Many modern models instead use RoPE (Rotary Position Embeddings), which rotates the Query and Key vectors by an angle that depends on position.
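A rough sketch of the RoPE idea: treat each pair of dimensions in a Query or Key vector as a 2D point and rotate it by an angle proportional to the token's position. The function name `rope`, the default `base` of 10000, and the choice to pair adjacent dimensions are assumptions for illustration; implementations differ in how they pair dimensions.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of dimensions by a position-dependent angle."""
    d = len(vec)
    assert d % 2 == 0, "vector length must be even"
    out = []
    for i in range(0, d, 2):
        # Earlier pairs rotate quickly, later pairs rotate slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy Query and Key vectors, invented for illustration.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same offset between the two positions (2), different absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
```

The two scores come out equal up to floating-point error: after the rotation, the Query-Key dot product depends only on how far apart the two tokens are, not on where they sit in the sequence.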
The Architecture
The original Transformer has two halves:
- Encoder — reads and understands input (used by BERT)
- Decoder — generates output token by token (used by GPT, Claude)
Most modern LLMs are decoder-only: they generate one token at a time, with each new token informed by everything that came before.
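"Informed by everything that came before" is usually enforced with a causal mask: token i may attend to positions up to and including i, and never to later ones. The sketch below applies that rule to a made-up matrix of raw scores. Real implementations typically set the masked scores to negative infinity before the softmax, which gives the same result; the function name is hypothetical.

```python
import math

def causal_attention_weights(scores):
    """Row i of `scores` holds token i's raw scores against every token.

    Positions after i are hidden before the softmax, so they get weight 0.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][:i + 1]  # only positions 0..i
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (n - i - 1)
        weights.append(row)
    return weights

# Made-up raw scores for a 3-token sequence.
scores = [[0.2, 1.5, -0.3],
          [0.9, 0.1, 2.0],
          [1.0, 1.0, 1.0]]
causal = causal_attention_weights(scores)
```

The first token can only attend to itself, so its row puts all its weight on position 0; the last token sees the whole sequence.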
What This Unlocked
The Transformer didn’t just improve AI; it unified much of it. The same core architecture, with domain-specific variations, now handles:
| Domain | Examples |
|---|---|
| Language | GPT, Claude, Gemini, LLaMA |
| Images | Vision Transformers (ViT), DALL-E |
| Code | Codex, Claude Code, Copilot |
| Audio | Whisper (speech-to-text) |
| Video | Sora, Runway |
| Science | AlphaFold (protein structure) |
One architecture to rule them all. That’s why this matters.
What I’m Still Working Through
- How FlashAttention makes the quadratic cost of attention manageable (the computation stays quadratic; the savings come from how memory is used)
- The “attention sink” phenomenon (why models allocate attention to the first token)
- How sparse attention and mixture-of-experts (Mixtral) make scaling practical
Go Deeper
- Attention Is All You Need — The original paper, annotated
- Neural Networks — If attention doesn’t click yet, start here
- Embeddings — How words become numbers before attention happens
- Training & Fine-Tuning — How Transformers actually learn
- How LLMs Work — The full pipeline from text in to text out
Best Resources
- The Illustrated Transformer (Jay Alammar) — visual walkthrough, gold standard
- 3Blue1Brown — “Attention in Transformers, visually explained”
- Andrej Karpathy “Let’s build GPT” — code it from scratch in 2 hours