Transformers

Updated 2 May 2025



Most of the AI models you’ve heard of (GPT, Claude, Gemini, LLaMA) are built on the same architecture. It’s called the Transformer, and understanding it is like understanding the engine inside the car you’re already driving.

Introduced in a 2017 paper called Attention Is All You Need, the Transformer dropped the recurrence that earlier sequence models relied on and built the whole model around a single powerful idea: attention.

The Core Insight

Older architectures (RNNs, LSTMs) read text one word at a time, left to right, like a typewriter. This made them slow to train, because each step had to wait for the one before it, and forgetful: by the time they reached the end of a long paragraph, much of the beginning had faded.

Transformers read everything at once. Every word can “look at” every other word simultaneously and decide what’s relevant. That’s attention.

Think of it this way: when you read the sentence “The cat sat on the mat because it was tired”, you instantly know “it” means “the cat.” A Transformer learns to make that same connection — by attending from “it” back to “cat.”

How It Works

Self-Attention (the magic bit)

Each word gets three vectors, and each one answers a different question:

Query: “What am I looking for?”
Key: “What do I have to offer?”
Value: “What information do I carry?”

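Every word’s Query gets compared against every word’s Key, including its own. The comparison is a dot product, and the scores are scaled and passed through a softmax so they become weights that sum to 1. High weights mean “pay attention here.” The result is a weighted mix of Values: a new representation that’s informed by context.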
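To make the Query/Key/Value comparison concrete, here is a minimal sketch of scaled dot-product attention in plain Python. The function names and the toy vectors are my own illustration, not taken from any library. In a real model the Queries, Keys, and Values come from learned projections of the word embeddings, and the arithmetic runs on an array library rather than on lists.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this word's Query against every word's Key.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Blend the Values using those weights.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hand-picked toy vectors (hypothetical, not learned).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
queries = [[4.0, 0.0]]

outputs, weights = attention(queries, keys, values)
print(weights[0])  # the first and third Keys match the Query best
print(outputs[0])  # so the output leans toward their Values
```

The Query here points along the first dimension, so the Keys that share that direction get most of the weight and the second Key gets very little.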

Multi-Head Attention

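One attention pattern isn’t enough. The model runs several attention “heads” in parallel, each with its own learned Query, Key, and Value projections. Nobody assigns the heads their jobs, but one might end up tracking grammar, another meaning, another who did what to whom. Their outputs are concatenated and mixed back into a single representation per word.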
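A rough sketch of the "several heads in parallel" idea, with one big simplification that I want to flag: real models give each head its own learned projections and add an output projection at the end, while this version just slices each vector into chunks and uses the same chunk as Query, Key, and Value. All names here are hypothetical. It only shows the split, attend, concatenate shape of the computation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention for a single head."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def split_heads(vectors, n_heads):
    """Slice every vector into n_heads equal chunks: one list of vectors per head."""
    d_head = len(vectors[0]) // n_heads
    return [[v[h * d_head:(h + 1) * d_head] for v in vectors]
            for h in range(n_heads)]

def multi_head_attention(x, n_heads):
    """Self-attention with Q = K = V = x (learned projections omitted for brevity)."""
    assert len(x[0]) % n_heads == 0, "vector size must divide evenly across heads"
    per_head = [attend(h, h, h) for h in split_heads(x, n_heads)]
    # Concatenate each head's output back into one vector per word.
    return [[val for head in per_head for val in head[t]]
            for t in range(len(x))]

# Three toy "words", four dimensions each, split across two heads.
x = [[1.0, 0.0, 0.0, 2.0],
     [0.0, 1.0, 2.0, 0.0],
     [1.0, 1.0, 1.0, 1.0]]
y = multi_head_attention(x, n_heads=2)
print(y)
```

Each head only ever sees its own slice, which is why different heads are free to settle on different attention patterns.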

Positional Encoding

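Since the Transformer sees all words at once (no left-to-right), it needs another way to know word order. The original paper added sinusoidal position signals to the word embeddings. Many modern models instead use an approach called RoPE (Rotary Position Embeddings), which rotates the Query and Key vectors inside attention by an angle that depends on position.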
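Here is a small sketch of the rotary idea behind RoPE, under a couple of assumptions: the function name is mine, the base of 10000 is the commonly cited default, and consecutive dimensions are paired up (real implementations differ in how they pair them). Each pair of dimensions is rotated by an angle proportional to the word’s position, and the useful consequence is that the Query-Key dot product ends up depending only on how far apart two words are.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by an angle proportional to pos."""
    d = len(vec)
    assert d % 2 == 0, "needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # earlier pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy Query and Key vectors (hypothetical values).
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

near_start = dot(rope(q, 3), rope(k, 1))    # positions 3 and 1, two apart
further_on = dot(rope(q, 10), rope(k, 8))   # positions 10 and 8, also two apart
print(near_start, further_on)  # equal up to floating-point rounding
```

Shifting both words by the same amount leaves the score unchanged, so attention sees relative position rather than absolute position.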

The Architecture

The original Transformer has two halves:

Encoder — reads and understands input (used by BERT)
Decoder — generates output token by token (used by GPT, Claude)

Most modern LLMs are decoder-only: they just generate, one token at a time, with each new token informed by everything that came before.
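Two pieces make "one token at a time, informed by everything that came before" work: a causal mask, so a position can only attend to itself and earlier positions, and a loop that feeds each new token back in. The sketch below shows both. The names are hypothetical, and toy_model is a stand-in for a trained network, not a real one.

```python
import math

def causal_weights(scores):
    """Row t becomes a softmax over columns 0..t; later (future) columns get weight 0."""
    n = len(scores)
    weights = []
    for t in range(n):
        visible = scores[t][:t + 1]  # only the past and the present
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - t - 1))
    return weights

def generate(next_token, prompt, max_new_tokens):
    """Autoregressive loop: pick one token from the full prefix, append it, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens

def toy_model(tokens):
    """Stand-in for a trained model: always predicts the last token plus one."""
    return tokens[-1] + 1

# With equal raw scores, each row spreads its attention evenly over what it can see.
w = causal_weights([[0.0] * 3 for _ in range(3)])
print(w)

print(generate(toy_model, [1, 2], max_new_tokens=3))  # [1, 2, 3, 4, 5]
```

The first row can only look at itself, the second splits its attention over two positions, and so on down the triangle.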

What This Unlocked

The Transformer didn’t just improve AI; it unified much of it. The same architecture, or a close variant of it, now shows up across domains:

Domain	Examples
Language	GPT, Claude, Gemini, LLaMA
Images	Vision Transformers (ViT), DALL-E
Code	Codex, Claude Code, Copilot
Audio	Whisper (speech-to-text)
Video	Sora, Runway
Science	AlphaFold (protein structure)

One architecture to rule them all. That’s why this matters.

What I’m Still Working Through

How FlashAttention makes the quadratic cost of attention manageable
The “attention sink” phenomenon (why models allocate a disproportionate share of attention to the first token)
How sparse attention and mixture-of-experts (Mixtral) make scaling practical
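The quadratic cost in the first item above can be made concrete with a little arithmetic. The head count and the 2 bytes per score below are assumptions I picked for illustration, not figures from any particular model, and this only shows why the cost is called quadratic. It does not model FlashAttention itself.

```python
def score_matrix_bytes(seq_len, n_heads=32, bytes_per_value=2):
    """Memory to hold every Query-Key score for one layer, if stored all at once."""
    return n_heads * seq_len * seq_len * bytes_per_value

for n in (1024, 2048, 4096):
    mib = score_matrix_bytes(n) / 2**20
    print(f"{n} tokens: {mib:.0f} MiB of scores")
```

Doubling the sequence length quadruples the number of scores, because every token is compared against every token.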

Go Deeper

Attention Is All You Need — The original paper, annotated
Neural Networks — If attention doesn’t click yet, start here
Embeddings — How words become numbers before attention happens
Training & Fine-Tuning — How Transformers actually learn
How LLMs Work — The full pipeline from text in to text out

Best Resources

The Illustrated Transformer (Jay Alammar) — visual walkthrough, gold standard
3Blue1Brown — “Attention in Transformers, visually explained”
Andrej Karpathy “Let’s build GPT” — code it from scratch in 2 hours