Transformers

Updated 2 May 2025
Tags: learning, machine-learning, transformers, attention

Most of the large AI models you’ve heard of (GPT, Claude, Gemini, LLaMA) are built on the same architecture. It’s called the Transformer, and understanding it is like understanding the engine inside the car you’re already driving.

Introduced in the 2017 paper “Attention Is All You Need”, the Transformer dropped the recurrence that earlier sequence models relied on and built everything around a single powerful idea: attention.


The Core Insight

Older architectures (RNNs, LSTMs) read text one word at a time, left to right, like a typewriter. This made them slow to train, because each step has to wait for the one before it, and forgetful: by the time they reached the end of a long paragraph, much of the beginning had faded.

Transformers read everything at once. Every word can “look at” every other word simultaneously and decide what’s relevant. That’s attention.

Think of it this way: when you read the sentence “The cat sat on the mat because it was tired”, you instantly know “it” means “the cat.” A Transformer learns to make that same connection — by attending from “it” back to “cat.”


How It Works

Self-Attention (the magic bit)

Each word asks three questions:

  • Query: “What am I looking for?”
  • Key: “What do I have to offer?”
  • Value: “What information do I carry?”

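Every word’s Query gets compared against every word’s Key, including its own. The comparison is a dot product, and a softmax turns the scores into weights that sum to 1, so strong matches mean “pay attention here.” The result is a weighted mix of Values: a new representation that’s informed by context.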
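To make the Query/Key/Value idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It assumes the Query, Key and Value vectors are already given (a real model produces them from learned projection matrices, which are left out here), and the toy numbers and function names are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights): one output vector and one row of
    attention weights per query.
    """
    scale = math.sqrt(len(keys[0]))  # sqrt(d_k) scaling from the 2017 paper
    outputs, all_weights = [], []
    for q in queries:
        # Compare this word's Query against every word's Key.
        scores = [dot(q, k) / scale for k in keys]
        weights = softmax(scores)
        # Weighted mix of the Values.
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: 3 "words", each with 2-dimensional Q, K and V (invented numbers).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys    = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values  = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = self_attention(queries, keys, values)
```

Each row of `weights` says how much one word attends to every word in the sentence, and each entry in `outputs` is that word’s new, context-aware vector.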

Multi-Head Attention

One attention pattern isn’t enough. The model runs several attention “heads” in parallel — one might track grammar, another tracks meaning, another tracks who did what to whom. Together, they build a rich understanding.
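A rough sketch of the “several heads in parallel” idea: slice each vector into chunks, run attention separately on each chunk, then glue the results back together. This is a simplification. Real models apply learned projections before and after the heads, and those are omitted here. All names are hypothetical.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

def split_heads(vectors, num_heads):
    """Slice every vector into num_heads equal chunks, grouped by head."""
    head_dim = len(vectors[0]) // num_heads
    return [[vec[h * head_dim:(h + 1) * head_dim] for vec in vectors]
            for h in range(num_heads)]

def multi_head_attention(queries, keys, values, num_heads):
    if len(queries[0]) % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    # Each head sees only its own slice, so it can form its own pattern.
    per_head = [
        attention(q, k, v)
        for q, k, v in zip(split_heads(queries, num_heads),
                           split_heads(keys, num_heads),
                           split_heads(values, num_heads))
    ]
    # Concatenate the heads' outputs back into one vector per word.
    return [[x for head in per_head for x in head[t]]
            for t in range(len(queries))]

# Toy input: 3 words, 4 dimensions, 2 heads of 2 dimensions each.
x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.8], [1.0, 1.0, 0.0, 0.0]]
combined = multi_head_attention(x, x, x, num_heads=2)
```

Because each head only works on its own slice, the two heads above can end up with different attention patterns over the same three words.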

Positional Encoding

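Since the Transformer sees all words at once (no left-to-right), it needs another way to know word order. The original Transformer added a position-dependent signal to each word’s embedding. Many modern models instead use an approach called RoPE (Rotary Position Embeddings), which rotates the Query and Key vectors by an angle that depends on each word’s position.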
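Here is a small sketch of the rotation behind RoPE, assuming the formulation that rotates adjacent pairs of dimensions (some implementations pair the dimensions differently). The base of 10000 is the commonly used default, and the function name is made up. The useful property is that the Query/Key dot product ends up depending only on how far apart two words are, not on their absolute positions.

```python
import math

def rope(vector, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle."""
    dim = len(vector)
    if dim % 2 != 0:
        raise ValueError("vector size must be even")
    rotated = []
    for i in range(0, dim, 2):
        # Earlier pairs rotate quickly, later pairs rotate slowly.
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        rotated += [x * c - y * s, x * s + y * c]
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Invented Query and Key vectors.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same distance between the two words (2 positions), different absolute positions.
near_start = dot(rope(q, 5), rope(k, 3))
further_on = dot(rope(q, 12), rope(k, 10))
```

The two scores `near_start` and `further_on` agree up to floating-point error, which is what lets the model reason about relative distance between words.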

The Architecture

The original Transformer has two halves:

  • Encoder — reads and understands input (used by BERT)
  • Decoder — generates output token by token (used by GPT, Claude)

Most modern LLMs are decoder-only: they generate one token at a time, each one informed by everything that came before it.
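“Informed by everything that came before” is usually enforced with a causal mask: each token may attend to itself and to earlier tokens, never to later ones. A minimal sketch, assuming the Query and Key vectors are already given and using invented numbers:

```python
import math

def causal_attention_weights(queries, keys):
    """Attention weights where token t can only see tokens 0..t."""
    scale = math.sqrt(len(keys[0]))
    rows = []
    for t, q in enumerate(queries):
        visible = keys[:t + 1]  # hide every token after position t
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in visible]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Future tokens get exactly zero weight.
        rows.append(weights + [0.0] * (len(keys) - t - 1))
    return rows

# Toy example: 3 tokens with 2-dimensional Q and K.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys    = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
masked = causal_attention_weights(queries, keys)
```

The first row of `masked` puts all of its weight on the first token, since there is nothing earlier to look at, and only the last token gets to see the whole sequence.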


What This Unlocked

The Transformer didn’t just improve AI — it unified it. The same architecture now handles:

Domain    Examples
Language  GPT, Claude, Gemini, LLaMA
Images    Vision Transformers (ViT), DALL-E
Code      Codex, Claude Code, Copilot
Audio     Whisper (speech-to-text)
Video     Sora, Runway
Science   AlphaFold (protein structure)

One architecture to rule them all. That’s why this matters.


What I’m Still Working Through

  • How FlashAttention makes the quadratic cost of attention manageable
  • The “attention sink” phenomenon (why models allocate attention to the first token)
  • How sparse attention and mixture-of-experts (Mixtral) make scaling practical

Go Deeper

Best Resources

  • The Illustrated Transformer (Jay Alammar) — visual walkthrough, gold standard
  • 3Blue1Brown — “Attention in Transformers, visually explained”
  • Andrej Karpathy “Let’s build GPT” — code it from scratch in 2 hours