How LLMs Work
What is it?
A Large Language Model (LLM) is a Transformer neural network trained on vast amounts of text to predict the next token. Despite this simple objective, LLMs develop emergent capabilities: reasoning, coding, translation, analysis, and more.
The Pipeline: From Text to Response
1. Tokenisation
Text → sub-word tokens (using BPE or SentencePiece)
"Hello, world!" → ["Hello", ",", " world", "!"] (each mapped to an integer ID)
- Vocabulary: ~32K-100K tokens
- Tokens ≠ words (common words = 1 token, rare words = multiple tokens)
- ~4 characters per token on average (English)
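A minimal sketch of the text-to-ID step, assuming a tiny hand-written vocabulary. It uses greedy longest-match, which is a simplified stand-in and not real BPE or SentencePiece (those learn their pieces from data). `TOY_VOCAB` and `tokenize` are hypothetical names, and the IDs are illustrative only.

```python
# Toy vocabulary: sub-word piece -> integer ID (hypothetical, for illustration only).
TOY_VOCAB = {
    "Hello": 15496,
    ",": 11,
    " world": 995,
    "!": 0,
    " token": 1,
    "is": 2,
    "ation": 3,
}

def tokenize(text, vocab):
    """Greedy longest-match tokenisation: a simplified stand-in for BPE."""
    pieces = []
    i = 0
    max_len = max(len(p) for p in vocab)
    while i < len(text):
        # Try the longest candidate piece first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                pieces.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces

pieces = tokenize("Hello, world!", TOY_VOCAB)
ids = [TOY_VOCAB[p] for p in pieces]
print(pieces)  # ['Hello', ',', ' world', '!']
print(ids)     # [15496, 11, 995, 0]

# A rarer word falls apart into several pieces.
print(tokenize(" tokenisation", TOY_VOCAB))  # [' token', 'is', 'ation']
```

Note how the leading space belongs to the piece (" world"), which is why tokens do not line up with words.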
2. Embedding
Token IDs → dense vectors (e.g., 4096 dimensions)
[15496, 11, 995, 0] → [[0.2, -0.1, ...], [0.5, 0.3, ...], ...]
Plus positional information (RoPE in modern models).
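A rough sketch of the lookup and of the RoPE rotation, under stated assumptions: the table is random (nothing is trained), the sizes `VOCAB_SIZE` and `DIM` are tiny and hypothetical, and the token IDs are small so they fit the toy table. In real models RoPE is usually applied to the query and key vectors inside attention; it is applied directly to the embeddings here only to keep the example short.

```python
import math
import random

VOCAB_SIZE = 16   # tiny and hypothetical; real vocabularies are far larger
DIM = 8           # tiny and hypothetical; the notes mention e.g. 4096

random.seed(0)
# Embedding table: one vector per token ID (random here, learned in a real model).
embedding_table = [[random.gauss(0.0, 0.02) for _ in range(DIM)]
                   for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Look up one vector per token ID."""
    return [embedding_table[t] for t in token_ids]

def rope(vector, position, base=10000.0):
    """Rotate consecutive pairs of dimensions by a position-dependent angle."""
    out = list(vector)
    for i in range(0, len(vector), 2):
        theta = position * base ** (-i / len(vector))
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        out[i] = x * cos_t - y * sin_t
        out[i + 1] = x * sin_t + y * cos_t
    return out

vectors = embed([3, 7, 1])
positioned = [rope(v, pos) for pos, v in enumerate(vectors)]
print(len(positioned), "vectors of dimension", len(positioned[0]))
```

Because each pair is only rotated, the vector's length is unchanged; position is encoded purely in the angles.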
3. Transformer Layers (× many)
Each layer applies:
- Self-Attention: each token “looks at” itself and all previous tokens and weighs what is relevant
- Feed-Forward Network: processes each position independently (where much of the model’s “knowledge” is thought to be stored)
- Residual connections: add each sub-layer’s input back to its output (helps with training deep networks)
Modern LLMs have 32-128+ layers, processing tokens through increasingly abstract representations.
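The three pieces of a layer can be sketched in plain Python. This is a simplified illustration, not a faithful implementation: it uses a single attention head, random weights, a ReLU feed-forward network, and it leaves out layer normalisation, multiple heads and the attention output projection. All names and sizes are hypothetical.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def causal_attention_weights(queries, keys):
    """Row t holds token t's weights over positions 0..t; later positions get 0."""
    dim = len(queries[0])
    weights = []
    for t, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(dim)
                  for s in range(t + 1)]
        weights.append(softmax(scores) + [0.0] * (len(keys) - t - 1))
    return weights

def transformer_layer(xs, params):
    """One simplified layer: attention + residual, then feed-forward + residual."""
    wq, wk, wv, w1, w2 = params
    queries = [matvec(wq, x) for x in xs]
    keys = [matvec(wk, x) for x in xs]
    values = [matvec(wv, x) for x in xs]
    weights = causal_attention_weights(queries, keys)
    out = []
    for t, x in enumerate(xs):
        # Weighted sum of the value vectors this token is allowed to see.
        attended = [sum(weights[t][s] * values[s][d] for s in range(len(xs)))
                    for d in range(len(x))]
        h = [a + b for a, b in zip(x, attended)]        # residual connection
        hidden = [max(0.0, v) for v in matvec(w1, h)]   # feed-forward, ReLU
        ffn = matvec(w2, hidden)
        out.append([a + b for a, b in zip(h, ffn)])     # residual connection
    return out

rng = random.Random(0)
DIM, HIDDEN, SEQ = 4, 16, 3
params = (
    random_matrix(DIM, DIM, rng), random_matrix(DIM, DIM, rng),
    random_matrix(DIM, DIM, rng),
    random_matrix(HIDDEN, DIM, rng), random_matrix(DIM, HIDDEN, rng),
)
xs = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(SEQ)]
ys = transformer_layer(xs, params)
print(len(ys), "output vectors of dimension", len(ys[0]))
```

The causal mask is what makes generation possible: changing a later token never changes the output at an earlier position. The feed-forward step touches one position at a time, as the list above says.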
4. Output
Final hidden state → vocabulary projection → probability over all tokens
[hidden state] → linear layer → softmax → P("the") = 0.12, P("a") = 0.08, ...
Sample from this distribution → that’s your next token.
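As a small worked example of this step, assuming a hypothetical four-token vocabulary and made-up weights (a real model projects onto the full vocabulary with learned weights):

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def project(hidden, unembedding):
    """Linear layer: one logit per vocabulary entry."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembedding]

vocab = ["the", "a", "cat", "sat"]        # hypothetical vocabulary
hidden = [0.5, -1.0, 0.25]                # hypothetical final hidden state
unembedding = [                           # hypothetical weights, one row per token
    [1.0, 0.0, 0.0],
    [0.0, -1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
]

logits = project(hidden, unembedding)     # [0.5, 1.0, 0.25, -0.5]
probs = softmax(logits)
rng = random.Random(0)
next_token = rng.choices(vocab, weights=probs, k=1)[0]

for token, p in zip(vocab, probs):
    print(f'P("{token}") = {p:.2f}')
print("sampled:", next_token)
```

Sampling with `random.choices` means the most likely token is picked most often, but not always.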
5. Autoregressive Generation
"The capital of France" → P(next) → "is"
"The capital of France is" → P(next) → "Paris"
"The capital of France is Paris" → P(next) → "."
One token at a time, feeding output back as input.
Key Concepts
Context Window
The maximum number of tokens the model can “see” at once.
- GPT-4 Turbo: 128K tokens
- Claude 3: 200K tokens
- Gemini 1.5 Pro: 1M+ tokens
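When input exceeds the window, something has to give. One simple policy, sketched below, keeps only the most recent tokens and reserves room for the reply. The function `fit_to_context` and its parameters are hypothetical; real applications may summarise or retrieve older content instead of dropping it.

```python
def fit_to_context(token_ids, context_window, reserve_for_output=0):
    """Keep only the most recent tokens that fit, leaving room for the reply."""
    budget = context_window - reserve_for_output
    if budget <= 0:
        raise ValueError("nothing left for the prompt")
    return token_ids[-budget:]

history = list(range(10))                       # stand-in for 10 token IDs
print(fit_to_context(history, 4))               # [6, 7, 8, 9]
print(fit_to_context(history, 8, reserve_for_output=5))  # [7, 8, 9]
```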
Temperature
Controls randomness in token sampling:
- 0.0: always pick the highest-probability token (greedy, effectively deterministic)
- 0.7: balanced (some creativity)
- 1.0+: more random (creative, but may go off the rails)
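Mechanically, temperature divides the logits before the softmax. The sketch below uses made-up logits and hypothetical function names; it treats 0.0 as a special case (greedy selection), since dividing by zero is undefined.

```python
import math
import random

def temperature_probs(logits, temperature):
    """Softmax over logits divided by the temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature, rng):
    """Return the index of the chosen token."""
    if temperature == 0.0:
        # Treat 0.0 as greedy decoding to avoid dividing by zero.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5]          # hypothetical scores for three tokens
rng = random.Random(0)
for t in (0.5, 0.7, 1.0, 1.5):
    top = temperature_probs(logits, t)[0]
    print(f"T={t}: P(top token) = {top:.2f}")
print("greedy pick:", sample(logits, 0.0, rng))
```

Lower temperatures concentrate probability on the top token; higher ones flatten the distribution, so unlikely tokens are chosen more often.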
Attention Patterns
The model learns WHAT to pay attention to:
- In “The cat sat on the mat because it was tired” → “it” attends strongly to “cat”
- Different heads learn different patterns (syntax, semantics, position)
Where Knowledge Lives
- Attention: Relationships between tokens (syntax, reference, reasoning chains)
- Feed-Forward layers: Factual knowledge stored as key-value memories
- Embeddings: Semantic meaning of individual tokens
The Big Questions
- Do they “understand”? Debated. They model statistical patterns in text very accurately; whether that constitutes understanding is a philosophical question.
- Why do they hallucinate? They are trained to produce fluent, confident-sounding text. When uncertain, they generate plausible-sounding text rather than saying “I don’t know.”
- What are emergent capabilities? Abilities that appear at scale without being explicitly trained for (e.g., arithmetic, code, reasoning).
Resources
- Andrej Karpathy “Intro to Large Language Models” (1-hour talk)
- “The Illustrated Transformer” (Jay Alammar)
- 3Blue1Brown “But what is a GPT?” series
- Anthropic’s “Scaling Monosemanticity” (what’s inside the model)
- Transformers — The architecture
- Training & Fine-Tuning — How they learn