How LLMs Work

Created 2 May 2025

Tags: learning, llm, understanding, fundamentals

What is it?

A Large Language Model (LLM) is a Transformer neural network trained on vast amounts of text to predict the next token. Despite this simple objective, LLMs develop emergent capabilities: reasoning, coding, translation, analysis, and more.

The Pipeline: From Text to Response

1. Tokenisation

Text → sub-word tokens (using an algorithm such as BPE, often via a library such as SentencePiece)

"Hello, world!" → ["Hello", ",", " world", "!"]  (each mapped to an integer ID)

Vocabulary: typically ~32K-256K tokens, depending on the model
Tokens ≠ words (common words are usually 1 token; rare words split into several)
~4 characters per token on average (English)
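
As a rough illustration of how a BPE-style tokeniser arrives at the split above, here is a minimal sketch. The merge table `TOY_MERGES` is invented for this example (a real tokeniser learns its merges from a large corpus), and `TOY_VOCAB` simply reuses the IDs from the example in the next section.

```python
# Toy byte-pair-style encoder. The merge table is invented for this example;
# a real tokeniser learns tens of thousands of merges from a corpus.
TOY_MERGES = [
    ("l", "l"), ("H", "e"), ("He", "ll"), ("Hell", "o"),
    (" ", "w"), ("o", "r"), (" w", "or"), ("l", "d"), (" wor", "ld"),
]

# Illustrative token -> ID mapping (IDs taken from the note's own example).
TOY_VOCAB = {"Hello": 15496, ",": 11, " world": 995, "!": 0}


def bpe_encode(text, merges):
    """Split text into characters, then repeatedly apply the best-ranked merge."""
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(text)
    while len(tokens) > 1:
        # Rank every adjacent pair; pairs not in the merge table get infinity.
        candidates = [
            (rank.get(pair, float("inf")), i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no learned merge applies any more
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


tokens = bpe_encode("Hello, world!", TOY_MERGES)
ids = [TOY_VOCAB[t] for t in tokens]
print(tokens)  # ['Hello', ',', ' world', '!']
print(ids)     # [15496, 11, 995, 0]
```

Text with no matching merges stays as single characters, which is why rare words cost more tokens than common ones.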

2. Embedding

Token IDs → dense vectors (e.g., 4096 dimensions)

[15496, 11, 995, 0] → [[0.2, -0.1, ...], [0.5, 0.3, ...], ...]

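Positional information is added as well. Many modern models use RoPE, which rotates the query and key vectors inside attention rather than adding a position vector to the embeddings.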
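
The embedding step is essentially a table lookup. A minimal sketch, assuming a randomly initialised table in place of learned weights and a tiny dimension (8 rather than 4096); `make_embedding_table` and `embed` are hypothetical helper names:

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Random stand-in for a learned (vocab_size x dim) embedding matrix."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]


def embed(token_ids, table):
    """Embedding is a row lookup: one vector per token ID."""
    return [table[token_id] for token_id in token_ids]


table = make_embedding_table(vocab_size=1000, dim=8)
vectors = embed([15, 11, 995, 0], table)  # made-up IDs inside the toy vocabulary
print(len(vectors), len(vectors[0]))      # 4 8
```

The same ID always maps to the same vector; context only enters later, in the Transformer layers.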

3. Transformer Layers (× many)

Each layer applies:

Self-Attention: Each token “looks at” itself and all previous tokens and weighs what’s relevant
Feed-Forward Network: Processes each position independently (where much of the “knowledge” is thought to be stored)
Residual connections: Add each sub-layer’s input back to its output (helps with training deep networks)

Modern LLMs have 32-128+ layers, processing tokens through increasingly abstract representations.
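
To make the three ingredients concrete, here is a deliberately simplified single layer in plain Python. It is a sketch under stated assumptions, not a faithful implementation: one attention head, random weights, a ReLU feed-forward block, and no layer normalisation or positional encoding. All function and parameter names are illustrative.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(xs, Wq, Wk, Wv):
    """Single-head attention; position t only sees positions 0..t."""
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    scale = math.sqrt(len(qs[0]))
    out = []
    for t, q in enumerate(qs):
        scores = [sum(a * b for a, b in zip(q, ks[j])) / scale for j in range(t + 1)]
        weights = softmax(scores)
        # Weighted average of the value vectors of visible positions.
        out.append([sum(w * vs[j][i] for j, w in enumerate(weights))
                    for i in range(len(vs[0]))])
    return out


def feed_forward(x, W1, W2):
    """Position-wise network: expand, apply ReLU, project back."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)


def add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]


def transformer_layer(xs, p):
    attn = causal_self_attention(xs, p["Wq"], p["Wk"], p["Wv"])
    xs = [add(x, a) for x, a in zip(xs, attn)]           # residual connection
    ff = [feed_forward(x, p["W1"], p["W2"]) for x in xs]
    return [add(x, f) for x, f in zip(xs, ff)]           # residual connection


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_ff = 4, 8
params = {
    "Wq": random_matrix(d_model, d_model, rng),
    "Wk": random_matrix(d_model, d_model, rng),
    "Wv": random_matrix(d_model, d_model, rng),
    "W1": random_matrix(d_ff, d_model, rng),
    "W2": random_matrix(d_model, d_ff, rng),
}
xs = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(3)]
out = transformer_layer(xs, params)
print(len(out), len(out[0]))  # 3 4
```

Because of the causal mask, changing the last input vector leaves the outputs for earlier positions unchanged, which is what allows generation to proceed one token at a time.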

4. Output

Final hidden state → vocabulary projection → probability over all tokens

[hidden state] → linear layer → softmax → P("the") = 0.12, P("a") = 0.08, ...

Sample from this distribution → that’s your next token.
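
The output step can be sketched in a few lines. The four-word vocabulary, the hidden state, and the projection weights below are all made up for illustration; only the shape of the computation (linear layer, softmax, sample) follows the description above.

```python
import math
import random

VOCAB_WORDS = ["the", "a", "cat", "sat"]  # hypothetical 4-token vocabulary


def project_to_logits(hidden, W_out):
    """Linear layer: one score (logit) per vocabulary entry."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_out]


def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(probs, rng):
    """Draw one vocabulary index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
hidden = [0.5, -1.0, 0.25]  # made-up final hidden state
W_out = [[rng.gauss(0.0, 1.0) for _ in hidden] for _ in VOCAB_WORDS]
probs = softmax(project_to_logits(hidden, W_out))
next_token = VOCAB_WORDS[sample_token(probs, rng)]
print(dict(zip(VOCAB_WORDS, probs)), "->", next_token)
```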

5. Autoregressive Generation

"The capital of France" → P(next) → "is"
"The capital of France is" → P(next) → "Paris"
"The capital of France is Paris" → P(next) → "."

One token at a time, feeding output back as input.
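
The loop itself is simple. In this sketch the "model" is a hard-coded lookup table (`TOY_NEXT`, invented here) standing in for a real network that would return a sampled token; `generate` is a hypothetical helper name.

```python
def generate(next_token_fn, prompt_tokens, max_new_tokens, stop_token=None):
    """Repeatedly ask the model for one token and append it to the context."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = next_token_fn(tokens)  # the model sees everything generated so far
        tokens.append(nxt)
        if nxt == stop_token:
            break
    return tokens


# Stand-in for a real model: looks only at the last token.
TOY_NEXT = {"France": "is", "is": "Paris", "Paris": "."}


def toy_model(tokens):
    return TOY_NEXT.get(tokens[-1], ".")


result = generate(toy_model, ["The", "capital", "of", "France"], 10, stop_token=".")
print(result)  # ['The', 'capital', 'of', 'France', 'is', 'Paris', '.']
```

A real model conditions on the whole context rather than the last token, but the control flow is the same: predict, append, repeat until a stop token or a length limit.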

Key Concepts

Context Window

The maximum number of tokens the model can “see” at once.

GPT-4 Turbo: 128K tokens
Claude 3 family: 200K tokens
Gemini 1.5 Pro: 1M+ tokens

Temperature

Controls randomness in token sampling:

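0.0: Always pick the highest-probability token (greedy decoding; near-deterministic in practice)
0.7: Balanced (some creativity)
1.0+: More random (1.0 samples the model’s unscaled distribution; higher values flatten it, so output is more creative but may go off the rails)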
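
One common way to implement temperature is to divide the logits by it before the softmax. A minimal sketch, assuming temperature 0 is treated as a special case meaning "pick the argmax" (the example logits are made up):

```python
import math


def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalise to probabilities."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the top logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token; higher temperatures spread it across the alternatives.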

Attention Patterns

The model learns WHAT to pay attention to:

In “The cat sat on the mat because it was tired” → “it” attends strongly to “cat”
Different heads learn different patterns (syntax, semantics, position)

Where Knowledge Lives

Attention: Relationships between tokens (syntax, reference, reasoning chains)
Feed-Forward layers: Factual knowledge (one influential interpretation treats these layers as key-value memories)
Embeddings: Semantic meaning of individual tokens

The Big Questions

Do they “understand”? Debated. They model statistical patterns in text extremely well; whether that constitutes understanding is a philosophical question.
Why do they hallucinate? Training rewards fluent, plausible continuations, not verified facts. When uncertain, they tend to generate plausible-sounding text rather than saying “I don’t know.”
What are emergent capabilities? Abilities that appear at scale without being explicitly trained for (e.g., arithmetic, code, reasoning).

Resources

Andrej Karpathy “Intro to Large Language Models” (1-hour talk)
“The Illustrated Transformer” (Jay Alammar)
3Blue1Brown “But what is a GPT?” series
Anthropic’s “Scaling Monosemanticity” (what’s inside the model)
Transformers — The architecture
Training & Fine-Tuning — How they learn