How LLMs Work

Created 2 May 2025
Tags: learning, llm, understanding, fundamentals

What is it?

A Large Language Model (LLM) is a Transformer neural network trained on vast amounts of text to predict the next token. Despite this simple objective, LLMs develop emergent capabilities: reasoning, coding, translation, analysis, and more.

The Pipeline: From Text to Response

1. Tokenisation

Text → sub-word tokens (using an algorithm such as BPE or Unigram, often via a library like SentencePiece)

"Hello, world!" → ["Hello", ",", " world", "!"]  (each mapped to an integer ID)
  • Vocabulary: typically ~32K-256K tokens, depending on the model
  • Tokens ≠ words (common words = 1 token, rare words = multiple tokens)
  • ~4 characters per token on average (English)
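
To make the text → tokens → IDs step concrete, here is a minimal sketch. It uses greedy longest-match lookup against a tiny made-up vocabulary, which is a simplification: real BPE tokenisers apply learned merge rules instead. The names TOY_VOCAB and tokenize, and the IDs themselves, are hypothetical.

```python
# Toy vocabulary: token string -> integer ID (IDs invented for illustration).
TOY_VOCAB = {
    "Hello": 0, ",": 1, " world": 2, "!": 3,
    " ": 4, "w": 5, "o": 6, "r": 7, "l": 8, "d": 9, "H": 10, "e": 11,
}

def tokenize(text, vocab):
    """Greedy longest-match: at each position, take the longest vocabulary entry."""
    max_len = max(len(tok) for tok in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

tokens = tokenize("Hello, world!", TOY_VOCAB)   # common pieces: one token each
ids = [TOY_VOCAB[t] for t in tokens]
rare = tokenize("Held", TOY_VOCAB)              # unseen word: falls back to characters
print(tokens, ids, rare)
```

The second call shows the "rare words = multiple tokens" point: a word missing from the vocabulary is split into smaller known pieces.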

2. Embedding

Token IDs → dense vectors (e.g., 4096 dimensions)

[15496, 11, 995, 0] → [[0.2, -0.1, ...], [0.5, 0.3, ...], ...]

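Plus positional information. Older models add a position vector to each embedding; most modern models use RoPE, which instead rotates the query and key vectors inside each attention layer.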
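
The embedding step is just a table lookup, sketched below under toy assumptions: the table is filled with random numbers (in a trained model the values are learned), the sizes are tiny, and positional information is left out. VOCAB_SIZE, D_MODEL, and embed are hypothetical names.

```python
import random

VOCAB_SIZE = 8   # real models: tens of thousands of rows
D_MODEL = 4      # real models: e.g. 4096 columns

rng = random.Random(0)
# One row per token ID. Random here; learned during training in a real model.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Map each token ID to its row of the embedding table."""
    return [embedding_table[i] for i in token_ids]

vectors = embed([3, 1, 3])
print(len(vectors), len(vectors[0]))  # number of tokens, vector size
```

The same token ID always maps to the same vector, which is why position has to be injected separately.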

3. Transformer Layers (× many)

Each layer applies:

  1. Self-Attention: Each token “looks at” itself and all previous tokens and weighs which ones are relevant
  2. Feed-Forward Network: Processes each position independently (thought to be where much of the “knowledge” is stored)
  3. Residual connections and layer normalisation: Wrapped around both sub-layers; the residual adds each sub-layer's input back to its output, which helps when training deep networks

Modern LLMs have 32-128+ layers, processing tokens through increasingly abstract representations.
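
A single layer can be sketched in plain Python as follows. This is a heavily simplified illustration, not a faithful implementation: it uses one attention head, treats the input vectors themselves as queries, keys, and values (a real layer applies learned projection matrices first), replaces the feed-forward network with a bare ReLU, and omits layer normalisation. All function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x):
    """x: list of token vectors. Each position mixes vectors from positions 0..t."""
    d = len(x[0])
    out = []
    for t, query in enumerate(x):
        # Causal mask: position t only scores positions 0..t, never the future.
        scores = [
            sum(q * k for q, k in zip(query, x[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        mixed = [sum(w * x[s][j] for s, w in enumerate(weights)) for j in range(d)]
        out.append(mixed)
    return out

def feed_forward(vec):
    """Position-wise stand-in: ReLU only (a real FFN has learned linear maps around it)."""
    return [max(0.0, v) for v in vec]

def layer(x):
    attn = causal_self_attention(x)
    x = [[a + b for a, b in zip(xi, ai)] for xi, ai in zip(x, attn)]   # residual add
    x = [[a + b for a, b in zip(xi, feed_forward(xi))] for xi in x]    # residual add
    return x

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = layer(x)
# Changing only the last token must not affect earlier positions.
y_changed = layer([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(y[:2] == y_changed[:2])
```

The final check illustrates the causal property: outputs at earlier positions do not depend on later tokens.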

4. Output

Final hidden state (at the last position) → vocabulary projection → probability distribution over all tokens

[hidden state] → linear layer → softmax → P("the") = 0.12, P("a") = 0.08, ...

Sample from this distribution → that’s your next token.
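
The softmax-then-sample step can be sketched like this. The four-word vocabulary and the logit values are made up for illustration; a real model produces one logit per vocabulary entry.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "a", "cat", "Paris"]     # toy vocabulary
logits = [2.0, 1.6, 0.5, -1.0]           # made-up output of the linear layer

probs = softmax(logits)
rng = random.Random(0)                   # seeded so the run is repeatable
next_token = rng.choices(vocab, weights=probs, k=1)[0]
print(dict(zip(vocab, probs)), next_token)
```

Higher logits get higher probability, but any token with non-zero probability can be drawn.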

5. Autoregressive Generation

"The capital of France" → P(next) → "is"
"The capital of France is" → P(next) → "Paris"
"The capital of France is Paris" → P(next) → "."

One token at a time, feeding output back as input.
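
The loop itself is simple, as the sketch below shows. The "model" here is a hypothetical lookup table (NEXT) that conditions only on the last token, and decoding is greedy; a real LLM conditions on the whole context and usually samples.

```python
# Hypothetical stand-in for a model: last token -> next-token probabilities.
NEXT = {
    "France": {"is": 0.9, "was": 0.1},
    "is": {"Paris": 0.8, "a": 0.2},
    "Paris": {".": 1.0},
}

def generate(prompt_tokens, max_new_tokens=5, stop="."):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = NEXT.get(tokens[-1])
        if dist is None:                     # the toy table has no entry: stop
            break
        next_tok = max(dist, key=dist.get)   # greedy: most probable token
        tokens.append(next_tok)              # feed the output back as input
        if next_tok == stop:
            break
    return tokens

result = generate(["The", "capital", "of", "France"])
print(" ".join(result))
```

Each iteration appends one token and then runs the "model" again on the extended sequence, which is why generation cost grows with output length.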

Key Concepts

Context Window

The maximum number of tokens the model can “see” at once.

  • GPT-4 Turbo: 128K tokens
  • Claude 3: 200K tokens
  • Gemini 1.5 Pro: 1M+ tokens

Limits vary by model version and change often.

Temperature

Controls randomness in token sampling:

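  • 0.0: Always pick the highest-probability token (greedy decoding; deterministic in principle)
  • 0.7: Balanced (some creativity)
  • 1.0 and above: More random (1.0 samples from the unmodified distribution; higher values flatten it, which is more creative but may go off the rails)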
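
Mechanically, temperature divides the logits before the softmax. The sketch below assumes made-up logits and treats temperature 0 as a special case that puts all probability on the top token, since dividing by zero is undefined; softmax_with_temperature is a hypothetical name.

```python
import math

def softmax_with_temperature(logits, temperature):
    if temperature <= 0:
        # Limit case: all probability mass on the highest logit (greedy).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]   # low T sharpens, high T flattens
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                         # made-up scores
dists = {t: softmax_with_temperature(logits, t) for t in (0.0, 0.7, 1.0, 2.0)}
for t, p in dists.items():
    print(t, [round(v, 3) for v in p])
```

As the temperature rises, the probability of the top token falls and the distribution moves closer to uniform.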

Attention Patterns

The model learns WHAT to pay attention to:

  • In “The cat sat on the mat because it was tired” → “it” attends strongly to “cat”
  • Different heads learn different patterns (syntax, semantics, position)

Where Knowledge Lives

  • Attention: Relationships between tokens (syntax, reference, reasoning chains)
  • Feed-Forward layers: Factual knowledge, which interpretability research suggests is stored as key-value memories
  • Embeddings: Semantic meaning of individual tokens

The Big Questions

  • Do they “understand”? Debated. They model statistical patterns incredibly well. Whether that constitutes understanding is a philosophical question.
  • Why do they hallucinate? They are trained to produce fluent, plausible continuations, not to verify facts. When uncertain, they tend to generate plausible-sounding text rather than saying “I don’t know.”
  • What are emergent capabilities? Abilities that appear at scale without being explicitly trained for (e.g., arithmetic, code, reasoning).

Resources

  • Andrej Karpathy “Intro to Large Language Models” (1-hour talk)
  • “The Illustrated Transformer” (Jay Alammar)
  • 3Blue1Brown “But what is a GPT?” series
  • Anthropic’s “Scaling Monosemanticity” (what’s inside the model)
  • Transformers — The architecture
  • Training & Fine-Tuning — How they learn