
Scaling Laws for Neural Language Models

Created 2 May 2025

Tags: paper, scaling, foundational, compute, training

Scaling Laws for Neural Language Models (2020)

This paper changed how the entire industry thinks about AI. Before it, building better models was guesswork: try different architectures, hope for the best. After it, there was a formula: more compute, more data, and more parameters yield predictably better performance, with the loss falling along smooth power-law curves.

That insight is why companies are spending billions on NVIDIA GPUs. It’s why the AI race is partly a capital race. It’s the empirical foundation beneath the entire scaling era.

What It Found

Jared Kaplan and collaborators at Johns Hopkins University and OpenAI trained hundreds of language models at different sizes and measured performance systematically. The findings:

1. Performance (measured as test loss) improves as a power law in each of three variables, as long as the other two are not the bottleneck:

Model size (number of parameters)
Dataset size (tokens of training data)
Compute budget (FLOPs spent training)

2. These relationships are smooth and predictable. You can plot a curve with small models and predict how a 10x bigger model will perform before building it.

3. Model size matters most. Given a fixed compute budget, you’re better off training a larger model for fewer steps than a smaller model for longer. (This finding was later refined by the Chinchilla paper.)

4. Architecture matters less than scale. Whether you tweak attention heads, layer width, or depth — the scaling curves barely shift. Size dominates.
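
Findings 1 and 2 can be illustrated with a small sketch. It assumes the simplest single-variable form, L(N) = (N_c / N)^alpha, fits it to a handful of small "models" by linear regression in log-log space, and then extrapolates to a model 10x larger than the biggest one in the fit. The data is synthetic, and the constants `TRUE_ALPHA` and `TRUE_NC` are illustrative assumptions rather than the paper's fitted values; the function names are hypothetical too.

```python
import numpy as np

# Hypothetical constants, used only to generate synthetic data.
# They are illustrative assumptions, not the paper's fitted values.
TRUE_ALPHA = 0.08
TRUE_NC = 1e14


def power_law_loss(n_params, n_c, alpha):
    """Loss as a pure power law in parameter count: L(N) = (N_c / N) ** alpha."""
    return (n_c / n_params) ** alpha


def fit_power_law(n_params, losses):
    """Fit L(N) = (N_c / N) ** alpha with a straight line in log-log space.

    log L = alpha * log N_c - alpha * log N, so the slope is -alpha and
    the intercept is alpha * log N_c.
    """
    slope, intercept = np.polyfit(np.log(n_params), np.log(losses), 1)
    alpha = -slope
    n_c = np.exp(intercept / alpha)
    return n_c, alpha


# Synthetic "small model" runs: the power law plus a little multiplicative noise.
rng = np.random.default_rng(0)
small_sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
noise = rng.normal(0.0, 0.005, size=small_sizes.shape)
observed = power_law_loss(small_sizes, TRUE_NC, TRUE_ALPHA) * np.exp(noise)

# Fit on the small models, then predict a model 10x larger than the largest one.
fit_nc, fit_alpha = fit_power_law(small_sizes, observed)
predicted = power_law_loss(1e9, fit_nc, fit_alpha)
actual = power_law_loss(1e9, TRUE_NC, TRUE_ALPHA)

print(f"fitted alpha: {fit_alpha:.4f}")
print(f"predicted loss at 1e9 params: {predicted:.4f}")
print(f"loss from the generating law: {actual:.4f}")
```

A power law is a straight line on a log-log plot, which is why an ordinary linear fit is enough here. Real training runs are noisier and the law eventually bends, so an extrapolation like this is a planning estimate, not a guarantee.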

Why It Matters

This paper gave the field a roadmap. Instead of searching for clever architectural tricks, labs could simply scale — and know roughly what they’d get.

It’s the intellectual justification for:

OpenAI’s bet on GPT-3 (175B parameters) and GPT-4
The massive compute investments ($100B+ across the industry)
The “scaling hypothesis” — the idea that intelligence might emerge primarily from scale
Anthropic’s founding thesis (Kaplan and Amodei are both co-founders)

It also created a strategic problem: if capability is mostly a function of capital, then AI progress concentrates among those who can spend the most. That has implications for competition, safety, and power.

The Key Insight in Plain Terms

Imagine you’re building a factory. This paper showed empirically that, for AI:

A factory twice as big produces products that are predictably better
The improvement follows a mathematical law, not random luck
You can calculate in advance whether it’s worth building the bigger factory

That predictability is what turned AI from a research field into an arms race.

What Came After

Chinchilla (2022, DeepMind) — Refined the scaling laws. Argued that Kaplan et al. undervalued data: models were being made too big and trained on too little data. Its “compute-optimal” recipe scales parameters and training tokens in roughly equal proportion, rather than putting most of the extra compute into model size as originally proposed.

GPT-3 and beyond — Direct application of these findings. “We know it’ll work. Let’s just go bigger.”

The capital race — If performance is a function of compute, and compute is a function of money, then AI becomes a spending competition. This paper is why.
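
The link between compute, model size, and data in the Chinchilla refinement can be made concrete with some back-of-the-envelope arithmetic. This sketch leans on two widely used approximations, both of which are assumptions here rather than exact laws: training compute is about 6 x parameters x tokens FLOPs, and the Chinchilla result is often summarized as roughly 20 training tokens per parameter. The function names are hypothetical.

```python
import math

# Assumed rule of thumb often used to summarize the Chinchilla result.
# It is an approximation, not an exact constant.
TOKENS_PER_PARAM = 20.0


def training_flops(n_params, n_tokens):
    """Approximate training compute: C ~= 6 * N * D FLOPs."""
    return 6.0 * n_params * n_tokens


def compute_optimal_split(flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (parameters, tokens) with D = tokens_per_param * N.

    Substituting D into C = 6 * N * D gives C = 6 * tokens_per_param * N ** 2,
    so N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


# A GPT-3-sized run: 175B parameters, with the commonly cited ~300B training tokens.
budget = training_flops(175e9, 300e9)
n_opt, d_opt = compute_optimal_split(budget)

print(f"budget: {budget:.2e} FLOPs")
print(f"tokens per parameter as trained: {300e9 / 175e9:.1f}")
print(f"rule-of-thumb split: {n_opt:.2e} params, {d_opt:.2e} tokens")
```

Under these assumptions, the same budget would go to a model of roughly 51B parameters trained on about 1T tokens: a much smaller model fed far more data, which is the direction of the Chinchilla correction.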

The Authors

Jared Kaplan — Lead author. Physics professor at Johns Hopkins University who turned his attention to AI scaling. Co-founded Anthropic.
Sam McCandlish — Co-author. Also co-founded Anthropic.
Tom Brown — Later lead author of the GPT-3 paper.
Dario Amodei — Senior author. VP Research at OpenAI at the time, later co-founded Anthropic.

Notice the pattern: several authors of this paper went on to found Anthropic. The scaling laws told them something about where AI was heading — and they decided to build a lab that took safety seriously at that scale.

Read It

Paper: arxiv.org/abs/2001.08361
Published January 2020
22 pages. Readable. The graphs tell the story.

Go Deeper

Jared Kaplan — Lead author, the physicist who quantified scaling
Anthropic — Founded by several of this paper’s authors
OpenAI — Where the research was conducted
Training & Fine-Tuning — The process these laws describe
How LLMs Work — The systems being scaled
NVIDIA — The hardware that makes scaling possible
Attention Is All You Need — The architecture being scaled