RESEARCH

Attention Is All You Need

Created 2 May 2025

paper, transformers, attention, foundational

Attention Is All You Need (2017)

Source

Paper: https://arxiv.org/abs/1706.03762
Authors: Ashish Vaswani et al. (Google Brain, Google Research, University of Toronto)
Published: June 2017 on arXiv; presented at NIPS 2017 (the conference now called NeurIPS)
Citations: 100,000+ (one of the most cited CS papers ever)

Key Takeaways

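Introduced the Transformer: a new architecture based entirely on attention mechanisms, with no recurrence (RNNs) and no convolutions
Self-Attention: each position attends to every position in the sequence (itself included), so any two tokens are connected in a single step, which makes long-range dependencies easier to learn
Parallelisable: unlike RNNs, all positions in a layer can be computed simultaneously during training → massive speed gains on GPUs (decoding at inference time is still token by token)
Multi-Head Attention: several attention “heads” run in parallel on different learned projections, so each can learn a different relationship type (the base model uses 8 heads)
Positional Encoding: attention by itself ignores token order, so position must be encoded explicitly; the paper adds fixed sinusoidal functions of position to the embeddings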
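
To make the two central mechanisms above concrete, here is a minimal sketch of scaled dot-product self-attention and sinusoidal positional encoding, following the formulas in the paper. It uses plain Python lists for readability; real implementations use a tensor library. The function names and toy numbers are my own, and the learned projections W_Q, W_K, W_V and the multi-head split are left out on purpose.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d_model)); PE[pos][2i+1] = cos of the same angle."""
    pe = []
    for pos in range(seq_len):
        row = []
        for j in range(d_model):
            two_i = j - j % 2  # even and odd columns share one frequency
            angle = pos / (10000 ** (two_i / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

# Toy input: 3 token embeddings of width 4 (made-up numbers).
embeddings = [[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 2.0],
              [1.0, 1.0, 1.0, 1.0]]
pe = sinusoidal_positional_encoding(len(embeddings), 4)
x = [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(embeddings, pe)]

# Self-attention: queries, keys, and values all come from the same sequence.
# (Assumption: identity projections stand in for the learned W_Q, W_K, W_V.)
attended = scaled_dot_product_attention(x, x, x)
```

Each row of attended is a mixture of all three input rows, which is the "every position sees every position" property in miniature.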

Architecture Summary

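Input → Embedding + Positional Encoding → Encoder (×6) → Decoder (×6) → Linear + Softmax → Output
(the decoder also takes the previously generated tokens, embedded and position-encoded, as its own input)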

Encoder block:
  → Multi-Head Self-Attention
  → Add & Norm
  → Feed-Forward Network
  → Add & Norm

Decoder block:
  → Masked Multi-Head Self-Attention (can't peek ahead)
  → Add & Norm
  → Cross-Attention (attends to encoder output)
  → Add & Norm
  → Feed-Forward Network
  → Add & Norm
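
Two pieces of the blocks above are easy to misread, so here is a small sketch of each: the "Add & Norm" step, which the paper defines as LayerNorm(x + Sublayer(x)), and the causal mask that stops the decoder from peeking ahead. The function names are hypothetical, and the layer norm omits the learned gain and bias for brevity.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalise one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer norm, applied to each position."""
    return [layer_norm([a + b for a, b in zip(x_row, s_row)])
            for x_row, s_row in zip(x, sublayer_out)]

def causal_attention_weights(scores):
    """Row-wise softmax in which position i may only attend to positions 0..i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # drop the future positions
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions get weight 0, equivalent to adding -inf before the softmax.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# With equal raw scores, each row spreads its weight evenly over the visible prefix:
# row 0 -> [1.0, 0.0, 0.0], row 1 -> [0.5, 0.5, 0.0], row 2 -> one third each.
scores = [[0.0] * 3 for _ in range(3)]
weights = causal_attention_weights(scores)

# Add & Norm on a single position: the residual sum [1.5, 2.5, 3.5] is normalised.
normed = add_and_norm([[1.0, 2.0, 3.0]], [[0.5, 0.5, 0.5]])
```

The lower-triangular pattern in weights is what lets the decoder train on a whole target sentence at once while still predicting each token only from earlier ones.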

Why It Matters

This paper is the foundation of most of today's large-scale AI models:

GPT = decoder-only Transformer
BERT = encoder-only Transformer
T5 = full encoder-decoder Transformer
Vision Transformer (ViT) = attention applied to image patches
Nearly all modern LLMs are Transformers or close variants

Raw Notes

Original task was machine translation (English → German, English → French)
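Achieved state-of-the-art BLEU (28.4 on WMT 2014 English → German, 41.8 on English → French) with significantly less training time than earlier models (3.5 days on eight GPUs for the big model)
Key insight: a self-attention layer needs only O(1) sequential operations to connect all positions (vs O(n) for a recurrent layer), at the cost of O(n²) pairwise comparisons per layer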
The title is a statement of confidence — “you don’t need anything else”
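
The sequential-operations point can be illustrated with a toy comparison. This is a sketch under simplifying assumptions: scalar "tokens", a hypothetical one-weight recurrence, and a stripped-down scalar attention. In the recurrence, step t has to wait for step t-1; in attention, each output reads only the inputs, so the outputs can be computed in any order or all at once.

```python
import math

def recurrent_states(xs, w=0.5):
    """Toy RNN: h_t = tanh(w * h_{t-1} + x_t). Each step depends on the previous one."""
    h, states = 0.0, []
    for x in xs:  # n dependent steps: O(n) sequential operations
        h = math.tanh(w * h + x)
        states.append(h)
    return states

def attention_output(i, xs):
    """Toy scalar self-attention for position i. Reads the inputs only, never other outputs."""
    scores = [xs[i] * xj for xj in xs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * xj for e, xj in zip(exps, xs))

xs = [0.1, -0.4, 0.7, 0.2]
states = recurrent_states(xs)

# Compute the attention outputs front to back, then back to front.
in_order = [attention_output(i, xs) for i in range(len(xs))]
reverse_order = {i: attention_output(i, xs) for i in reversed(range(len(xs)))}

# Same results either way, because no output depends on another output.
order_independent = in_order == [reverse_order[i] for i in range(len(xs))]

# The price of that independence: one score per pair of positions.
pair_scores = len(xs) ** 2
```

On parallel hardware the independent outputs map onto one large matrix multiplication, which is where the training speed-up comes from; the quadratic number of pair scores is why long contexts get expensive.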

Questions / Follow-up

How do modern architectures differ from the original? (no encoder for GPT, longer contexts, RoPE instead of sinusoidal)
What is FlashAttention and how does it make attention more efficient?
Explore the “attention sink” phenomenon