Attention Is All You Need
Created 2 May 2025
paper, transformers, attention, foundational
Attention Is All You Need (2017)
Source
- Paper: https://arxiv.org/abs/1706.03762
- Authors: Ashish Vaswani et al. (Google Brain & Google Research)
- Published: June 2017 (arXiv preprint); presented at NeurIPS 2017 (then called NIPS) in December 2017
- Citations: 100,000+ (one of the most cited CS papers ever)
Key Takeaways
- Introduced the Transformer: a new architecture based entirely on attention mechanisms, with no recurrence (RNNs) or convolutions
- Self-Attention: each position attends to every position in the sequence (including itself), so any two tokens are connected within a single layer regardless of distance
- Parallelisable: unlike RNNs, all positions can be computed simultaneously during training, which maps well onto GPUs
- Multi-Head Attention: multiple attention “heads” (8 in the base model) run in parallel, so each can learn a different kind of relationship
- Positional Encoding: since there’s no sequential processing, token order must be injected explicitly; the paper adds fixed sinusoidal encodings to the input embeddings
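To make the self-attention bullet concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification, not the paper's full layer: it uses a single head, skips the learned query/key/value projections (the raw token vectors play all three roles), and the input numbers are made up. The names `self_attention` and `causal` are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of equal-length vectors.

    With causal=True, position i only sees positions 0..i, which has the
    same effect as the "can't peek ahead" mask used in the decoder.
    """
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Positions this query is allowed to look at.
        visible = range(i + 1) if causal else range(len(keys))
        # Dot product of the query with each visible key, scaled by sqrt(d_k).
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in visible
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible values.
        out = [
            sum(w * values[j][c] for w, j in zip(weights, visible))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

# Toy sequence of three 2-dimensional token vectors (assumed values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full = self_attention(x, x, x)                   # every position sees all three
masked = self_attention(x, x, x, causal=True)    # position 0 sees only itself
```

Multi-head attention would run several copies of this on different learned projections of the input and concatenate the results.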
Architecture Summary
Input → Embedding + Positional Encoding → Encoder (×6) → Decoder (×6) → Output
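The "Embedding + Positional Encoding" step can be sketched as follows, assuming the paper's sinusoidal scheme (sine on even dimensions, cosine on odd dimensions, wavelengths scaled by 10000). The placeholder embeddings and the small sizes are assumptions chosen for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(seq_len=4, d_model=8)

# The encoding is added element-wise to the token embeddings.
embeddings = [[0.1] * 8 for _ in range(4)]  # placeholder embeddings (assumed)
inputs = [[e + p for e, p in zip(emb, enc)] for emb, enc in zip(embeddings, pe)]
```

Because the table depends only on position and dimension, it can be computed once and reused for every sequence.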
Encoder block:
→ Multi-Head Self-Attention
→ Add & Norm
→ Feed-Forward Network
→ Add & Norm
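Each "Add & Norm" step above is a residual connection followed by layer normalisation, i.e. LayerNorm(x + Sublayer(x)). A rough sketch of how the encoder block chains the two sub-layers is below. The sub-layers are hypothetical placeholders (an identity function in place of multi-head attention, a bare ReLU in place of the feed-forward network), and the layer norm omits the learned gain and bias.

```python
import math

def layer_norm(vector, eps=1e-6):
    """Normalise one vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]

def add_and_norm(x, sublayer):
    """Residual step LayerNorm(x + Sublayer(x)), applied per position."""
    y = sublayer(x)
    return [layer_norm([a + b for a, b in zip(xi, yi)]) for xi, yi in zip(x, y)]

def placeholder_attention(x):
    """Stand-in for multi-head self-attention (assumption: identity)."""
    return x

def placeholder_feed_forward(x):
    """Stand-in for the position-wise network (assumption: ReLU only)."""
    return [[max(0.0, v) for v in row] for row in x]

def encoder_block(x):
    x = add_and_norm(x, placeholder_attention)      # attention sub-layer
    return add_and_norm(x, placeholder_feed_forward)  # feed-forward sub-layer

out = encoder_block([[1.0, -2.0, 3.0, 0.5], [0.0, 4.0, -1.0, 2.0]])
```

The residual path lets each sub-layer learn a correction to its input instead of a full replacement, which helps when six such blocks are stacked.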
Decoder block:
→ Masked Multi-Head Self-Attention (can't peek ahead)
→ Add & Norm
→ Cross-Attention (attends to encoder output)
→ Add & Norm
→ Feed-Forward Network
→ Add & Norm
Why It Matters
This paper is the foundation of most mainstream model families in modern AI:
- GPT = decoder-only Transformer
- BERT = encoder-only Transformer
- T5 = full encoder-decoder Transformer
- Vision Transformer (ViT) = attention applied to image patches
- Nearly all modern LLMs are Transformers
Raw Notes
- Original task was machine translation (English → German, English → French)
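- Achieved state-of-the-art BLEU (28.4 on WMT 2014 English → German) with significantly less training time than earlier models; the big model trained in 3.5 days on eight GPUs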
- Key insight: a self-attention layer needs only O(1) sequential operations (vs O(n) for a recurrent layer), at the cost of O(n²) work per layer in sequence length n
- The title is a statement of confidence — “you don’t need anything else”
Questions / Follow-up
- How do modern architectures differ from the original? (no encoder for GPT, longer contexts, RoPE instead of sinusoidal)
- What is FlashAttention and how does it make attention more efficient?
- Explore the “attention sink” phenomenon