Attention Is All You Need

Created 2 May 2025
Tags: paper, transformers, attention, foundational

Attention Is All You Need (2017)

Source

  • Paper: https://arxiv.org/abs/1706.03762
  • Authors: Ashish Vaswani et al. (Google Brain & Google Research)
  • Published: June 2017 on arXiv; presented at NIPS 2017 (the conference now called NeurIPS)
  • Citations: 100,000+ (one of the most cited CS papers ever)

Key Takeaways

  1. Introduced the Transformer: a new architecture based entirely on attention mechanisms, eliminating recurrence (RNNs) and convolutions
  2. Self-Attention: each position attends to every position in the sequence (including itself), so any two tokens are connected in a single step, which makes long-range dependencies easier to learn
  3. Parallelisable: unlike RNNs, all positions can be computed simultaneously during training, which allows much better use of GPUs (decoding at inference time is still one token at a time)
  4. Multi-Head Attention: multiple attention “heads” run in parallel (8 in the base model), each free to learn a different relationship type
  5. Positional Encoding: since there’s no sequential processing, position must be encoded explicitly (the paper uses fixed sinusoidal functions)
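
To make takeaways 2 and 5 concrete, here is a minimal sketch of scaled dot-product attention and the sinusoidal positional encoding in plain Python. It is written for readability, not speed. The function names are my own, and the usage example feeds the raw inputs in directly as queries, keys and values, whereas a real Transformer would first apply learned projection matrices.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def positional_encoding(pos, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    encoding = []
    for k in range(d_model):
        # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
        angle = pos / (10000 ** ((k - k % 2) / d_model))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding

# Self-attention: the same sequence supplies queries, keys and values.
# Assumption: identity projections, purely for illustration.
tokens = [[1.0, 0.0], [0.0, 1.0]]
mixed = attention(tokens, tokens, tokens)
```

Each row of `mixed` is a weighted blend of all input rows, which is the sense in which every position "attends to" every other.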

Architecture Summary

Input → Embedding + Positional Encoding → Encoder (×6) → Decoder (×6) → Output

Encoder block:
  → Multi-Head Self-Attention
  → Add & Norm
  → Feed-Forward Network
  → Add & Norm
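
The encoder block above can be sketched as two sublayers, each wrapped in a residual connection followed by layer normalisation (the paper's LayerNorm(x + Sublayer(x))). This is a rough illustration: the function names are hypothetical, the learned gain and bias of layer norm are omitted, and the attention and feed-forward sublayers are passed in as plain callables rather than implemented.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalise one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer norm: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

def encoder_block(xs, self_attention, feed_forward):
    """xs is a list of vectors, one per position.

    self_attention: callable mapping the whole sequence to a new sequence.
    feed_forward:   callable mapping a single position's vector to a new vector.
    """
    attended = self_attention(xs)
    xs = [add_and_norm(x, a) for x, a in zip(xs, attended)]
    # The feed-forward network is applied to each position independently.
    return [add_and_norm(x, feed_forward(x)) for x in xs]

# Usage with stand-in sublayers (identity functions, for illustration only).
sequence = [[1.0, 2.0, 3.0], [4.0, 0.0, 2.0]]
encoded = encoder_block(sequence, lambda seq: seq, lambda vec: vec)
```

Stacking six such blocks, each with its own weights, gives the encoder in the summary line.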

Decoder block:
  → Masked Multi-Head Self-Attention (can't peek ahead)
  → Add & Norm
  → Cross-Attention (attends to encoder output)
  → Add & Norm
  → Feed-Forward Network
  → Add & Norm
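
The "can't peek ahead" rule in the decoder is usually implemented as a causal mask applied to the attention scores before the softmax. Below is a small sketch, with hypothetical function names, in which disallowed positions are set to negative infinity so that they receive zero weight.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores; disallowed positions get weight 0."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because position 0 is always allowed
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
# Position 1 may only look at positions 0 and 1, whatever the score of position 2.
weights = masked_softmax([0.0, 0.0, 5.0], mask[1])
```

Because the mask hides all future positions, the decoder can still be trained on every position of the target sequence in parallel without leaking the answer.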

Why It Matters

This paper is the foundation of most modern language models and, increasingly, of models in vision and other areas:

  • GPT = decoder-only Transformer
  • BERT = encoder-only Transformer
  • T5 = full encoder-decoder Transformer
  • Vision Transformer (ViT) = attention applied to image patches
  • Nearly all modern LLMs are Transformer-based

Raw Notes

  • Original task was machine translation (English → German, English → French)
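  • Achieved state-of-the-art BLEU on WMT 2014 (28.4 for English → German, 41.8 for English → French) with significantly less training time than earlier models
  • Key insight: a self-attention layer needs O(1) sequential operations (vs O(n) for an RNN over n tokens), at the cost of comparing every pair of positions, i.e. O(n²) work per layer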
  • The title is a statement of confidence — “you don’t need anything else”
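
The note above about sequential operations can be illustrated with a toy count of dependency depth. This is only a sketch, it assumes n ≥ 1, and the function names are made up. In a recurrent layer each hidden state waits for the previous one. In a self-attention layer every output depends only on the layer's inputs, so all positions can be computed in one parallel step, at the price of n × n pairwise scores.

```python
def rnn_sequential_steps(n):
    """Recurrent layer: h_t depends on h_{t-1}, so the steps form a chain."""
    depth = [0] * n
    for t in range(n):
        previous = depth[t - 1] if t > 0 else 0
        depth[t] = previous + 1  # must wait for the previous hidden state
    return max(depth)

def attention_sequential_steps(n):
    """Self-attention layer: each output is one step away from the inputs."""
    depth = [1] * n  # no position waits on another position's output
    return max(depth)

def attention_pairwise_scores(n):
    """Total work is not free: every position is scored against every position."""
    return n * n

n = 8
chain = rnn_sequential_steps(n)           # grows linearly with n
parallel = attention_sequential_steps(n)  # stays constant
```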

Questions / Follow-up

  • How do modern architectures differ from the original? (no encoder for GPT, longer contexts, RoPE instead of sinusoidal)
  • What is FlashAttention and how does it make attention more efficient?
  • Explore the “attention sink” phenomenon