RAG & Retrieval

Created 2 May 2025

Tags: learning, rag, retrieval, vector-search, knowledge

Here’s the problem with AI models: they can only know what they were trained on. Ask Claude about your company’s internal docs and it has no idea. Ask about something that happened last week and it might hallucinate an answer that sounds confident but is completely wrong.

RAG fixes this. Retrieval-Augmented Generation gives a model access to your knowledge — documents, databases, wikis, whatever — by finding relevant information and injecting it into the prompt before the model responds.

It’s the difference between asking someone to answer from memory vs giving them the reference book first.

How It Works

The whole thing is surprisingly simple in concept:

1. INGEST   → Turn your documents into searchable vectors
2. QUERY    → User asks a question
3. RETRIEVE → Find the most relevant chunks
4. AUGMENT  → Stuff them into the prompt as context
5. GENERATE → Model answers, grounded in real information

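That’s it. The magic is in the details of each step, which the sections below group into three stages: ingestion (1), retrieval (2 and 3), and generation (4 and 5).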

Step 1: Ingestion

You take your documents and prepare them for search:

Chunk — Split into digestible pieces (256–1024 tokens typically). Too small and you lose context. Too big and you dilute relevance. This is more art than science.

Embed — Run each chunk through an embedding model to get a vector — a numerical fingerprint of its meaning.

Store — Put those vectors in a vector database where you can search by similarity.
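The three ingestion steps can be sketched in a few lines. This is a toy under loud assumptions: `chunk_words` splits on whitespace words rather than real tokens, `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, and the "vector database" is just a Python list. All of these names, and the overlap between chunks, are hypothetical choices made for illustration.

```python
import math
import zlib

def chunk_words(text, size=50, overlap=10):
    """Split text into overlapping chunks of `size` words (words stand in for tokens)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

def toy_embed(text, dims=64):
    """Hashed bag-of-words vector, L2-normalised. NOT a real embedding model."""
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = zlib.crc32(word.encode("utf-8")) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def ingest(doc_id, text, store, size=50, overlap=10):
    """Chunk a document, embed each chunk, and append the records to `store`."""
    for i, chunk in enumerate(chunk_words(text, size, overlap)):
        store.append({"id": f"{doc_id}#{i}", "text": chunk, "vector": toy_embed(chunk)})

store = []  # stand-in for a vector database
ingest("handbook", "word " * 120, store)
print(len(store), store[0]["id"])  # prints: 3 handbook#0
```

Swapping `toy_embed` for a real embedding model and `store` for a real vector database would keep the same shape: chunk, embed, store, with an id that points back to the source document.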

Step 2: Retrieval

When someone asks a question:

Embed the question with the same embedding model used at ingestion, so it lands in the same vector space as the chunks
Search for the nearest vectors — chunks whose meaning is closest to the question
Optionally re-rank results with a more expensive model for precision
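A minimal sketch of the nearest-vector search, assuming the chunks are already embedded. The vectors are hand-written 3-dimensional toys, and `cosine` and `top_k` are hypothetical helper names. A real vector database would typically use an approximate nearest-neighbour index rather than scanning every record.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, records, k=2):
    """Brute-force search: score every record, return the k most similar."""
    scored = [(cosine(query_vec, r["vector"]), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Made-up records; in practice these come out of the ingestion step.
records = [
    {"id": "refunds", "vector": [0.9, 0.1, 0.0]},
    {"id": "shipping", "vector": [0.1, 0.9, 0.0]},
    {"id": "careers", "vector": [0.0, 0.1, 0.9]},
]
query_vec = [0.8, 0.2, 0.0]  # pretend this is the embedded question

for score, record in top_k(query_vec, records):
    print(f"{record['id']}: {score:.3f}")
```

The optional re-ranking step would then re-score only these k candidates with a slower, more precise model.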

Step 3: Generation

Take the best chunks, put them in the prompt (“Here is relevant context: …”), and let the LLM answer based on that context. Now it’s grounded. It can cite sources. It’s much less likely to hallucinate.
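The augmentation step is mostly string assembly. Here is one possible way to do it, assuming each retrieved chunk carries a `source` label. The `build_prompt` helper, the prompt wording and the sample chunks are all made up for illustration.

```python
def build_prompt(question, chunks):
    """Number each chunk so the model can cite it as [1], [2], ..."""
    context = "\n\n".join(
        f"[{i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources as [n]. If the context is not enough, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical retrieved chunks.
chunks = [
    {"source": "refund-policy.md", "text": "Refunds are accepted within 30 days."},
    {"source": "shipping.md", "text": "Orders ship within 2 business days."},
]
prompt = build_prompt("How long do I have to request a refund?", chunks)
print(prompt)
```

Numbering the chunks is what makes citations possible later: the model can answer "within 30 days [1]" and you can map [1] back to the source file.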

Why RAG Matters

It solves the biggest practical problems with LLMs:

Problem	How RAG helps
Hallucination	Answers grounded in real documents
Stale knowledge	Access info newer than training data
Private data	Use your own docs without fine-tuning
Verifiability	Can cite sources (“According to page 3…”)
Cost	Usually much cheaper than fine-tuning for adding knowledge

If you’re building anything where accuracy matters — customer support, legal research, internal tools, knowledge bases — you probably want RAG.

The Stack

You don’t need all of these, but this is the landscape:

Layer	Options
Embedding model	OpenAI text-embedding-3 (small/large), BGE, Nomic, Cohere
Vector database	Pinecone, Weaviate, ChromaDB, pgvector, Qdrant
Chunking	LangChain splitters, LlamaIndex parsers, custom
Orchestration	LangChain, LlamaIndex, Vercel AI SDK, custom
LLM	Claude, GPT-4, LLaMA, Mistral

For getting started, ChromaDB + an open embedding model + Claude is the simplest path that actually works well.

Advanced Patterns

Once basic RAG works, you’ll hit its limitations. These patterns address them:

Pattern	What it solves
Hybrid search	Combine vectors + keyword search — catches what vectors miss
Re-ranking	Use a cross-encoder to re-score after initial retrieval
Multi-query	Rephrase the question 3 ways, retrieve for each, merge results
HyDE	Generate a hypothetical answer first, use that as the search query
Graph RAG	Combine vector search with a knowledge graph for relationships
Self-RAG	Model decides whether to retrieve at all (not every question needs retrieval)

I find hybrid search (vectors + BM25 keyword matching) gives the biggest improvement for the least complexity. Pure vector search sometimes misses obvious keyword matches.
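One common way to combine the two result lists is reciprocal rank fusion (RRF), which merges rankings without needing the vector scores and BM25 scores to be comparable. The sketch below assumes you already have ranked chunk ids from each search. The ids are made up, and `k=60` is just the constant commonly used with RRF, not a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

vector_hits = ["chunk-a", "chunk-b", "chunk-c"]   # hypothetical vector search results
keyword_hits = ["chunk-d", "chunk-a", "chunk-b"]  # hypothetical BM25 results

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # prints: ['chunk-a', 'chunk-b', 'chunk-d', 'chunk-c']
```

Note that chunk-d never showed up in the vector results but still makes the merged list, which is exactly the "obvious keyword match" case.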

What I’m Still Learning

Optimal chunk sizing for different document types (code vs prose vs legal text)
When to use RAG vs fine-tuning vs just a longer context window
How evaluation works — measuring RAG quality is surprisingly hard
Graph RAG — combining structured knowledge with vector retrieval
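On the evaluation point, one simple place to start is to measure retrieval on its own, separately from generation. A minimal sketch of recall@k, assuming a small hand-labelled set of questions with the chunk ids that should come back for each (every id below is made up):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    top = set(retrieved[:k])
    return len(top & set(relevant)) / len(relevant)

# question -> (ids the retriever returned, in order; ids a human marked relevant)
labelled = {
    "refund window?": (["c1", "c7", "c3"], ["c1", "c3"]),
    "shipping time?": (["c9", "c2", "c4"], ["c4", "c5"]),
}

for k in (1, 3):
    scores = [recall_at_k(got, want, k) for got, want in labelled.values()]
    print(f"recall@{k} = {sum(scores) / len(scores):.2f}")
# prints:
# recall@1 = 0.25
# recall@3 = 0.75
```

This only tells you whether the right chunks were found, not whether the final answer was good, so it covers just part of the evaluation problem.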

Go Deeper

Embeddings — The foundation that makes vector search possible
AI Agents — Agents often use RAG as one of their tools
Tools & Frameworks — Practical libraries for building RAG pipelines
Prompt Engineering — How you structure the augmented prompt matters enormously
How LLMs Work — Understanding context windows and token limits

Best Resources

LlamaIndex documentation — Most comprehensive RAG framework
LangChain RAG tutorials — Good for learning the patterns
Pinecone Learning Centre — Practical vector search guides
“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” — The original 2020 paper that named the pattern