RAG & Retrieval
Here’s the problem with AI models: they can only know what they were trained on. Ask Claude about your company’s internal docs and it has no idea. Ask about something that happened last week and it might hallucinate an answer that sounds confident but is completely wrong.
RAG fixes this. Retrieval-Augmented Generation gives a model access to your knowledge — documents, databases, wikis, whatever — by finding relevant information and injecting it into the prompt before the model responds.
It’s the difference between asking someone to answer from memory vs giving them the reference book first.
How It Works
The whole thing is surprisingly simple in concept:
1. INGEST → Turn your documents into searchable vectors
2. QUERY → User asks a question
3. RETRIEVE → Find the most relevant chunks
4. AUGMENT → Stuff them into the prompt as context
5. GENERATE → Model answers, grounded in real information
That’s it. The magic is in the details of each step.
Step 1: Ingestion
You take your documents and prepare them for search:
- Chunk: Split into digestible pieces (256–1024 tokens typically). Too small and you lose context. Too big and you dilute relevance. This is more art than science.
- Embed: Run each chunk through an embedding model to get a vector, a numerical fingerprint of its meaning.
- Store: Put those vectors in a vector database where you can search by similarity.
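As a rough illustration of the chunking step, here is a minimal Python sketch. It approximates tokens with whitespace-separated words and overlaps neighbouring chunks; both choices, along with the `chunk_words` name and its default sizes, are assumptions made for this example and not something the steps above prescribe. A real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of whitespace-separated words.

    Words are only a stand-in for tokens here (an assumption of this sketch).
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Tiny example: ten placeholder words, windows of 4 with 1 word of overlap.
sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, chunk_size=4, overlap=1)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.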
Step 2: Retrieval
When someone asks a question:
- Embed the question (same model, same space)
- Search for the nearest vectors — chunks whose meaning is closest to the question
- Optionally re-rank results with a more expensive model for precision
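The embed-then-search loop can be sketched without any external services. In the sketch below, `toy_embed` is a hypothetical stand-in for a real embedding model (it just counts words), and a plain Python list stands in for the vector database; only the cosine-similarity ranking is meant to carry over to a real setup.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, index: list[tuple[str, Counter]], k: int = 3) -> list[tuple[float, str]]:
    """Embed the question with the same function as the chunks, then rank by similarity."""
    q = toy_embed(question)
    scored = [(cosine(q, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical chunks, embedded once at ingestion time.
chunks = [
    "Refunds are processed within five business days.",
    "Our office is closed on public holidays.",
    "To request a refund, email support with your order number.",
]
index = [(c, toy_embed(c)) for c in chunks]
top = retrieve("How do I request a refund?", index, k=2)
```

Note that the word-count vector scores the first chunk at zero because "refunds" and "refund" are different strings. A real embedding model compares meaning, not exact words, which is the reason to use one.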
Step 3: Generation
Take the best chunks, put them in the prompt (“Here is relevant context: …”), and let the LLM answer based on that context. Now it’s grounded. It can cite sources. It’s much less likely to hallucinate.
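One way the augmented prompt might be assembled is sketched below. The `build_prompt` function, the numbered `[1]` citation markers, the instruction wording, and the `refund-policy.md` source label are all illustrative assumptions; the resulting string would then be sent to whichever LLM you use.

```python
def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble an augmented prompt from (source_label, chunk_text) pairs."""
    # Number each chunk so the model has something concrete to cite.
    context = "\n\n".join(
        f"[{i}] ({source}) {text}"
        for i, (source, text) in enumerate(chunks, start=1)
    )
    return (
        "Here is relevant context:\n\n"
        f"{context}\n\n"
        "Answer the question using only the context above. "
        "Cite the bracketed numbers you relied on, and say so if the context "
        "does not contain the answer.\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "How long do refunds take?",
    [("refund-policy.md", "Refunds are processed within five business days.")],
)
```

Keeping the source label next to each chunk is what makes "According to ..." style citations possible in the answer.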
Why RAG Matters
It solves the biggest practical problems with LLMs:
| Problem | How RAG helps |
|---|---|
| Hallucination | Answers grounded in real documents |
| Stale knowledge | Access info newer than training data |
| Private data | Use your own docs without fine-tuning |
| Verifiability | Can cite sources (“According to page 3…”) |
| Cost | Usually far cheaper than fine-tuning when the goal is adding knowledge |
If you’re building anything where accuracy matters — customer support, legal research, internal tools, knowledge bases — you probably want RAG.
The Stack
You don’t need all of these, but this is the landscape:
| Layer | Options |
|---|---|
| Embedding model | OpenAI text-embedding-3 (small or large), BGE, Nomic, Cohere |
| Vector database | Pinecone, Weaviate, ChromaDB, pgvector, Qdrant |
| Chunking | LangChain splitters, LlamaIndex parsers, custom |
| Orchestration | LangChain, LlamaIndex, Vercel AI SDK, custom |
| LLM | Claude, GPT-4, LLaMA, Mistral |
For getting started, ChromaDB + an open embedding model + Claude is the simplest path that actually works well.
Advanced Patterns
Once basic RAG works, you’ll hit its limitations. These patterns address them:
| Pattern | What it solves |
|---|---|
| Hybrid search | Combine vectors + keyword search — catches what vectors miss |
| Re-ranking | Use a cross-encoder to re-score after initial retrieval |
| Multi-query | Rephrase the question 3 ways, retrieve for each, merge results |
| HyDE | Generate a hypothetical answer first, use that as the search query |
| Graph RAG | Combine vector search with a knowledge graph for relationships |
| Self-RAG | Model decides whether to retrieve (not always needed) |
I find hybrid search (vectors + BM25 keyword matching) gives the biggest improvement for the least complexity. Pure vector search sometimes misses obvious keyword matches.
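The text above does not say how the two result lists get combined, so the sketch below picks one common option, reciprocal rank fusion (RRF), as an assumption. It takes ranked lists of document ids (here hypothetical ids from a vector search and a BM25 search) and rewards documents that rank well in either list. The constant `k = 60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of ids into one, best first.

    Each list contributes 1 / (k + rank) to a document's score, with rank
    starting at 1, so only positions matter and raw scores never need to be
    on the same scale.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search ranking
keyword_hits = ["doc_d", "doc_b", "doc_a"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents found by both searches (`doc_a`, `doc_b`) rise to the top, while `doc_d`, which only keyword matching found, still makes the list. The same function can merge the per-query result lists in the multi-query pattern.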
What I’m Still Learning
- Optimal chunk sizing for different document types (code vs prose vs legal text)
- When to use RAG vs fine-tuning vs just a longer context window
- How evaluation works — measuring RAG quality is surprisingly hard
- Graph RAG — combining structured knowledge with vector retrieval
Go Deeper
- Embeddings — The foundation that makes vector search possible
- AI Agents — Agents often use RAG as one of their tools
- Tools & Frameworks — Practical libraries for building RAG pipelines
- Prompt Engineering — How you structure the augmented prompt matters enormously
- How LLMs Work — Understanding context windows and token limits
Best Resources
- LlamaIndex documentation — Most comprehensive RAG framework
- LangChain RAG tutorials — Good for learning the patterns
- Pinecone Learning Centre — Practical vector search guides
- “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” — The original 2020 paper that named the pattern