RAG & Retrieval

Created 2 May 2025
Tags: learning, rag, retrieval, vector-search, knowledge

Here’s the problem with AI models: they can only know what they were trained on. Ask Claude about your company’s internal docs and it has no idea. Ask about something that happened last week and it might hallucinate an answer that sounds confident but is completely wrong.

RAG fixes this. Retrieval-Augmented Generation gives a model access to your knowledge — documents, databases, wikis, whatever — by finding relevant information and injecting it into the prompt before the model responds.

It’s the difference between asking someone to answer from memory vs giving them the reference book first.


How It Works

The whole thing is surprisingly simple in concept:

1. INGEST   → Turn your documents into searchable vectors
2. QUERY    → User asks a question
3. RETRIEVE → Find the most relevant chunks
4. AUGMENT  → Stuff them into the prompt as context
5. GENERATE → Model answers, grounded in real information

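That’s it. The magic is in the details of each step. The sections below group these five steps into three stages: ingestion, retrieval (query + retrieve), and generation (augment + generate).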

Step 1: Ingestion

You take your documents and prepare them for search:

Chunk — Split into digestible pieces (256–1024 tokens typically). Too small and you lose context. Too big and you dilute relevance. This is more art than science.

Embed — Run each chunk through an embedding model to get a vector — a numerical fingerprint of its meaning.

Store — Put those vectors in a vector database where you can search by similarity.
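The three ingestion steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production pipeline: `chunk_words` splits on words as a rough proxy for tokens, `toy_embed` is a hypothetical stand-in for a real embedding model (it just hashes words into buckets), and the "vector database" is an ordinary list. The function names, the `Record` type, and the default sizes are all invented for this example.

```python
import hashlib
import math
from dataclasses import dataclass


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of words (a rough proxy for tokens)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class Record:
    doc_id: str
    text: str
    vector: list[float]


def ingest(doc_id: str, text: str, store: list[Record], **chunk_args) -> None:
    """Chunk one document, embed each chunk, and append the results to the store."""
    for chunk in chunk_words(text, **chunk_args):
        store.append(Record(doc_id, chunk, toy_embed(chunk)))
```

The overlap is there so a sentence that straddles a chunk boundary still appears whole in at least one chunk. In a real setup you would swap `toy_embed` for an actual embedding model and the list for a vector database; the shape of the loop stays the same.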

Step 2: Retrieval

When someone asks a question:

  1. Embed the question with the same model used at ingestion, so it lands in the same vector space as the chunks
  2. Search for the nearest vectors — chunks whose meaning is closest to the question
  3. Optionally re-rank results with a more expensive model for precision
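The nearest-vector search in step 2 can be sketched as a brute-force cosine similarity scan. This is an illustration only: the 3-dimensional vectors are hand-written stand-ins for real embeddings, `cosine` and `top_k` are names made up for this example, and a real vector database would typically use an approximate nearest-neighbour index rather than scoring every chunk. The optional re-ranking step is not shown.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float],
    index: list[tuple[str, list[float]]],
    k: int = 3,
) -> list[tuple[float, str]]:
    """Return the k (score, chunk_text) pairs most similar to the query vector."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-written vectors standing in for embedded chunks (assumption for the demo).
index = [
    ("Refunds are processed within 5 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on public holidays.", [0.0, 0.2, 0.9]),
    ("To request a refund, email support.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend this is the embedded question about refunds
results = top_k(query_vec, index, k=2)
```

With these made-up vectors, the two refund chunks score highest and the holiday chunk is left out, which is the behaviour you want from the retrieval step.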

Step 3: Generation

Take the best chunks, put them in the prompt (“Here is relevant context: …”), and let the LLM answer based on that context. Now it’s grounded. It can cite sources. It’s much less likely to hallucinate.
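As a rough sketch of the augment step, here is one way to assemble the prompt from retrieved chunks. The wording of the instructions, the numbered `[1]` citation format, and the `build_prompt` name are assumptions made for illustration; the actual call to the LLM is left out because it depends on which provider you use.

```python
def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Format retrieved (source, text) chunks and the question into one prompt."""
    context = "\n\n".join(
        f"[{i}] (source: {source})\n{text}"
        for i, (source, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "How long do refunds take?",
    [
        ("handbook.pdf", "Refunds are processed within 5 days."),
        ("faq.md", "To request a refund, email support."),
    ],
)
```

Numbering the chunks and keeping the source next to each one is what lets the model cite where an answer came from. Telling it to admit when the context has no answer is a common guard against falling back on a guess.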


Why RAG Matters

It solves the biggest practical problems with LLMs:

Problem          How RAG helps
Hallucination    Answers grounded in real documents
Stale knowledge  Access info newer than training data
Private data     Use your own docs without fine-tuning
Verifiability    Can cite sources (“According to page 3…”)
Cost             Way cheaper than fine-tuning for adding knowledge

If you’re building anything where accuracy matters — customer support, legal research, internal tools, knowledge bases — you probably want RAG.


The Stack

You don’t need all of these, but this is the landscape:

Layer            Options
Embedding model  OpenAI text-embedding-3, BGE, Nomic, Cohere
Vector database  Pinecone, Weaviate, ChromaDB, pgvector, Qdrant
Chunking         LangChain splitters, LlamaIndex parsers, custom
Orchestration    LangChain, LlamaIndex, Vercel AI SDK, custom
LLM              Claude, GPT-4, LLaMA, Mistral

For getting started, ChromaDB + an open embedding model + Claude is the simplest path that actually works well.


Advanced Patterns

Once basic RAG works, you’ll hit its limitations. These patterns address them:

Pattern        What it solves
Hybrid search  Combine vectors + keyword search; catches what vectors miss
Re-ranking     Use a cross-encoder to re-score after initial retrieval
Multi-query    Rephrase the question 3 ways, retrieve for each, merge results
HyDE           Generate a hypothetical answer first, use that as the search query
Graph RAG      Combine vector search with a knowledge graph for relationships
Self-RAG       Model decides whether to retrieve (not always needed)

I find hybrid search (vectors + BM25 keyword matching) gives the biggest improvement for the least complexity. Pure vector search sometimes misses obvious keyword matches, such as exact product names, error codes, or identifiers.
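One common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). To be clear, this is a technique chosen for this sketch, not necessarily how any particular setup combines the two. The chunk ids are hypothetical, and `k = 60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of ids; each list adds 1 / (k + rank) to an id's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["c2", "c7", "c1"]   # hypothetical chunk ids from a vector search
keyword_hits = ["c7", "c4", "c2"]  # hypothetical chunk ids from a BM25 search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["c7", "c2", "c4", "c1"]
```

RRF only looks at rank positions, so you never have to put cosine scores and BM25 scores on a common scale. Chunks that both searches agree on (here `c7` and `c2`) rise to the top, while chunks found by only one search still make the list.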


What I’m Still Learning

  • Optimal chunk sizing for different document types (code vs prose vs legal text)
  • When to use RAG vs fine-tuning vs just a longer context window
  • How evaluation works — measuring RAG quality is surprisingly hard
  • Graph RAG — combining structured knowledge with vector retrieval

Go Deeper

Best Resources

  • LlamaIndex documentation — Most comprehensive RAG framework
  • LangChain RAG tutorials — Good for learning the patterns
  • Pinecone Learning Centre — Practical vector search guides
  • “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” — The original 2020 paper that named the pattern