Retrieval-Augmented Generation (RAG) combines the generative capabilities of LLMs with external knowledge retrieval. Instead of relying solely on knowledge baked into model weights, RAG retrieves relevant documents at inference time and includes them in the context.
Why RAG?
- Current information — Access data beyond the training cutoff
- Domain specificity — Ground responses in your proprietary data
- Reduced hallucinations — Answers are anchored to retrieved sources
- Auditability — Can cite sources for generated content
- Cost efficiency — Cheaper than fine-tuning for knowledge updates
Basic Architecture
```
User Query → Embedding → Vector Search → Retrieved Docs → LLM → Response
                               ↑
                        Vector Database
                       (pre-indexed docs)
```
- Indexing (offline): Documents are chunked, embedded, and stored in a vector database
- Retrieval (online): User query is embedded, similar documents are retrieved
- Generation (online): Retrieved context + query are sent to the LLM
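A minimal sketch of all three stages, using sentence-transformers for embeddings (one common choice) and a brute-force cosine search in place of a real vector database; `generate()` is a hypothetical stand-in for any LLM client:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # one common embedding model

def generate(prompt: str) -> str:
    ...  # hypothetical stand-in for any LLM client call

# --- Indexing (offline): chunk, embed, and store documents ---
docs = [
    "RAG retrieves relevant documents at inference time.",
    "Chunking strategy strongly affects retrieval quality.",
]
index = embedder.encode(docs, normalize_embeddings=True)  # shape (n_docs, dim)

# --- Retrieval (online): embed the query, take the nearest chunks ---
def retrieve(query: str, k: int = 3) -> list[str]:
    q = embedder.encode(query, normalize_embeddings=True)
    scores = index @ q  # cosine similarity (vectors are normalized)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

# --- Generation (online): send retrieved context + query to the LLM ---
def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    return generate(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```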
Chunking Strategies
How you split documents significantly impacts retrieval quality.
Fixed-size Chunking
Split by character or token count with overlap.
- Simple to implement
- May break semantic boundaries
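A sketch of token-based fixed-size chunking with overlap; the tiktoken encoding here is an assumption, any tokenizer works:

```python
import tiktoken

def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into chunks of `chunk_size` tokens, each sharing `overlap` tokens with the previous one."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer choice
    tokens = enc.encode(text)
    step = chunk_size - overlap
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]
```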
Semantic Chunking
Split at natural boundaries (paragraphs, sections, sentences).
- Preserves meaning
- Variable chunk sizes
Recursive Chunking
Try splitting by larger units first (sections), then smaller (paragraphs, sentences) if chunks are too large.
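A simplified recursive splitter (real implementations, e.g. LangChain's RecursiveCharacterTextSplitter, also merge small adjacent pieces back together; this sketch only splits):

```python
def recursive_chunks(text: str, max_len: int = 1000,
                     separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # No separators left: fall back to a hard character split.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(recursive_chunks(piece, max_len, rest))
    return chunks
```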
Document-aware Chunking
Respect document structure (headers, code blocks, tables).
- Better for structured content
- Requires format-specific parsing
Chunk Size Trade-offs
- Too small: Loses context, more retrieval noise
- Too large: Dilutes relevance, wastes context window
- Typical range: 256–1024 tokens with 10–20% overlap
Retrieval Methods
Dense Retrieval (Semantic)
Uses Embeddings to find semantically similar content. Good for conceptual queries but may miss exact matches.
Sparse Retrieval (Lexical)
Traditional keyword matching (BM25, TF-IDF). Good for exact terms, proper nouns, and technical identifiers.
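A quick BM25 example with the rank_bm25 package; whitespace tokenization here is a simplification:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "RAG combines retrieval with generation.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Vector databases store embeddings for similarity search.",
]
tokenized = [doc.lower().split() for doc in corpus]  # naive tokenization
bm25 = BM25Okapi(tokenized)

query = "bm25 term frequency".split()
print(bm25.get_scores(query))               # one relevance score per document
print(bm25.get_top_n(query, corpus, n=1))   # best-matching document text
```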
Hybrid Retrieval
Combines dense and sparse retrieval, often with Reciprocal Rank Fusion (RRF) to merge results. Usually outperforms either alone.
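RRF itself is a few lines: each ranker contributes 1/(k + rank) per document, with k = 60 as the conventional constant:

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs; k = 60 is the constant from the original RRF paper."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Merge a dense (semantic) ranking with a sparse (BM25) ranking:
fused = rrf([["d3", "d1", "d2"], ["d1", "d4", "d3"]])
```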
Reranking
A second-stage model scores retrieved documents for relevance. More expensive but improves precision.
Common rerankers:
- Cohere Rerank
- Jina Reranker
- Cross-encoder models
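A cross-encoder reranking sketch with sentence-transformers; the model name is one common choice, not a requirement:

```python
from sentence_transformers import CrossEncoder

# Cross-encoders score (query, document) pairs jointly, unlike bi-encoder embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how does RAG reduce hallucinations?"
candidates = ["...doc from first-stage retrieval...", "...another doc..."]

scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```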
Advanced RAG Patterns
Query Transformation
Improve retrieval by modifying the original query:
- Query expansion — Add related terms
- HyDE (Hypothetical Document Embeddings) — Generate a hypothetical answer, embed that instead
- Multi-query — Generate query variations, retrieve for each, merge results
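A multi-query sketch, assuming a hypothetical `llm()` call plus the `retrieve()` and `rrf()` helpers sketched above:

```python
def llm(prompt: str) -> str: ...  # hypothetical LLM call

def multi_query_retrieve(query: str, n_variants: int = 3) -> list[str]:
    """Generate paraphrases of the query, retrieve for each, and fuse the rankings with RRF."""
    prompt = f"Rewrite this question {n_variants} different ways, one per line:\n{query}"
    variants = [query] + llm(prompt).splitlines()[:n_variants]
    rankings = [retrieve(v, k=10) for v in variants]
    return rrf(rankings)
```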
Parent-Child Retrieval
Store small chunks for precise retrieval, but return the parent document (larger context) to the LLM.
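One way to wire this up, reusing the `embedder`, `docs`, and `fixed_size_chunks()` from the sketches above:

```python
# Split documents into small child chunks, remembering each chunk's parent.
child_texts, child_parent = [], []
for pid, doc in enumerate(docs):
    for chunk in fixed_size_chunks(doc, chunk_size=128, overlap=16):
        child_texts.append(chunk)
        child_parent.append(pid)

child_index = embedder.encode(child_texts, normalize_embeddings=True)

def parent_child_retrieve(query: str, k: int = 3) -> list[str]:
    """Match on small chunks for precision; return whole parent docs for context."""
    q = embedder.encode(query, normalize_embeddings=True)
    hits = np.argsort(child_index @ q)[::-1][:10]
    parent_ids = dict.fromkeys(child_parent[i] for i in hits)  # dedupe, keep rank order
    return [docs[pid] for pid in list(parent_ids)[:k]]
```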
Contextual Compression
After retrieval, extract only the relevant portions of documents before sending to the LLM.
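A lightweight version using embedding similarity; the 0.5 threshold is an arbitrary starting point to tune:

```python
def compress(query: str, doc: str, threshold: float = 0.5) -> str:
    """Keep only sentences similar enough to the query before prompting the LLM."""
    q = embedder.encode(query, normalize_embeddings=True)  # embedder from the sketch above
    sents = [s for s in doc.split(". ") if s.strip()]
    vecs = embedder.encode(sents, normalize_embeddings=True)
    return ". ".join(s for s, v in zip(sents, vecs) if v @ q >= threshold)
```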
Self-RAG
The model decides when to retrieve, what to retrieve, and whether retrieved content is useful.
GraphRAG
Combines vector retrieval with knowledge graphs for relationship-aware retrieval. Useful for multi-hop reasoning.
Agentic RAG
An agent decides retrieval strategy dynamically:
- Which sources to query
- Whether to retrieve more
- How to combine information
Evaluation
Retrieval Metrics
- Precision@K — Fraction of retrieved docs that are relevant
- Recall@K — Fraction of relevant docs that were retrieved
- MRR (Mean Reciprocal Rank) — Position of first relevant result
- NDCG — Considers graded relevance and ranking
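The retrieval metrics are each a few lines, given a ranked list of retrieved IDs and the set of relevant IDs per query:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(d in relevant for d in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top k."""
    return sum(d in relevant for d in retrieved[:k]) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant hit; MRR is the mean of this over queries."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0
```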
Generation Metrics
- Faithfulness — Is the response supported by retrieved context?
- Relevance — Does the response answer the query?
- Groundedness — Are claims traceable to sources?
Tools
- RAGAS — RAG evaluation framework
- DeepEval — Open-source LLM evaluation library
- Arize Phoenix — LLM observability and evaluation
Common Failure Modes
- Retrieval miss — Relevant documents not retrieved
- Context overflow — Too many documents exceed context window
- Lost in the middle — Model ignores documents in the middle of context
- Conflicting information — Retrieved docs contradict each other
- Over-reliance — Model ignores its own knowledge when retrieval is wrong
Implementation Stack
Frameworks
- LangChain
- LlamaIndex
- Haystack
Vector Databases
See Vector Databases
Embedding Models
See Embeddings
Resources
- RAG Paper — Lewis et al. (2020), the original RAG paper
- Anthropic RAG Guide
- Advanced RAG Techniques
- 12 RAG Pain Points