Reference notes.
Retrieval-Augmented Generation (RAG) combines the generative capabilities of LLMs with external knowledge retrieval. Instead of relying solely on knowledge baked into model weights, RAG retrieves relevant documents at inference time and includes them in the context.
Why RAG?
- Current information — Access data beyond the training cutoff
- Domain specificity — Ground responses in your proprietary data
- Reduced hallucinations — Answers are anchored to retrieved sources
- Auditability — Can cite sources for generated content
- Cost efficiency — Cheaper than fine-tuning for knowledge updates
Basic Architecture
User Query → Embedding → Vector Search → Retrieved Docs → LLM → Response
                               ↑
                        Vector Database
                      (pre-indexed docs)
- Indexing (offline): Documents are chunked, embedded, and stored in a vector database
- Retrieval (online): User query is embedded, similar documents are retrieved
- Generation (online): Retrieved context + query are sent to the LLM
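A minimal sketch of these three stages in Python. The `embed()` and `generate()` helpers (and the `my_llm_client` module) are hypothetical stand-ins for whatever embedding model and LLM you use; the "vector database" here is just an in-memory NumPy matrix.

```python
import numpy as np

# Hypothetical helpers: embed(text) -> normalized np.ndarray, generate(prompt) -> str
from my_llm_client import embed, generate

# --- Indexing (offline) ---
documents = ["Doc one text...", "Doc two text..."]
index = np.vstack([embed(d) for d in documents])  # one row per document

# --- Retrieval (online) ---
def retrieve(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    scores = index @ q                      # cosine similarity for normalized vectors
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

# --- Generation (online) ---
def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```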
Chunking Strategies
How you split documents significantly impacts retrieval quality.
Fixed-size Chunking
Split by character or token count with overlap.
- Simple to implement
- May break semantic boundaries
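A sketch of the character-based variant; for token-based chunking, substitute a tokenizer and count tokens instead of characters.

```python
def chunk_fixed(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows with the given overlap."""
    chunks = []
    step = size - overlap                      # advance less than a full window to overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks
```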
Semantic Chunking
Split at natural boundaries (paragraphs, sections, sentences).
- Preserves meaning
- Variable chunk sizes
Recursive Chunking
Try splitting by larger units first (sections), then smaller (paragraphs, sentences) if chunks are too large.
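A simplified sketch of the idea. Real implementations also merge small adjacent pieces back up toward the target size and preserve separators; this version only handles the recursive splitting.

```python
def chunk_recursive(text: str, max_len: int = 1000,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ")) -> list[str]:
    """Split on the coarsest separator first; recurse into pieces that are still too large."""
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(chunk_recursive(piece, max_len, rest))
    return chunks
```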
Document-aware Chunking
Respect document structure (headers, code blocks, tables).
- Better for structured content
- Requires format-specific parsing
Chunk Size Trade-offs
- Too small: Loses context, more retrieval noise
- Too large: Dilutes relevance, wastes context window
- Typical range: 256–1024 tokens with 10–20% overlap
Retrieval Methods
Dense Retrieval (Semantic)
Uses Embeddings to find semantically similar content. Good for conceptual queries but may miss exact matches.
Sparse Retrieval (Lexical)
Traditional keyword matching (BM25, TF-IDF). Good for exact terms, proper nouns, and technical identifiers.
Hybrid Retrieval
Combines dense and sparse retrieval, often with Reciprocal Rank Fusion (RRF) to merge results. Usually outperforms either alone.
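A sketch of RRF itself: each ranked list (e.g., one from dense search, one from BM25) contributes 1 / (k + rank) per document, and the fused order is by total score. k = 60 is a common default.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge several ranked lists of documents or doc IDs."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = rrf([dense_results, bm25_results])
```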
Reranking & Late Interaction
A second-stage model scores retrieved documents for relevance. More expensive but improves precision.
Common approaches:
- Cross-encoders — High accuracy but slow. Models like Cohere Rerank or Jina Reranker.
- Late Interaction (ColBERT) — Embeds the query and the document at the token level and scores them with token-wise MaxSim at query time. Balances the speed of dense retrieval with the precision of cross-encoders.
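A sketch of cross-encoder reranking using the sentence-transformers `CrossEncoder` class; the model name below is one commonly used open MS MARCO reranker, and a hosted reranker (Cohere, Jina) would slot in the same place.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list[str], top_n: int = 5) -> list[str]:
    # Score each (query, doc) pair jointly, then keep the highest-scoring documents.
    scores = reranker.predict([(query, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [d for d, _ in ranked[:top_n]]
```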
Advanced RAG Patterns
Query Transformation
Improve retrieval by modifying the original query:
- Query expansion — Add related terms
- HyDE (Hypothetical Document Embeddings) — Generate a hypothetical answer, embed that instead
- Multi-query — Generate query variations, retrieve for each, merge results
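A minimal sketch of the multi-query variant, reusing the hypothetical `generate()` and `retrieve()` helpers from the pipeline sketch and the `rrf()` function from the hybrid-retrieval sketch.

```python
def multi_query_retrieve(query: str, k: int = 5) -> list[str]:
    """Generate paraphrases of the query, retrieve for each, and fuse the results with RRF."""
    prompt = f"Rewrite this search query in 3 different ways, one per line:\n{query}"
    variants = [query] + generate(prompt).splitlines()
    rankings = [retrieve(v, k) for v in variants]
    return rrf(rankings)[:k]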
Parent-Child Retrieval
Store small chunks for precise retrieval, but return the parent document (larger context) to the LLM.
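A sketch under assumed data structures: `retrieve_ids()` is a hypothetical search call returning child-chunk IDs, `parent_of` maps a child ID to its parent document ID, and `parent_docs` holds the full parent texts.

```python
def parent_child_retrieve(query: str, k: int = 4) -> list[str]:
    """Search over small child chunks, but return their (deduplicated) parent documents."""
    child_ids = retrieve_ids(query, k=20)     # hypothetical: returns child chunk IDs
    parents, seen = [], set()
    for cid in child_ids:
        pid = parent_of[cid]
        if pid not in seen:                   # several children may share one parent
            seen.add(pid)
            parents.append(parent_docs[pid])
        if len(parents) == k:
            break
    return parents
```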
Contextual Compression
After retrieval, extract only the relevant portions of documents before sending to the LLM.
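One simple way to sketch this, assuming the hypothetical `embed()` helper from earlier: keep only sentences whose embedding is sufficiently similar to the query embedding. (Production systems often use an LLM or a trained extractor instead.)

```python
def compress(query: str, doc: str, threshold: float = 0.4) -> str:
    """Drop sentences that are not similar enough to the query before sending the doc to the LLM."""
    q = embed(query)
    sentences = doc.split(". ")
    kept = [s for s in sentences if float(embed(s) @ q) >= threshold]
    return ". ".join(kept)
```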
Self-RAG
The model decides when to retrieve, what to retrieve, and whether retrieved content is useful.
GraphRAG
Combines vector retrieval with knowledge graphs for relationship-aware retrieval. Useful for multi-hop reasoning.
Contextual Retrieval
Prepend chunk-specific context (generated by an LLM) to each chunk before embedding. For example, a chunk about “Q3 revenue” gets prefixed with “This chunk is from Acme Corp’s 2024 annual report, financial results section.” This dramatically improves retrieval accuracy by reducing ambiguity.
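A rough sketch of the indexing-time step, reusing the hypothetical `generate()` helper; the exact prompt wording is an assumption.

```python
def contextualize(chunk: str, full_document: str) -> str:
    """Ask the LLM to situate a chunk within its source document, then prepend that context."""
    prompt = (
        "<document>\n" + full_document + "\n</document>\n"
        "Here is a chunk from the document:\n" + chunk + "\n"
        "Write one or two sentences situating this chunk within the overall document."
    )
    context = generate(prompt)
    return context + "\n\n" + chunk   # embed and index this combined text instead of the raw chunk
```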
Late Chunking
Embed the entire document first using a long-context embedding model, then chunk the resulting token-level embeddings. Each chunk’s embedding retains awareness of the full document context, unlike traditional chunk-then-embed pipelines. Requires long-context embedding models (e.g., jina-embeddings-v3).
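A conceptual sketch only: `encode_tokens()` is a hypothetical helper returning per-token embeddings and character offsets from a long-context embedding model; chunk embeddings are then pooled from the token embeddings inside each chunk span.

```python
import numpy as np

def late_chunk(text: str, boundaries: list[tuple[int, int]]) -> list[np.ndarray]:
    """Embed the whole document once, then mean-pool token embeddings per chunk span."""
    token_embs, offsets = encode_tokens(text)            # hypothetical: ([n_tokens, dim], [(start, end), ...])
    chunk_vecs = []
    for start, end in boundaries:                         # character spans of each chunk
        idx = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end]
        if idx:                                           # skip empty spans
            chunk_vecs.append(token_embs[idx].mean(axis=0))
    return chunk_vecs
```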
Cross-Granularity Retrieval
Index at the level of individual sentences (atomic units), then assemble the surrounding context at query time. Avoids chunk-boundary failures by letting the retrieval system dynamically determine the right window of context for each query.
Agentic RAG
An agent decides retrieval strategy dynamically:
- Which sources to query
- Whether to retrieve more
- How to combine information
Evaluation
Retrieval Metrics
- Precision@K — Fraction of retrieved docs that are relevant
- Recall@K — Fraction of relevant docs that were retrieved
- MRR (Mean Reciprocal Rank) — Position of first relevant result
- NDCG — Considers graded relevance and ranking
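These first three metrics are straightforward to compute directly; a sketch assuming retrieved results are lists of doc IDs and relevance judgments are sets of relevant doc IDs per query.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(d in relevant for d in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top k."""
    return sum(d in relevant for d in retrieved[:k]) / len(relevant)

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant result per query (0 if none retrieved)."""
    rr = []
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        rank = next((i + 1 for i, d in enumerate(retrieved) if d in relevant), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)
```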
Generation Metrics
- Faithfulness — Is the response supported by retrieved context?
- Relevance — Does the response answer the query?
- Groundedness — Are claims traceable to sources?
Tools
- RAGAS — RAG evaluation framework
- DeepEval
- Arize Phoenix
Common Failure Modes
- Retrieval miss — Relevant documents not retrieved
- Context overflow — Too many documents exceed context window
- Lost in the middle — Model ignores documents in the middle of context
- Conflicting information — Retrieved docs contradict each other
- Over-reliance — Model ignores its own knowledge when retrieval is wrong
Implementation Stack
Frameworks
Vector Databases
See Vector Databases
Embedding Models
See Embeddings
See Also
- LLMs — The generative models used in the generation step
- Prompt Engineering — Techniques for effective context injection
- Fine-tuning — Alternative approach when RAG isn’t sufficient
Resources
- RAG Paper — Original RAG paper
- Anthropic RAG Guide
- Advanced RAG Techniques
- 12 RAG Pain Points