Reference notes.
Retrieval-Augmented Generation (RAG) combines the generative capabilities of LLMs with external knowledge retrieval. Instead of relying solely on knowledge baked into model weights, RAG retrieves relevant documents at inference time and includes them in the context.
Why RAG?
- Current information — Access data beyond the training cutoff
- Domain specificity — Ground responses in your proprietary data
- Reduced hallucinations — Answers are anchored to retrieved sources
- Auditability — Can cite sources for generated content
- Cost efficiency — Cheaper than fine-tuning for knowledge updates
Basic Architecture
User Query → Embedding → Vector Search → Retrieved Docs → LLM → Response
                               ↑
                        Vector Database
                      (pre-indexed docs)
- Indexing (offline): Documents are chunked, embedded, and stored in a vector database
- Retrieval (online): User query is embedded, similar documents are retrieved
- Generation (online): Retrieved context + query are sent to the LLM
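A minimal sketch of these three stages in Python. The `embed()` and `generate()` helpers (and the `my_llm_client` module) are hypothetical stand-ins for whatever embedding model and LLM you use; the "vector database" here is just an in-memory NumPy matrix.

```python
import numpy as np

# Hypothetical helpers: embed(text) -> normalized np.ndarray, generate(prompt) -> str
from my_llm_client import embed, generate

# --- Indexing (offline) ---
documents = ["Doc one text...", "Doc two text..."]
index = np.vstack([embed(d) for d in documents])  # one row per document

# --- Retrieval (online) ---
def retrieve(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    scores = index @ q                      # cosine similarity for normalized vectors
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

# --- Generation (online) ---
def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```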
Chunking Strategies
How you split documents significantly impacts retrieval quality.
Fixed-size Chunking
Split by character or token count with overlap.
- Simple to implement
- May break semantic boundaries
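A sketch of the character-based variant; for token-based chunking, substitute a tokenizer and count tokens instead of characters.

```python
def chunk_fixed(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows with the given overlap."""
    chunks = []
    step = size - overlap                      # advance less than a full window to overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks
```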
Semantic Chunking
Split at natural boundaries (paragraphs, sections, sentences).
- Preserves meaning
- Variable chunk sizes
Recursive Chunking
Try splitting by larger units first (sections), then smaller (paragraphs, sentences) if chunks are too large.
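A simplified sketch of the idea. Real implementations also merge small adjacent pieces back up toward the target size and preserve separators; this version only handles the recursive splitting.

```python
def chunk_recursive(text: str, max_len: int = 1000,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ")) -> list[str]:
    """Split on the coarsest separator first; recurse into pieces that are still too large."""
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(chunk_recursive(piece, max_len, rest))
    return chunks
```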
Document-aware Chunking
Respect document structure (headers, code blocks, tables).
- Better for structured content
- Requires format-specific parsing
Chunk Size Trade-offs
- Too small: Loses context, more retrieval noise
- Too large: Dilutes relevance, wastes context window
- Typical range: 256–1024 tokens with 10–20% overlap
Retrieval Methods
Dense Retrieval (Semantic)
Uses Embeddings to find semantically similar content. Good for conceptual queries but may miss exact matches.
Sparse Retrieval (Lexical)
Traditional keyword matching (BM25, TF-IDF). Good for exact terms, proper nouns, and technical identifiers.
Hybrid Retrieval
Combines dense and sparse retrieval, often with Reciprocal Rank Fusion (RRF) to merge results. Usually outperforms either alone.
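A sketch of RRF itself: each ranked list (e.g., one from dense search, one from BM25) contributes 1 / (k + rank) per document, and the fused order is by total score. k = 60 is a common default.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge several ranked lists of documents or doc IDs."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = rrf([dense_results, bm25_results])
```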
Reranking & Late Interaction
A second-stage model scores retrieved documents for relevance. More expensive but improves precision.
Common approaches:
- Cross-encoders — High accuracy but slow. Models like Cohere Rerank or Jina Reranker.
- Late Interaction (ColBERT) — Embeds the query and the document at the token level and scores them with token-wise MaxSim at query time. Balances the speed of dense retrieval with the precision of cross-encoders.
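A sketch of cross-encoder reranking using the sentence-transformers `CrossEncoder` class; the model name below is one commonly used open MS MARCO reranker, and a hosted reranker (Cohere, Jina) would slot in the same place.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list[str], top_n: int = 5) -> list[str]:
    # Score each (query, doc) pair jointly, then keep the highest-scoring documents.
    scores = reranker.predict([(query, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [d for d, _ in ranked[:top_n]]
```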
Advanced RAG Patterns
Query Transformation
Improve retrieval by modifying the original query:
- Query expansion — Add related terms
- HyDE (Hypothetical Document Embeddings) — Generate a hypothetical answer, embed that instead
- Multi-query — Generate query variations, retrieve for each, merge results
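A minimal sketch of the multi-query variant, reusing the hypothetical `generate()` and `retrieve()` helpers from the pipeline sketch and the `rrf()` function from the hybrid-retrieval sketch.

```python
def multi_query_retrieve(query: str, k: int = 5) -> list[str]:
    """Generate paraphrases of the query, retrieve for each, and fuse the results with RRF."""
    prompt = f"Rewrite this search query in 3 different ways, one per line:\n{query}"
    variants = [query] + generate(prompt).splitlines()
    rankings = [retrieve(v, k) for v in variants]
    return rrf(rankings)[:k]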
Parent-Child Retrieval
Store small chunks for precise retrieval, but return the parent document (larger context) to the LLM.
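A sketch under assumed data structures: `retrieve_ids()` is a hypothetical search call returning child-chunk IDs, `parent_of` maps a child ID to its parent document ID, and `parent_docs` holds the full parent texts.

```python
def parent_child_retrieve(query: str, k: int = 4) -> list[str]:
    """Search over small child chunks, but return their (deduplicated) parent documents."""
    child_ids = retrieve_ids(query, k=20)     # hypothetical: returns child chunk IDs
    parents, seen = [], set()
    for cid in child_ids:
        pid = parent_of[cid]
        if pid not in seen:                   # several children may share one parent
            seen.add(pid)
            parents.append(parent_docs[pid])
        if len(parents) == k:
            break
    return parents
```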
Contextual Compression
After retrieval, extract only the relevant portions of documents before sending to the LLM.
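One simple way to sketch this, assuming the hypothetical `embed()` helper from earlier: keep only sentences whose embedding is sufficiently similar to the query embedding. (Production systems often use an LLM or a trained extractor instead.)

```python
def compress(query: str, doc: str, threshold: float = 0.4) -> str:
    """Drop sentences that are not similar enough to the query before sending the doc to the LLM."""
    q = embed(query)
    sentences = doc.split(". ")
    kept = [s for s in sentences if float(embed(s) @ q) >= threshold]
    return ". ".join(kept)
```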
Self-RAG
The model decides when to retrieve, what to retrieve, and whether retrieved content is useful.
GraphRAG
Combines vector retrieval with knowledge graphs for relationship-aware retrieval. Useful for multi-hop reasoning.
Contextual Retrieval
Prepend chunk-specific context (generated by an LLM) to each chunk before embedding. For example, a chunk about “Q3 revenue” gets prefixed with “This chunk is from Acme Corp’s 2024 annual report, financial results section.” This dramatically improves retrieval accuracy by reducing ambiguity.
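A rough sketch of the indexing-time step, reusing the hypothetical `generate()` helper; the exact prompt wording is an assumption.

```python
def contextualize(chunk: str, full_document: str) -> str:
    """Ask the LLM to situate a chunk within its source document, then prepend that context."""
    prompt = (
        "<document>\n" + full_document + "\n</document>\n"
        "Here is a chunk from the document:\n" + chunk + "\n"
        "Write one or two sentences situating this chunk within the overall document."
    )
    context = generate(prompt)
    return context + "\n\n" + chunk   # embed and index this combined text instead of the raw chunk
```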
Late Chunking
Embed the entire document first using a long-context embedding model, then chunk the resulting token-level embeddings. Each chunk’s embedding retains awareness of the full document context, unlike traditional chunk-then-embed pipelines. Requires long-context embedding models (e.g., jina-embeddings-v3).
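A conceptual sketch only: `encode_tokens()` is a hypothetical helper returning per-token embeddings and character offsets from a long-context embedding model; chunk embeddings are then pooled from the token embeddings inside each chunk span.

```python
import numpy as np

def late_chunk(text: str, boundaries: list[tuple[int, int]]) -> list[np.ndarray]:
    """Embed the whole document once, then mean-pool token embeddings per chunk span."""
    token_embs, offsets = encode_tokens(text)            # hypothetical: ([n_tokens, dim], [(start, end), ...])
    chunk_vecs = []
    for start, end in boundaries:                         # character spans of each chunk
        idx = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end]
        if idx:                                           # skip empty spans
            chunk_vecs.append(token_embs[idx].mean(axis=0))
    return chunk_vecs
```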
Cross-Granularity Retrieval
Index at the level of individual sentences (atomic units), then assemble the surrounding context at query time. Avoids chunk-boundary failures by letting the retrieval system dynamically determine the right window of context for each query.
Agentic RAG
An agent decides retrieval strategy dynamically:
- Which sources to query
- Whether to retrieve more
- How to combine information
Evaluation
Retrieval Metrics
- Precision@K — Fraction of retrieved docs that are relevant
- Recall@K — Fraction of relevant docs that were retrieved
- MRR (Mean Reciprocal Rank) — Position of first relevant result
- NDCG — Considers graded relevance and ranking
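These first three metrics are straightforward to compute directly; a sketch assuming retrieved results are lists of doc IDs and relevance judgments are sets of relevant doc IDs per query.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(d in relevant for d in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top k."""
    return sum(d in relevant for d in retrieved[:k]) / len(relevant)

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant result per query (0 if none retrieved)."""
    rr = []
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        rank = next((i + 1 for i, d in enumerate(retrieved) if d in relevant), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)
```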
Generation Metrics
- Faithfulness — Is the response supported by retrieved context?
- Relevance — Does the response answer the query?
- Groundedness — Are claims traceable to sources?
Tools
- RAGAS — RAG evaluation framework
- DeepEval
- Arize Phoenix
Common Failure Modes
- Retrieval miss — Relevant documents not retrieved
- Context overflow — Too many documents exceed context window
- Lost in the middle — Model ignores documents in the middle of context
- Conflicting information — Retrieved docs contradict each other
- Over-reliance — Model ignores its own knowledge when retrieval is wrong
Implementation Stack
Frameworks
Vector Databases
See Vector Databases
Embedding Models
See Embeddings
See Also
- LLMs — The generative models used in the generation step
- Prompt Engineering — Techniques for effective context injection
- Fine-tuning — Alternative approach when RAG isn’t sufficient
Resources
- RAG Paper — Original RAG paper
- Anthropic RAG Guide
- Advanced RAG Techniques
- 12 RAG Pain Points