Embeddings are dense vector representations that capture semantic meaning. They map discrete objects (words, sentences, images, code) into continuous vector spaces where similar items are close together.
Core Concepts
Vector Space
An embedding maps an input to a high-dimensional vector (typically 384-3072 dimensions). Similarity in this space is usually measured with one of the following (a NumPy sketch follows the list):
- Cosine similarity — Measures angle between vectors (most common)
- Euclidean distance — Measures straight-line distance
- Dot product — Combines magnitude and angle
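A minimal NumPy sketch of the three measures on two toy vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

# Two toy embedding vectors; stand-ins for real model output.
a = np.array([0.2, 0.7, 0.1])
b = np.array([0.3, 0.6, 0.2])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle only
euclidean = np.linalg.norm(a - b)                                # straight-line distance
dot = np.dot(a, b)                                               # magnitude and angle

print(f"cosine={cosine:.3f}  euclidean={euclidean:.3f}  dot={dot:.3f}")
```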
Semantic Similarity
Similar concepts cluster together (a sketch with pretrained word vectors follows this list):
- “king” and “queen” are close
- “king” - “man” + “woman” ≈ “queen”
- Code implementing the same algorithm clusters regardless of variable names
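The analogy can be reproduced with off-the-shelf word vectors; a sketch using gensim's downloader (the glove-wiki-gigaword-50 vectors are just one readily available choice):

```python
import gensim.downloader as api

# Downloads a small set of pretrained GloVe word vectors on first use.
kv = api.load("glove-wiki-gigaword-50")

# vector("king") - vector("man") + vector("woman") is approximately vector("queen")
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# "queen" is expected at or near the top of the result.
```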
Properties
- Dense — Most values are non-zero (unlike sparse bag-of-words)
- Learned — Representations emerge from training
- Transferable — Embeddings from one task often work for others
Embedding Models
Text Embeddings
Models trained to encode text semantics; a minimal usage sketch follows the table.
| Model | Dimensions | Context (tokens) | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 | Strongest commercial |
| OpenAI text-embedding-3-small | 1536 | 8191 | Cost-effective |
| Cohere embed-v3 | 1024 | 512 | Multilingual |
| Voyage-3 | 1024 | 32000 | Long context |
| BGE-large | 1024 | 512 | Open-source SOTA |
| E5-mistral-7b | 4096 | 32768 | LLM-based, powerful |
| Nomic-embed-text | 768 | 8192 | Open, long context |
| all-MiniLM-L6-v2 | 384 | 256 | Fast, lightweight |
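A minimal usage sketch with the Sentence Transformers library and all-MiniLM-L6-v2 from the table (any of the open models would work the same way):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

sentences = [
    "How do I fix a slow database?",
    "Optimising PostgreSQL query performance",
    "Best hiking trails near Edinburgh",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity between the first sentence and the other two.
print(util.cos_sim(embeddings[0], embeddings[1:]))
```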
Code Embeddings
Specialised for source code:
- Voyage Code — Voyage AI's code-specific embedding models
- CodeBERT — Microsoft's encoder pre-trained on code and natural language
- StarEncoder — encoder from the BigCode project
Multimodal Embeddings
Joint text-image spaces:
- CLIP (OpenAI) — Text and images in shared space
- SigLIP (Google) — Improved CLIP variant
- ImageBind (Meta) — Six modalities in one space
Applications
Semantic Search
Find relevant documents by meaning, not just keywords.
Query: "How do I fix a slow database?"
Matches: "Optimising PostgreSQL query performance" (semantically similar)
Retrieval-Augmented Generation
Core of RAG systems — retrieve relevant context for LLM generation.
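A minimal retrieve-then-generate sketch; retrieve and llm are hypothetical stand-ins for your vector search and LLM call:

```python
def rag_answer(question: str, retrieve, llm, k: int = 3) -> str:
    # 1. Retrieve the k chunks whose embeddings are closest to the question's embedding.
    chunks = retrieve(question, k=k)
    # 2. Put the retrieved context into the prompt so the LLM answers grounded in it.
    context = "\n\n".join(chunks)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return llm(prompt)
```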
Clustering & Classification
- Group similar documents automatically
- Zero-shot classification by comparing documents to class label embeddings (sketch after this list)
- Anomaly detection (outliers in embedding space)
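A sketch of zero-shot classification by nearest class label, again assuming a Sentence Transformers model (labels and text are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

labels = ["billing question", "bug report", "feature request"]
label_emb = model.encode(labels, normalize_embeddings=True)

text = "The app crashes whenever I rotate my phone."
text_emb = model.encode(text, normalize_embeddings=True)

# Assign the label whose embedding is most similar to the text embedding.
scores = util.cos_sim(text_emb, label_emb)[0]
print(labels[int(scores.argmax())])  # likely "bug report"
```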
Recommendations
- “More like this” by finding nearest neighbours
- User-item matching in recommendation systems
Deduplication
Find near-duplicate content by similarity threshold.
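A sketch of threshold-based near-duplicate detection over L2-normalised embeddings (the 0.9 threshold is an arbitrary starting point; tune it on your data):

```python
import numpy as np

def near_duplicates(embeddings: np.ndarray, threshold: float = 0.9):
    """Return index pairs whose cosine similarity exceeds the threshold.

    Assumes rows are L2-normalised, so the dot product equals the cosine."""
    sims = embeddings @ embeddings.T
    pairs = []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                pairs.append((i, j))
    return pairs
```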
Embedding Pipeline
Chunking
See Chunking Strategies. Chunk size affects embedding quality (a minimal chunker sketch follows the list):
- Too short: loses context
- Too long: dilutes specific information
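As an illustrative baseline only (the dedicated note covers better strategies), a fixed-size character chunker with overlap:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows.

    Character counts are a crude proxy for tokens; size/overlap are example values."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
    return chunks
```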
Normalisation
Most models output normalised vectors (unit length). If not, normalise before computing cosine similarity.
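A NumPy sketch of L2 normalisation; after it, the dot product and cosine similarity coincide:

```python
import numpy as np

def l2_normalise(embeddings: np.ndarray) -> np.ndarray:
    # Divide each row by its L2 norm so every vector has unit length.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)
```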
Dimensionality Reduction
Reduce dimensions for storage and compute efficiency (sketch after the list):
- PCA — Linear, fast
- UMAP — Non-linear, preserves structure
- Matryoshka embeddings — Models trained to truncate gracefully
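A sketch of the two simplest options: PCA with scikit-learn, and plain truncation for Matryoshka-style models (the target dimension of 256 is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(1000, 1024).astype("float32")  # stand-in for real embeddings

# Linear reduction with PCA (fit on a representative sample of your corpus).
reduced = PCA(n_components=256).fit_transform(embeddings)

# Matryoshka-style models are trained so that keeping only the first k
# dimensions still works; re-normalise after truncating.
truncated = embeddings[:, :256].copy()
truncated /= np.linalg.norm(truncated, axis=1, keepdims=True)
```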
Batching
Embed in batches for efficiency. Most APIs accept batch inputs.
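A sketch of batched embedding; embed_batch is a hypothetical stand-in for whichever API or model call you use:

```python
def embed_all(texts, embed_batch, batch_size: int = 64):
    """Embed texts in fixed-size batches instead of one call per text."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[i:i + batch_size]))
    return vectors
```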
Practical Considerations
Choosing a Model
Factors:
- Quality — Check MTEB leaderboard
- Dimensions — Higher = more expressive but more storage/compute
- Context length — Must fit your chunks
- Speed — Latency requirements
- Cost — API pricing or self-hosting costs
Storage
See Vector Databases. Consider:
- Quantisation (reduce precision to save space)
- Approximate nearest neighbour (ANN) indexing for speed (see the FAISS sketch below)
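A sketch of ANN search with FAISS, assuming normalised vectors so inner product equals cosine similarity (index type and parameters are illustrative):

```python
import faiss
import numpy as np

dim = 384
vectors = np.random.rand(10_000, dim).astype("float32")  # stand-in for real embeddings
faiss.normalize_L2(vectors)  # in-place L2 normalisation

# HNSW is an approximate nearest-neighbour index; 32 is the graph connectivity.
index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 approximate neighbours
```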
Versioning
Embeddings from different models aren’t comparable. If you change embedding models:
- Re-embed all documents
- Or maintain separate indices per model version
Evaluation
Benchmarks
- MTEB (Massive Text Embedding Benchmark) — Standard leaderboard
- BEIR — Information retrieval benchmark
Metrics
- Retrieval: Recall@K, NDCG, MRR (sketch after this list)
- Clustering: Silhouette score, V-measure
- Classification: Accuracy, F1
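A sketch of two retrieval metrics for a single query, given ranked document IDs and the set of relevant IDs:

```python
def recall_at_k(ranked_ids, relevant_ids, k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr(ranked_ids, relevant_ids) -> float:
    """Reciprocal rank of the first relevant result (0 if none retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```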
Tools & Libraries
- Sentence Transformers — Python library for embeddings
- LangChain Embeddings
- LlamaIndex
- Hugging Face TEI — Text Embeddings Inference server
API Providers
OpenAI, Cohere, and Voyage AI (see the model table above) offer hosted embedding APIs; the open models can be self-hosted, e.g. with Hugging Face TEI.
Resources
- What are Embeddings? — Comprehensive explainer
- MTEB Leaderboard
- Sentence-BERT Paper
- OpenAI Embeddings Guide