Embeddings are dense vector representations that capture semantic meaning. They map discrete objects (words, sentences, images, code) into continuous vector spaces where similar items are close together.
Core Concepts
Vector Space
An embedding maps an input to a high-dimensional vector (typically 384-3072 dimensions). Similarity in this space is usually measured with one of the following (a NumPy sketch follows the list):
- Cosine similarity — Measures angle between vectors (most common)
- Euclidean distance — Measures straight-line distance
- Dot product — Combines magnitude and angle
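A minimal NumPy sketch of the three measures on two toy vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

# Two toy embedding vectors; stand-ins for real model output.
a = np.array([0.2, 0.7, 0.1])
b = np.array([0.3, 0.6, 0.2])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle only
euclidean = np.linalg.norm(a - b)                                # straight-line distance
dot = np.dot(a, b)                                               # magnitude and angle

print(f"cosine={cosine:.3f}  euclidean={euclidean:.3f}  dot={dot:.3f}")
```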
Semantic Similarity
Similar concepts cluster together (a sketch with pretrained word vectors follows this list):
- “king” and “queen” are close
- “king” - “man” + “woman” ≈ “queen”
- Code implementing the same algorithm clusters regardless of variable names
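The analogy can be reproduced with off-the-shelf word vectors; a sketch using gensim's downloader (the glove-wiki-gigaword-50 vectors are just one readily available choice):

```python
import gensim.downloader as api

# Downloads a small set of pretrained GloVe word vectors on first use.
kv = api.load("glove-wiki-gigaword-50")

# vector("king") - vector("man") + vector("woman") is approximately vector("queen")
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# "queen" is expected at or near the top of the result.
```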
Properties
- Dense — Most values are non-zero (unlike sparse bag-of-words)
- Learned — Representations emerge from training
- Transferable — Embeddings from one task often work for others
Embedding Models
Text Embeddings
Models trained to encode text semantics; a minimal usage sketch follows the table.
| Model | Dimensions | Context (tokens) | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 | Strongest commercial |
| OpenAI text-embedding-3-small | 1536 | 8191 | Cost-effective |
| Cohere embed-v3 | 1024 | 512 | Multilingual |
| Voyage-3 | 1024 | 32000 | Long context |
| BGE-large | 1024 | 512 | Open-source SOTA |
| E5-mistral-7b | 4096 | 32768 | LLM-based, powerful |
| Nomic-embed-text | 768 | 8192 | Open, long context |
| all-MiniLM-L6-v2 | 384 | 256 | Fast, lightweight |
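A minimal usage sketch with the Sentence Transformers library and all-MiniLM-L6-v2 from the table (any of the open models would work the same way):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

sentences = [
    "How do I fix a slow database?",
    "Optimising PostgreSQL query performance",
    "Best hiking trails near Edinburgh",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity between the first sentence and the other two.
print(util.cos_sim(embeddings[0], embeddings[1:]))
```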
Code Embeddings
Specialised for source code:
- Voyage Code — Voyage AI's code-specific embedding models
- CodeBERT — Microsoft's encoder pre-trained on code and natural language
- StarEncoder — encoder from the BigCode project
Multimodal Embeddings
Joint text-image spaces:
- CLIP (OpenAI) — Text and images in shared space
- SigLIP (Google) — Improved CLIP variant
- ImageBind (Meta) — Six modalities in one space
Applications
Semantic Search
Find relevant documents by meaning, not just keywords.
Query: "How do I fix a slow database?"
Matches: "Optimising PostgreSQL query performance" (semantically similar)
Retrieval-Augmented Generation
Core of RAG systems — retrieve relevant context for LLM generation.
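A minimal retrieve-then-generate sketch; retrieve and llm are hypothetical stand-ins for your vector search and LLM call:

```python
def rag_answer(question: str, retrieve, llm, k: int = 3) -> str:
    # 1. Retrieve the k chunks whose embeddings are closest to the question's embedding.
    chunks = retrieve(question, k=k)
    # 2. Put the retrieved context into the prompt so the LLM answers grounded in it.
    context = "\n\n".join(chunks)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return llm(prompt)
```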
Clustering & Classification
- Group similar documents automatically
- Zero-shot classification by comparing documents to class label embeddings (sketch after this list)
- Anomaly detection (outliers in embedding space)
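A sketch of zero-shot classification by nearest class label, again assuming a Sentence Transformers model (labels and text are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

labels = ["billing question", "bug report", "feature request"]
label_emb = model.encode(labels, normalize_embeddings=True)

text = "The app crashes whenever I rotate my phone."
text_emb = model.encode(text, normalize_embeddings=True)

# Assign the label whose embedding is most similar to the text embedding.
scores = util.cos_sim(text_emb, label_emb)[0]
print(labels[int(scores.argmax())])  # likely "bug report"
```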
Recommendations
- “More like this” by finding nearest neighbours
- User-item matching in recommendation systems
Deduplication
Find near-duplicate content by similarity threshold.
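A sketch of threshold-based near-duplicate detection over L2-normalised embeddings (the 0.9 threshold is an arbitrary starting point; tune it on your data):

```python
import numpy as np

def near_duplicates(embeddings: np.ndarray, threshold: float = 0.9):
    """Return index pairs whose cosine similarity exceeds the threshold.

    Assumes rows are L2-normalised, so the dot product equals the cosine."""
    sims = embeddings @ embeddings.T
    pairs = []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                pairs.append((i, j))
    return pairs
```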
Embedding Pipeline
Chunking
See Chunking Strategies. Chunk size affects embedding quality (a minimal chunker sketch follows the list):
- Too short: loses context
- Too long: dilutes specific information
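As an illustrative baseline only (the dedicated note covers better strategies), a fixed-size character chunker with overlap:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows.

    Character counts are a crude proxy for tokens; size/overlap are example values."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
    return chunks
```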
Normalisation
Most models output normalised vectors (unit length). If not, normalise before computing cosine similarity.
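A NumPy sketch of L2 normalisation; after it, the dot product and cosine similarity coincide:

```python
import numpy as np

def l2_normalise(embeddings: np.ndarray) -> np.ndarray:
    # Divide each row by its L2 norm so every vector has unit length.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)
```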
Dimensionality Reduction
Reduce dimensions for storage and compute efficiency (sketch after the list):
- PCA — Linear, fast
- UMAP — Non-linear, preserves structure
- Matryoshka embeddings — Models trained to truncate gracefully
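A sketch of the two simplest options: PCA with scikit-learn, and plain truncation for Matryoshka-style models (the target dimension of 256 is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(1000, 1024).astype("float32")  # stand-in for real embeddings

# Linear reduction with PCA (fit on a representative sample of your corpus).
reduced = PCA(n_components=256).fit_transform(embeddings)

# Matryoshka-style models are trained so that keeping only the first k
# dimensions still works; re-normalise after truncating.
truncated = embeddings[:, :256].copy()
truncated /= np.linalg.norm(truncated, axis=1, keepdims=True)
```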
Batching
Embed in batches for efficiency. Most APIs accept batch inputs.
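A sketch of batched embedding; embed_batch is a hypothetical stand-in for whichever API or model call you use:

```python
def embed_all(texts, embed_batch, batch_size: int = 64):
    """Embed texts in fixed-size batches instead of one call per text."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[i:i + batch_size]))
    return vectors
```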
Practical Considerations
Choosing a Model
Factors:
- Quality — Check MTEB leaderboard
- Dimensions — Higher = more expressive but more storage/compute
- Context length — Must fit your chunks
- Speed — Latency requirements
- Cost — API pricing or self-hosting costs
Storage
See Vector Databases. Consider:
- Quantisation (reduce precision to save space)
- Approximate nearest neighbour (ANN) indexing for speed (see the FAISS sketch below)
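A sketch of ANN search with FAISS, assuming normalised vectors so inner product equals cosine similarity (index type and parameters are illustrative):

```python
import faiss
import numpy as np

dim = 384
vectors = np.random.rand(10_000, dim).astype("float32")  # stand-in for real embeddings
faiss.normalize_L2(vectors)  # in-place L2 normalisation

# HNSW is an approximate nearest-neighbour index; 32 is the graph connectivity.
index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 approximate neighbours
```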
Versioning
Embeddings from different models aren’t comparable. If you change embedding models:
- Re-embed all documents
- Or maintain separate indices per model version
Evaluation
Benchmarks
- MTEB (Massive Text Embedding Benchmark) — Standard leaderboard
- BEIR — Information retrieval benchmark
Metrics
- Retrieval: Recall@K, NDCG, MRR (sketch after this list)
- Clustering: Silhouette score, V-measure
- Classification: Accuracy, F1
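A sketch of two retrieval metrics for a single query, given ranked document IDs and the set of relevant IDs:

```python
def recall_at_k(ranked_ids, relevant_ids, k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr(ranked_ids, relevant_ids) -> float:
    """Reciprocal rank of the first relevant result (0 if none retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```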
Tools & Libraries
- Sentence Transformers — Python library for embeddings
- LangChain Embeddings
- LlamaIndex
- Hugging Face TEI — Text Embeddings Inference server
API Providers
OpenAI, Cohere, and Voyage AI (see the model table above) offer hosted embedding APIs; the open models can be self-hosted, e.g. with Hugging Face TEI.
Resources
- What are Embeddings? — Comprehensive explainer
- MTEB Leaderboard
- Sentence-BERT Paper
- OpenAI Embeddings Guide