Embeddings are dense vector representations that capture semantic meaning. They map discrete objects (words, sentences, images, code) into continuous vector spaces where similar items are close together.

Core Concepts

Vector Space

An embedding maps an input to a high-dimensional vector (typically 384-3072 dimensions). In this space, similarity is measured by (see the sketch after this list):

  • Cosine similarity — Measures angle between vectors (most common)
  • Euclidean distance — Measures straight-line distance
  • Dot product — Combines magnitude and angle
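
A minimal sketch of these three measures with NumPy; the vector values here are placeholders, since real embeddings have hundreds of dimensions:

    import numpy as np

    a = np.array([0.1, 0.3, 0.5])   # placeholder embedding vectors
    b = np.array([0.2, 0.1, 0.4])

    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle only
    euclidean = np.linalg.norm(a - b)                                # straight-line distance
    dot = np.dot(a, b)                                               # magnitude and angle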

Semantic Similarity

Similar concepts cluster together:

  • “king” and “queen” are close
  • “king” - “man” + “woman” ≈ “queen”
  • Code implementing the same algorithm clusters regardless of variable names
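
The analogy can be reproduced with classic word vectors, assuming the gensim library and its downloadable pre-trained GloVe model:

    import gensim.downloader as api

    # A small pre-trained GloVe model; downloads on first use
    vectors = api.load("glove-wiki-gigaword-50")

    # king - man + woman ≈ queen
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))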

Properties

  • Dense — Most values are non-zero (unlike sparse bag-of-words)
  • Learned — Representations emerge from training
  • Transferable — Embeddings from one task often work for others

Embedding Models

Text Embeddings

Models trained to encode text semantics.

Model                          Dimensions  Context (tokens)  Notes
OpenAI text-embedding-3-large  3072        8191              Strongest commercial
OpenAI text-embedding-3-small  1536        8191              Cost-effective
Cohere embed-v3                1024        512               Multilingual
Voyage-3                       1024        32000             Long context
BGE-large                      1024        512               Open-source SOTA
E5-mistral-7b                  4096        32768             LLM-based, powerful
Nomic-embed-text               768         8192              Open, long context
all-MiniLM-L6-v2               384         256               Fast, lightweight
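
A sketch of generating embeddings with two of the models above; the sentence-transformers call works for the open models such as all-MiniLM-L6-v2, while the OpenAI call assumes an API key is configured:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")        # 384-dim, runs locally
    local_vecs = model.encode(["How do I fix a slow database?"])

    # Hosted alternative (OpenAI text-embedding-3-small, 1536-dim)
    from openai import OpenAI
    client = OpenAI()
    resp = client.embeddings.create(model="text-embedding-3-small",
                                    input=["How do I fix a slow database?"])
    hosted_vec = resp.data[0].embedding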

Code Embeddings

Models specialised for source code; embeddings of functionally similar code cluster together even when identifiers differ.

Multimodal Embeddings

Joint text-image spaces:

  • CLIP (OpenAI) — Text and images in shared space
  • SigLIP (Google) — Improved CLIP variant
  • ImageBind (Meta) — Six modalities in one space
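
A sketch of embedding text and an image into CLIP's shared space, assuming the Hugging Face transformers and Pillow libraries and a hypothetical local image file:

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("photo.jpg")                       # hypothetical local file
    inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                       images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    print(outputs.logits_per_image.softmax(dim=-1))       # image-text similarity scores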

Applications

Semantic Search

Find relevant documents by meaning, not just keywords.

Query: "How do I fix a slow database?"
Matches: "Optimising PostgreSQL query performance" (semantically similar)

Retrieval-Augmented Generation

Core of RAG systems — retrieve relevant context for LLM generation.
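
A sketch of how retrieved context feeds the generation step; retrieve_top_k stands in for the search shown above, and the prompt template is illustrative:

    def build_rag_prompt(question, retrieved_chunks):
        # Stuff retrieved context into the prompt so the LLM answers grounded in it
        context = "\n\n".join(retrieved_chunks)
        return (f"Answer the question using only the context below.\n\n"
                f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

    # chunks = retrieve_top_k(question, k=3)   # hypothetical retrieval helper
    # answer = llm.generate(build_rag_prompt(question, chunks))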

Clustering & Classification

  • Group similar documents automatically
  • Zero-shot classification by comparing to class label embeddings (sketched after this list)
  • Anomaly detection (outliers in embedding space)
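
A sketch of zero-shot classification: embed the class labels, embed the text, and pick the closest label. Labels and the example sentence are illustrative:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    labels = ["sports", "politics", "technology"]
    label_vecs = model.encode(labels, normalize_embeddings=True)

    text_vec = model.encode(["The new GPU doubles training throughput"],
                            normalize_embeddings=True)
    print(labels[int(np.argmax(label_vecs @ text_vec.T))])   # likely prints "technology"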

Recommendations

  • “More like this” by finding nearest neighbours
  • User-item matching in recommendation systems

Deduplication

Find near-duplicate content by similarity threshold.
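
A sketch of threshold-based near-duplicate detection; the 0.9 cutoff is an arbitrary starting point to tune per corpus:

    import numpy as np

    def near_duplicates(vectors, threshold=0.9):
        # vectors: (n, d) array of unit-length embeddings
        sims = vectors @ vectors.T                      # pairwise cosine similarity
        pairs = np.argwhere(np.triu(sims, k=1) > threshold)
        return [(int(i), int(j)) for i, j in pairs]     # index pairs above the threshold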

Embedding Pipeline

Chunking

See Chunking Strategies. Chunk size affects embedding quality:

  • Too short: loses context
  • Too long: dilutes specific information
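
A minimal fixed-size chunker with overlap, as one illustration of the trade-off (the sizes are arbitrary):

    def chunk(text, size=500, overlap=50):
        # Fixed-size character windows with overlap so context isn't cut mid-thought
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]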

Normalisation

Most models output normalised vectors (unit length). If not, normalise before computing cosine similarity.
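
A one-line sketch of normalisation with NumPy:

    import numpy as np

    def normalise(v):
        return v / np.linalg.norm(v)   # unit length, so dot product equals cosine similarity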

Dimensionality Reduction

Reduce dimensions for efficiency:

  • PCA — Linear, fast
  • UMAP — Non-linear, preserves structure
  • Matryoshka embeddings — Models trained to truncate gracefully
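
Sketches of two of these options: PCA via scikit-learn, and Matryoshka-style truncation (which only behaves well for models trained for it). The random vectors stand in for real embeddings:

    import numpy as np
    from sklearn.decomposition import PCA

    vectors = np.random.rand(1000, 1024)                      # stand-in for real embeddings

    reduced = PCA(n_components=256).fit_transform(vectors)    # linear projection

    # Matryoshka-style: keep the leading dimensions, then re-normalise
    truncated = vectors[:, :256] / np.linalg.norm(vectors[:, :256], axis=1, keepdims=True)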

Batching

Embed in batches for efficiency. Most APIs accept batch inputs.
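
A sketch of batching with the OpenAI client shown earlier; batch size limits vary by provider, so 100 here is illustrative:

    def embed_all(texts, client, batch_size=100):
        vectors = []
        for i in range(0, len(texts), batch_size):
            resp = client.embeddings.create(model="text-embedding-3-small",
                                            input=texts[i:i + batch_size])
            vectors.extend(d.embedding for d in resp.data)
        return vectors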

Practical Considerations

Choosing a Model

Factors:

  • Quality — Check MTEB leaderboard
  • Dimensions — Higher = more expressive but more storage/compute
  • Context length — Must fit your chunks
  • Speed — Latency requirements
  • Cost — API pricing or self-hosting costs

Storage

See Vector Databases. Consider:

  • Quantisation (reduce precision to save space)
  • Approximate nearest neighbour (ANN) for speed
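
A sketch of exact versus approximate nearest-neighbour search with FAISS; the index choices and parameters are illustrative, and the random vectors stand in for real embeddings:

    import faiss
    import numpy as np

    d = 384
    vectors = np.random.rand(10000, d).astype("float32")
    faiss.normalize_L2(vectors)                     # unit length, so inner product = cosine

    exact = faiss.IndexFlatIP(d)                    # exact search
    ann = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)   # approximate, graph-based
    exact.add(vectors)
    ann.add(vectors)

    query = vectors[:1]
    scores, ids = ann.search(query, 5)              # top-5 approximate neighbours

Quantised index types (for example FAISS's IVF-PQ family) trade a little recall for much smaller storage.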

Versioning

Embeddings from different models aren’t comparable. If you change embedding models:

  • Re-embed all documents
  • Or maintain separate indices per model version

Evaluation

Benchmarks

  • MTEB (Massive Text Embedding Benchmark) — Standard leaderboard
  • BEIR — Information retrieval benchmark

Metrics

  • Retrieval: Recall@K, NDCG, MRR
  • Clustering: Silhouette score, V-measure
  • Classification: Accuracy, F1
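
A sketch of Recall@K, assuming ranked result IDs per query and a set of relevant IDs as ground truth:

    def recall_at_k(ranked_ids, relevant_ids, k):
        # Fraction of relevant items that appear in the top-k results
        hits = len(set(ranked_ids[:k]) & set(relevant_ids))
        return hits / len(relevant_ids) if relevant_ids else 0.0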

Tools & Libraries

API Providers

Resources