Reference notes.

Embeddings are dense vector representations that capture semantic meaning. They map discrete objects (words, sentences, images, code) into continuous vector spaces where similar items are close together.

Core Concepts

Vector Space

An embedding model maps an input to a high-dimensional vector (typically 384-3072 dimensions). In this space, similarity can be measured several ways:

  • Cosine similarity — Measures angle between vectors (most common)
  • Euclidean distance — Measures straight-line distance
  • Dot product — Combines magnitude and angle
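A minimal pure-Python sketch of these three measures, using toy vectors (real embeddings have hundreds of dimensions):

```python
import math

def dot(a, b):
    # Dot product: combines magnitude and angle
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    # Angle only: 1.0 = same direction, 0.0 = orthogonal, -1.0 = opposite
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    # Straight-line distance between the two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction, twice the magnitude
print(cosine_similarity(a, b))   # ≈ 1.0 (angle is zero)
print(euclidean_distance(a, b))  # > 0: magnitude matters for this metric
```

Note the difference: `b` points the same way as `a`, so cosine similarity is maximal even though the Euclidean distance between them is large.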

Semantic Similarity

Similar concepts cluster together:

  • “king” and “queen” are close
  • “king” - “man” + “woman” ≈ “queen”
  • Code implementing the same algorithm clusters regardless of variable names
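The analogy arithmetic can be illustrated with hand-crafted toy vectors (3 dimensions, loosely encoding royalty/male/female — purely illustrative, not real model output):

```python
import math

# Toy 3-D "embeddings"; real models learn these from data.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# king - man + woman
target = add(sub(vectors["king"], vectors["man"]), vectors["woman"])

# Nearest neighbour among the other words
best = max((w for w in vectors if w != "king"),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # "queen" for these toy vectors
```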

Properties

  • Dense — Most values are non-zero (unlike sparse bag-of-words)
  • Learned — Representations emerge from training
  • Transferable — Embeddings from one task often work for others

Embedding Models

Text Embeddings

Models trained to encode text semantics.

| Model | Dimensions | Context | Notes |
|---|---|---|---|
| Qwen3-Embedding-8B | Various | — | Top MTEB multilingual (70.58), 100+ languages |
| Cohere embed-v4 | 1024 | — | Top MTEB English (65.2), multimodal |
| OpenAI text-embedding-3-large | 3072 | 8191 | Strong commercial, Matryoshka support |
| OpenAI text-embedding-3-small | 1536 | 8191 | Cost-effective |
| NVIDIA NV-Embed | Various | — | Fine-tuned from Llama-3.1-8B, strong multilingual |
| EmbeddingGemma-300M | — | — | Google DeepMind, lightweight, on-device |
| Voyage-3 / Voyage-3-large | 1024 | 32000 | Long context, strong retrieval |
| Jina embeddings-v3 | 1024 | 8192 | Open, task-specific LoRA adapters |
| BGE-M3 | 1024 | 8192 | Top open-source (63.0 MTEB), 100+ languages |
| GTE-Qwen2 | 1536 | 32768 | LLM-based, strong MTEB performance |
| Nomic-embed-text-v1.5 | 768 | 8192 | Open, Matryoshka support |
| all-MiniLM-L6-v2 | 384 | 256 | Fast, lightweight |

Code Embeddings

Specialised for source code:

Multimodal Embeddings

Joint text-image spaces:

  • CLIP (OpenAI) — Text and images in shared space
  • SigLIP (Google) — Improved CLIP variant
  • ImageBind (Meta) — Six modalities in one space

Applications

Semantic Search

Find relevant documents by meaning, not just keywords.

Query: "How do I fix a slow database?"
Matches: "Optimising PostgreSQL query performance" (semantically similar)

Retrieval-Augmented Generation

Core of RAG systems — retrieve relevant context for LLM generation.

Clustering & Classification

  • Group similar documents automatically
  • Zero-shot classification by comparing to class label embeddings
  • Anomaly detection (outliers in embedding space)
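Zero-shot classification reduces to a nearest-label lookup: embed each class label, embed the document, and pick the closest label. A sketch with toy vectors (real label embeddings come from embedding the label names or descriptions):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy label embeddings (illustrative 3-D vectors).
labels = {
    "sports":  [0.9, 0.1, 0.0],
    "finance": [0.0, 0.9, 0.1],
    "cooking": [0.1, 0.0, 0.9],
}
doc_vec = [0.1, 0.8, 0.2]  # stand-in embedding of a document about interest rates

predicted = max(labels, key=lambda l: cosine(doc_vec, labels[l]))
print(predicted)  # "finance" for these toy vectors
```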

Recommendations

  • “More like this” by finding nearest neighbours
  • User-item matching in recommendation systems

Deduplication

Find near-duplicate content by similarity threshold.
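A sketch of threshold-based deduplication. The quadratic pairwise scan is fine for small collections; at scale you would use an approximate nearest-neighbour index instead:

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def near_duplicates(embeddings, threshold=0.95):
    """Return index pairs whose cosine similarity meets the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(embeddings), 2)
        if cosine(a, b) >= threshold
    ]

vecs = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
print(near_duplicates(vecs))  # [(0, 1)] — the first two are near-duplicates
```

The threshold is corpus- and model-dependent; 0.95 here is just a starting point to tune against labelled duplicate pairs.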

Embedding Pipeline

Chunking

See Chunking Strategies. Chunk size affects embedding quality:

  • Too short: loses context
  • Too long: dilutes specific information
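A minimal character-based chunker with overlap, as one simple baseline (production pipelines usually split on sentence or token boundaries):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks with overlap.

    Overlap preserves context that would otherwise be cut at chunk
    boundaries.
    """
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 500, chunk_size=200, overlap=50)
print([len(c) for c in chunks])  # [200, 200, 200]
```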

Normalisation

Most models output normalised vectors (unit length). If not, normalise before computing cosine similarity.
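Normalisation is a one-liner; once vectors are unit length, the dot product and cosine similarity coincide:

```python
import math

def normalise(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0:
        raise ValueError("cannot normalise the zero vector")
    return [x / n for x in v]

v = normalise([3.0, 4.0])
print(v)                                 # [0.6, 0.8]
print(math.sqrt(sum(x * x for x in v)))  # ≈ 1.0
```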

Dimensionality Reduction

Reduce dimensions for efficiency:

  • PCA — Linear, fast
  • UMAP — Non-linear, preserves structure
  • Matryoshka embeddings — Models trained to truncate gracefully
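For Matryoshka-trained models, dimensionality reduction is just truncation plus renormalisation. A sketch (the 4-D vector stands in for a full-size embedding; truncating a model not trained this way discards information arbitrarily):

```python
import math

def truncate_matryoshka(v, k):
    """Keep the first k dimensions and renormalise to unit length."""
    head = v[:k]
    n = math.sqrt(sum(x * x for x in head))
    return [x / n for x in head]

full = [0.5, 0.5, 0.5, 0.5]   # stand-in for e.g. a 3072-dim embedding
short = truncate_matryoshka(full, 2)
print(short)  # 2-D unit vector
```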

Batching

Embed in batches for efficiency. Most APIs accept batch inputs.
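Batching is a simple loop over slices; the commented API call is illustrative, not a specific provider's interface:

```python
def batched(items, batch_size):
    """Yield successive fixed-size batches (the last may be smaller)."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

texts = [f"document {i}" for i in range(10)]
# One API call per batch instead of one per text, e.g.:
# for batch in batched(texts, 4):
#     vectors = client.embed(batch)   # hypothetical client
print([len(b) for b in batched(texts, 4)])  # [4, 4, 2]
```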

Practical Considerations

Choosing a Model

Factors:

  • Quality — Check MTEB leaderboard
  • Dimensions — Higher = more expressive but more storage/compute
  • Context length — Must fit your chunks
  • Speed — Latency requirements
  • Cost — API pricing or self-hosting costs

Storage

See Vector Databases. Consider:

  • Quantisation (reduce precision to save space)
  • Approximate nearest neighbour (ANN) for speed
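A sketch of symmetric int8 quantisation, the simplest of the schemes vector databases offer (~4x smaller than float32, at some precision cost):

```python
def quantise_int8(v):
    """Store one float scale plus int8 values in [-127, 127]."""
    scale = max(abs(x) for x in v) / 127 or 1.0  # 1.0 guards the zero vector
    q = [round(x / scale) for x in v]
    return scale, q

def dequantise(scale, q):
    return [scale * x for x in q]

scale, q = quantise_int8([0.12, -0.5, 0.33])
approx = dequantise(scale, q)
print(approx)  # close to the original, within rounding error
```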

Versioning

Embeddings from different models aren’t comparable. If you change embedding models:

  • Re-embed all documents
  • Or maintain separate indices per model version

Evaluation

Benchmarks

  • MTEB (Massive Text Embedding Benchmark) — Standard leaderboard
  • BEIR — Information retrieval benchmark

Metrics

  • Retrieval: Recall@K, NDCG, MRR
  • Clustering: Silhouette score, V-measure
  • Classification: Accuracy, F1
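Recall@K, the most common retrieval metric above, is straightforward to compute from a ranked result list and a ground-truth set:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items appearing in the top-k retrieved results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

retrieved = ["d3", "d1", "d7", "d2"]   # ranked search results
relevant = {"d1", "d2"}                # ground-truth relevant docs
print(recall_at_k(retrieved, relevant, k=3))  # 0.5 (only d1 in the top 3)
```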

Tools & Libraries

API Providers

See Also

Resources