Embeddings
Dense vector representations that capture the semantic meaning of text, images, or other data in a numerical space where proximity reflects conceptual similarity.
What it is
An embedding is a numerical representation of data (text, an image, audio) as a dense vector of fixed dimension. Its defining property is that semantically similar inputs produce nearby vectors in the space, while dissimilar inputs end up far apart.
For example, the embeddings for "dog" and "puppy" will be close together, while "dog" and "economics" will be far apart. This allows machines to operate on "meaning" mathematically.
How they work
Generation
An embedding model (like text-embedding-3-small from OpenAI or all-MiniLM-L6-v2 from Sentence Transformers) takes input text and produces a fixed-dimension vector — typically between 384 and 3,072 dimensions.
The model learns these representations during training, optimizing so that texts with similar meaning produce nearby vectors.
Similarity metrics
To compare embeddings, similarity and distance metrics are used:
- Cosine similarity: measures the angle between vectors, ignoring magnitude (most common)
- Dot product: like cosine similarity but sensitive to magnitude; equivalent to it when vectors are normalized
- Euclidean distance: straight-line geometric distance between the two points
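All three metrics fit in a few lines of plain Python. The vectors below are toy 3-dimensional stand-ins for real embeddings (which have hundreds or thousands of dimensions), chosen only to illustrate the comparison:

```python
import math

def dot(a, b):
    # Sum of element-wise products; sensitive to vector magnitude.
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Angle-based: 1.0 means same direction, 0.0 orthogonal, -1.0 opposite.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    # Straight-line distance between the two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy vectors standing in for "dog", "puppy", and "economics" embeddings.
dog = [0.8, 0.6, 0.1]
puppy = [0.7, 0.7, 0.2]
economics = [0.1, 0.2, 0.9]

print(cosine_similarity(dog, puppy))      # close to 1.0
print(cosine_similarity(dog, economics))  # much lower
```

In practice these loops are replaced by vectorized operations (NumPy, or the vector database itself), but the arithmetic is exactly this.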
Types of embeddings
- Word embeddings: one vector per word regardless of context (Word2Vec, GloVe); historically important but limited
- Sentence embeddings: one vector per sentence or paragraph — the current standard
- Multimodal: vectors representing text and images in the same space (CLIP)
Applications
| Application | How it uses embeddings | Similarity metric |
|---|---|---|
| Semantic search | Compares query embedding with document embeddings | Cosine similarity |
| RAG | Retrieves relevant chunks to give context to the LLM | Cosine similarity + reranking |
| Clustering | Groups documents by proximity in vector space | Euclidean distance or cosine |
| Duplicate detection | Identifies content with high similarity | Similarity threshold (> 0.9) |
| Recommendations | Suggests content close to user profile | k-nearest neighbors |
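The semantic search row boils down to one loop: embed the query, score it against every document embedding, return the top matches. A minimal sketch over hypothetical precomputed vectors (real ones would come from an embedding model, not be hand-written):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical document embeddings; real ones come from a model and
# have hundreds of dimensions.
documents = {
    "dog care basics": [0.9, 0.4, 0.1],
    "puppy training tips": [0.8, 0.5, 0.2],
    "intro to macroeconomics": [0.1, 0.2, 0.9],
}

def search(query_embedding, k=2):
    # Rank every document by similarity to the query; keep the top k.
    ranked = sorted(
        documents.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:k]]

query = [0.85, 0.45, 0.15]  # stand-in for an embedded query about dogs
print(search(query))
```

This brute-force scan is fine for thousands of documents; at larger scale, vector databases replace it with approximate nearest-neighbor indexes.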
Practical considerations
- Dimensionality vs. performance: more dimensions capture more nuance but require more storage and compute
- Model matters: the same text produces different embeddings with different models — they're not interchangeable
- Chunking: for long documents, it's better to generate embeddings per chunk than per complete document
- Normalization: some models require normalizing vectors before comparison
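The chunking point above can be sketched with the simplest common strategy: fixed-size windows with overlap, so a sentence cut at a boundary still appears whole in at least one chunk. Character counts here are illustrative, not recommendations; production systems often chunk by tokens or by sentence boundaries:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character windows that overlap,
    so content cut at one boundary appears whole in a neighbor."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# Each chunk (not the whole document) would then be embedded separately.
document = "word " * 200  # stand-in for a long document
pieces = chunk_text(document, chunk_size=120, overlap=30)
print(len(pieces), len(pieces[0]))
```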
Why it matters
Embeddings are the foundation of semantic search, RAG systems, and content classification. Without them, AI applications are limited to exact text matching. Understanding their properties (dimensionality, choice of similarity metric, model compatibility, language coverage) is essential for building effective information retrieval systems.
References
- Efficient Estimation of Word Representations in Vector Space — Mikolov et al., 2013. The original Word2Vec paper.
- Sentence-BERT — Reimers & Gurevych, 2019. Efficient sentence embeddings based on BERT.
- Text Embeddings by Weakly-Supervised Contrastive Pre-training — Wang et al., 2022. E5, general-purpose text embeddings.