Concepts

Embeddings

Dense vector representations that capture the semantic meaning of text, images, or other data in a numerical space where proximity reflects conceptual similarity.

#seed #embeddings #vectors #nlp #semantic-similarity #representation-learning

What it is

An embedding is a numerical representation of data (text, image, audio) as a dense vector of fixed dimension. The fundamental property is that semantically similar data produces nearby vectors in the space, while unrelated data ends up far apart.

For example, the embeddings for "dog" and "puppy" will be close together, while "dog" and "economics" will be far apart. This allows machines to operate on "meaning" mathematically.
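
This property can be illustrated with hand-made two-dimensional vectors. The numbers below are invented for the sketch; real embeddings have hundreds of dimensions and come from a trained model:

```python
import math

# Toy 2D vectors, hand-picked for illustration only -- a real embedding
# model produces these from training, not by hand.
vectors = {
    "dog":       (0.9, 0.4),
    "puppy":     (0.8, 0.5),
    "economics": (-0.3, 0.9),
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine(vectors["dog"], vectors["puppy"]))      # high: close in meaning
print(cosine(vectors["dog"], vectors["economics"]))  # low: unrelated
```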

How they work

Generation

An embedding model (like text-embedding-3-small from OpenAI or all-MiniLM-L6-v2 from Sentence Transformers) takes input text and produces a fixed-dimension vector — typically between 384 and 3,072 dimensions.

The model learns these representations during training, optimizing so that texts with similar meaning produce nearby vectors.
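
The fixed-dimension property can be mimicked with a hash-based toy "embedder". This is a shape-only sketch: `toy_embed` is a made-up function, and its vectors carry no semantics, unlike those of a trained model:

```python
import hashlib

DIM = 384  # a typical small-model dimension

def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Toy stand-in for an embedding model: hashes each token into a
    fixed slot of a vector. It only mimics the *shape* of the output
    (every input maps to the same dimension), not its meaning."""
    vec = [0.0] * dim
    for token in text.lower().split():
        slot = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[slot] += 1.0
    return vec

print(len(toy_embed("dogs are loyal animals")))  # always DIM, regardless of input length
```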

Similarity metrics

To compare embeddings, similarity and distance metrics are used:

  • Cosine similarity: measures the angle between vectors (most common)
  • Dot product: similar to cosine but sensitive to magnitude
  • Euclidean distance: direct geometric distance between points
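
The three metrics can be written out in a few lines of NumPy. The two vectors here are arbitrary illustrative values:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

dot = float(np.dot(a, b))                            # sensitive to magnitude
cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # angle only, in [-1, 1]
euclid = float(np.linalg.norm(a - b))                # straight-line distance

# For unit-length (normalized) vectors, dot product equals cosine similarity:
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert abs(float(np.dot(a_n, b_n)) - cos) < 1e-9
```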

Types of embeddings

  • Word embeddings: one vector per word (Word2Vec, GloVe) — historical but limited
  • Sentence embeddings: one vector per sentence or paragraph — the current standard
  • Multimodal: vectors representing text and images in the same space (CLIP)

Applications

  • Semantic search: compares the query embedding with document embeddings (metric: cosine similarity)
  • RAG: retrieves relevant chunks to give the LLM context (metric: cosine similarity + reranking)
  • Classification: groups documents by proximity in vector space (metric: Euclidean distance or cosine)
  • Duplicate detection: identifies content with high similarity (metric: similarity threshold, e.g. > 0.9)
  • Recommendations: suggests content close to the user profile (metric: k-nearest neighbors)
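
Semantic search can be sketched as a brute-force cosine comparison over a toy corpus. The three-dimensional "embeddings" below are hand-made assumptions; a real system would obtain them from an embedding model and usually query a vector index instead of scanning every document:

```python
import numpy as np

# Rows are documents, columns are (toy) embedding dimensions.
docs = ["feeding your puppy", "stock market basics", "dog training tips"]
doc_vecs = np.array([
    [0.9, 0.1, 0.4],
    [0.1, 0.9, 0.2],
    [0.8, 0.2, 0.5],
])
query_vec = np.array([0.85, 0.15, 0.45])  # pretend embedding of "dog care"

# Normalize rows; cosine similarity then reduces to a matrix-vector product.
doc_norm = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
q_norm = query_vec / np.linalg.norm(query_vec)
scores = doc_norm @ q_norm

# Rank documents from most to least similar to the query.
for i in np.argsort(scores)[::-1]:
    print(f"{scores[i]:.3f}  {docs[i]}")
```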

Practical considerations

  • Dimensionality vs. performance: more dimensions capture more nuance but require more storage and compute
  • Model matters: the same text produces different embeddings with different models — they're not interchangeable
  • Chunking: for long documents, it's better to generate embeddings per chunk than per complete document
  • Normalization: some models require normalizing vectors before comparison
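
The chunking point above can be sketched as a simple word-window splitter. The `size` and `overlap` defaults are illustrative values, not recommendations:

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word-count windows; each chunk would
    then get its own embedding instead of one vector per document."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append(" ".join(words[start:start + size]))
    return chunks

doc = " ".join(f"word{i}" for i in range(500))
chunks = chunk_words(doc)
print(len(chunks))  # overlapping chunks covering the whole document
```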

Why it matters

Embeddings are the foundation of semantic search, RAG systems, and content classification. Without them, AI applications are limited to exact text matching. Understanding their properties — dimensionality, cosine distance, language limitations — is essential for building effective information retrieval systems.
