Jonatan Mata · jonmatum.com
© 2026 Jonatan Mata. All rights reserved. v2.1.1
Concepts

Embeddings

Dense vector representations that capture the semantic meaning of text, images, or other data in a numerical space where proximity reflects conceptual similarity.

evergreen · #embeddings #vectors #nlp #semantic-similarity #representation-learning

What it is

An embedding is a numerical representation of data — text, images, audio — as a fixed-dimension dense vector. The fundamental property is that semantically similar inputs map to nearby vectors in the space, while dissimilar inputs end up far apart.

For example, the embeddings for "dog" and "puppy" will be close together, while "dog" and "economics" will be far apart. This allows machines to operate on "meaning" mathematically.

How they work

Generation

An embedding model (like text-embedding-3-small from OpenAI or all-MiniLM-L6-v2 from Sentence Transformers) takes input text and produces a fixed-dimension vector — typically between 384 and 3,072 dimensions.

The model learns these representations during training, optimizing so that texts with similar meaning produce nearby vectors.

Similarity metrics

To compare embeddings, distance metrics are used:

  • Cosine similarity: measures the angle between vectors, ignoring magnitude (most common)
  • Dot product: equivalent to cosine similarity when vectors are normalized; otherwise sensitive to magnitude
  • Euclidean distance: straight-line geometric distance between the points
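As an illustrative sketch of the three metrics, using plain NumPy on toy 3-dimensional vectors rather than real embeddings:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

# Cosine similarity: angle only — parallel vectors give 1.0
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product: direction *and* magnitude — grows with vector length
dot = np.dot(a, b)

# Euclidean distance: straight-line distance between the points
euclidean = np.linalg.norm(a - b)

print(cosine)     # ≈ 1.0 — same direction, magnitude ignored
print(dot)        # 28.0 — rewards the larger magnitude of b
print(euclidean)  # ≈ 3.74 — the points are not at the same location
```

Note how the three metrics disagree about these two vectors: cosine says "identical", while dot product and Euclidean distance are both affected by the difference in magnitude. This is why cosine similarity is the default for comparing embeddings.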

Types of embeddings

  • Word embeddings: one static vector per word (Word2Vec, GloVe) — historically important, but a single vector cannot disambiguate polysemous words
  • Sentence embeddings: one contextual vector per sentence or paragraph — the current standard
  • Multimodal embeddings: vectors representing text and images in the same space (e.g., CLIP)

Example with Sentence Transformers

from sentence_transformers import SentenceTransformer
import numpy as np
 
model = SentenceTransformer("all-MiniLM-L6-v2")
 
texts = [
    "The dog runs through the park",
    "A puppy plays in the garden",
    "Inflation affects the global economy"
]
 
embeddings = model.encode(texts)
 
# Cosine similarity between first two (semantically close)
sim_01 = np.dot(embeddings[0], embeddings[1]) / (
    np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
)
# sim_01 ≈ 0.68 (high similarity)
 
# Similarity between first and third (semantically distant)
sim_02 = np.dot(embeddings[0], embeddings[2]) / (
    np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[2])
)
# sim_02 ≈ 0.05 (low similarity)

Popular models

| Model | Dimensions | Max context | Typical use |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 256 tokens | Fast prototyping, low cost |
| text-embedding-3-small (OpenAI) | 1,536 | 8,191 tokens | Production with API |
| text-embedding-3-large (OpenAI) | 3,072 | 8,191 tokens | Maximum quality |
| amazon.titan-embed-text-v2 | 1,024 | 8,192 tokens | AWS Bedrock |
| voyage-3 (Voyage AI) | 1,024 | 32,000 tokens | Long context, code |

The choice depends on the balance between quality, cost, and latency. For most RAG applications, a 1,024-dimension model offers a good balance.

Applications

| Application | How it uses embeddings | Similarity metric |
|---|---|---|
| Semantic search | Compares the query embedding with document embeddings | Cosine similarity |
| RAG | Retrieves relevant chunks to give context to the LLM | Cosine similarity + reranking |
| Clustering | Groups documents by proximity in vector space | Euclidean distance or cosine |
| Duplicate detection | Identifies content pairs with very high similarity | Similarity threshold (e.g., > 0.9) |
| Recommendations | Suggests content close to the user profile | k-nearest neighbors |
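The duplicate-detection row can be sketched with a threshold over pairwise cosine similarity. This uses toy vectors in place of real embeddings, and the 0.9 threshold is the illustrative value from the table, not a universal constant:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_duplicates(embeddings: np.ndarray, threshold: float = 0.9) -> list[tuple[int, int]]:
    """Return index pairs whose cosine similarity exceeds the threshold."""
    pairs = []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_sim(embeddings[i], embeddings[j]) > threshold:
                pairs.append((i, j))
    return pairs

# Toy vectors: 0 and 1 point in nearly the same direction, 2 is orthogonal
vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.99, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
print(find_duplicates(vectors))  # [(0, 1)]
```

The O(n²) pairwise loop is fine for small collections; at scale, the same threshold logic runs over an approximate nearest-neighbor index in a vector database instead.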

Practical considerations

  • Dimensionality vs. performance: more dimensions capture more nuance but require more storage and compute
  • Model matters: the same text produces different embeddings with different models — they're not interchangeable
  • Chunking: for long documents, it's better to generate embeddings per chunk than per complete document
  • Normalization: some models require normalizing vectors before comparison
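The chunking and normalization points above can be sketched as follows. The chunker is a deliberately simplistic fixed-size word splitter with overlap (real pipelines usually split on tokens or sentences), and the normalization step shows why it matters: after L2 normalization, the dot product equals cosine similarity.

```python
import numpy as np

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks (simplified sketch)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the document
    return chunks

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length; dot product then equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

doc = ("word " * 500).strip()
chunks = chunk_text(doc, chunk_size=200, overlap=50)
print(len(chunks))  # 3 — chunks of up to 200 words with a 50-word overlap

v = l2_normalize(np.array([[3.0, 4.0], [1.0, 0.0]]))
print(np.linalg.norm(v[0]))  # 1.0 — [3, 4] rescaled to [0.6, 0.8]
```

Overlap between consecutive chunks prevents a sentence that straddles a boundary from being lost to both chunks during retrieval.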

Why it matters

Embeddings are the foundation of semantic search, RAG systems, and content classification. Without them, AI applications are limited to exact text matching. Understanding their properties — dimensionality, cosine distance, language limitations — is essential for building effective information retrieval systems.

References

  • Efficient Estimation of Word Representations in Vector Space — Mikolov et al., 2013. The original Word2Vec paper.
  • Sentence-BERT — Reimers & Gurevych, 2019. Efficient sentence embeddings based on BERT.
  • Text Embeddings by Weakly-Supervised Contrastive Pre-training — Wang et al., 2022. E5, general-purpose text embeddings.
  • MTEB: Massive Text Embedding Benchmark — Hugging Face, 2022. Benchmark for comparing embedding models.
  • Pretrained Models — Sentence Transformers — SBERT, 2024. Catalog of pretrained models with metrics.

Related content

  • Neural Networks

    Computational models inspired by brain structure that learn patterns from data, forming the foundation of modern artificial intelligence systems.

  • Semantic Search

    Information retrieval technique that uses vector embeddings to find results by meaning, not just exact keyword matching.

  • Large Language Models

    Massive neural networks based on the Transformer architecture, trained on enormous text corpora to understand and generate natural language with emergent capabilities like reasoning, translation, and code generation.

  • Retrieval-Augmented Generation

    Architectural pattern that combines information retrieval from external sources with LLM text generation, reducing hallucinations and keeping knowledge current without retraining the model.

  • Building a Second Brain in Public

    Chronicle of building a second brain with a knowledge graph, bilingual pipeline, and agent endpoints — in days, not weeks, and what that teaches about the gap between theory and working systems.

  • Vector Databases

    Storage systems specialized in indexing and searching high-dimensional vectors efficiently, enabling semantic search and RAG applications at scale.

  • Tokenization

    Process of splitting text into discrete units (tokens) that language models can process numerically, fundamental to how LLMs understand and generate text.
