Concepts

Tokenization

Process of splitting text into discrete units (tokens) that language models can process numerically, fundamental to how LLMs understand and generate text.

seed#tokenization#bpe#tokens#nlp#llm#preprocessing

What it is

Tokenization is the process of converting text into a sequence of tokens — discrete units that the model can process. LLMs don't see characters or words directly; they see numeric IDs representing tokens from their vocabulary.

Why not use words?

  • Infinite vocabulary: new words, proper nouns, typos
  • Languages: each language would need a different vocabulary
  • Efficiency: rare words would waste vocabulary space

Main algorithms

BPE (Byte Pair Encoding)

The most common algorithm. It starts with individual characters (or raw bytes) and iteratively merges the most frequent adjacent pair into a new vocabulary entry:

"lower" → ["low", "er"]
"lowest" → ["low", "est"]

Used by: GPT, Llama, Mistral.
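The merge loop can be sketched in a few lines of plain Python. This is a toy, not any library's API: the tiny corpus and its frequencies are made up, and real implementations work over much larger data with many more merge steps.

```python
from collections import Counter

def get_pairs(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of the winning pair with its merged symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in words.items()}

# Words start as space-separated characters; frequencies are illustrative.
corpus = {"l o w": 5, "l o w e r": 2, "l o w e s t": 2}
for _ in range(2):  # two merge steps: first 'l'+'o', then 'lo'+'w'
    best = max(get_pairs(corpus), key=get_pairs(corpus).get)
    corpus = merge_pair(best, corpus)

print(corpus)  # → {'low': 5, 'low e r': 2, 'low e s t': 2}
```

After two merges, "low" has become a single vocabulary symbol, which is exactly how frequent stems end up as standalone tokens.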

WordPiece

Similar to BPE, but it picks merges that maximize training-data likelihood under a language model rather than raw pair frequency. Uses a ## prefix to mark subwords that continue a word:

"tokenization" → ["token", "##ization"]

Used by: BERT, Google models.
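The ## convention can be illustrated with greedy longest-match-first lookup at inference time, assuming a small hand-made vocabulary (real WordPiece vocabularies are learned from data, and this sketch skips details like per-word length limits):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first split; non-initial pieces carry the ## prefix."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-of-word marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate until it matches
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary fragment
vocab = {"token", "##ization", "##ize", "##s"}
print(wordpiece_tokenize("tokenization", vocab))  # → ['token', '##ization']
```

The same vocabulary also handles "tokens" as ["token", "##s"], while a word with no matching pieces falls back to the unknown token.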

SentencePiece

Operates on raw text without language-specific pre-tokenization, treating whitespace as an ordinary symbol, and can train either BPE or Unigram models under the hood. Useful for languages without clear word boundaries, such as Japanese or Chinese.
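Working at the byte level is what eliminates out-of-vocabulary failures: any string reduces to UTF-8 bytes, so 256 base symbols cover every language, and merges are learned on top of that. A minimal illustration of the idea (not a SentencePiece API call):

```python
text = "días"  # the accented 'í' has no single dedicated byte
byte_ids = list(text.encode("utf-8"))

print(byte_ids)                   # 'í' expands to two bytes (195, 173)
print(len(text), len(byte_ids))   # 4 characters, 5 bytes
```

Nothing is ever "unknown" at this level; rare characters simply cost more base symbols before any merges apply.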

Practical implications

  • Cost: billing is per token, not per word. "Tokenization" can be 1-3 tokens depending on the model
  • Context: the context window is measured in tokens
  • Languages: Spanish and other non-English languages typically need more tokens for the same content, since tokenizer vocabularies are trained mostly on English text
  • Code: symbols and syntax can tokenize in unexpected ways
  • Numbers: models tokenize numbers in ways that make arithmetic difficult
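Because billing is per token, a rough cost estimate needs only a token count and a rate. A sketch with placeholder prices (the per-1K rates below are made up; check your provider's current pricing):

```python
def estimate_cost(n_input_tokens, n_output_tokens,
                  price_in_per_1k=0.01, price_out_per_1k=0.03):
    """Hypothetical per-1K-token rates; real prices vary by model and provider."""
    return (n_input_tokens / 1000) * price_in_per_1k \
         + (n_output_tokens / 1000) * price_out_per_1k

# e.g. a 2,000-token prompt with a 500-token reply
print(f"${estimate_cost(2000, 500):.4f}")  # → $0.0350
```

Output tokens are usually priced higher than input tokens, which is why long generations dominate the bill even for short prompts.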

Tools

  • tiktoken — OpenAI tokenizer
  • tokenizers — Hugging Face library
  • Each provider's playground typically shows token counts

Example

Text: "Language models are fascinating"
GPT-4 tokens: ["Language", " models", " are", " fascinating"]
= 4 tokens

Text: "Los modelos de lenguaje son fascinantes"
GPT-4 tokens: ["Los", " modelos", " de", " lenguaje", " son", " fascin", "antes"]
= 7 tokens

The same sentence uses ~75% more tokens (7 vs. 4) in Spanish.

Why it matters

Tokenization determines how a model "sees" text. An inefficient tokenizer wastes context on redundant tokens, increases costs, and degrades quality in non-English languages. Understanding how it works enables prompt optimization, accurate cost estimation, and diagnosing unexpected model behavior.

References

Concepts