Tokenization
The process of splitting text into discrete units (tokens) that language models process numerically; fundamental to how LLMs read and generate text.
What it is
Tokenization is the process of converting text into a sequence of tokens — discrete units that the model can process. LLMs don't see characters or words directly; they see numeric IDs representing tokens from their vocabulary.
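At its core this is a mapping between strings and integer IDs. A minimal sketch, using an invented six-entry vocabulary (real vocabularies hold tens of thousands of entries learned from data):

```python
# Hypothetical toy vocabulary for illustration only.
vocab = {"low": 0, "er": 1, "est": 2, " ": 3, "token": 4, "##ization": 5}

# Encode: the model never sees "lower", only the IDs of its tokens.
ids = [vocab[t] for t in ["low", "er"]]
print(ids)  # → [0, 1]

# Decode: invert the mapping to recover the text.
inverse = {i: t for t, i in vocab.items()}
print("".join(inverse[i] for i in ids))  # → lower
```

The hard part, covered by the algorithms below, is deciding which strings belong in the vocabulary in the first place.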
Why not use words?
- Infinite vocabulary: new words, proper nouns, typos
- Languages: each language would need a different vocabulary
- Efficiency: rare words would waste vocabulary space
Main algorithms
BPE (Byte Pair Encoding)
The most common algorithm. It starts from individual characters (or bytes, in byte-level variants) and iteratively merges the most frequent adjacent pairs:
"lower" → ["low", "er"]
"lowest" → ["low", "est"]
Used by: GPT, Llama, Mistral.
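The merge loop above can be sketched in a few lines. This is a toy BPE trainer on an invented three-word corpus; production implementations add byte-level handling, merge-rank tables, and pre-tokenization:

```python
from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs, weighted by each word's corpus frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)  # fuse the pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word starts as a tuple of characters, with a frequency.
words = {tuple("lower"): 5, tuple("lowest"): 3, tuple("low"): 8}
for _ in range(2):  # two merges: ('l','o') -> 'lo', then ('lo','w') -> 'low'
    best = pair_counts(words).most_common(1)[0][0]
    words = merge_pair(words, best)
print(words)  # "low" is now a single learned token inside every word
```

After two merges, "lower" is represented as ("low", "e", "r") and "lowest" as ("low", "e", "s", "t"): the shared stem has become a reusable token, which is exactly how BPE keeps the vocabulary compact.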
WordPiece
Similar to BPE, but selects merges by how much they improve a language model's likelihood over the training data, rather than by raw pair frequency. Marks word-internal subwords with a ## prefix:
"tokenization" → ["token", "##ization"]
Used by: BERT, Google models.
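At inference time, WordPiece segments a word by greedy longest-match-first lookup against the vocabulary. A sketch with an invented mini-vocabulary (the real BERT vocabulary has ~30,000 entries):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation, as used at BERT inference time.
    Subwords after the first carry a '##' prefix in the vocabulary."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:  # shrink the candidate until it's in the vocabulary
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:  # no piece matches: the whole word becomes unknown
            return ["[UNK]"]
        tokens.append(match)
        start = end
    return tokens

# Hypothetical mini-vocabulary for illustration.
vocab = {"token", "##ization", "##ize", "##s"}
print(wordpiece_tokenize("tokenization", vocab))  # → ['token', '##ization']
print(wordpiece_tokenize("tokenizes", vocab))     # → ['token', '##ize', '##s']
```

Note how "tokenizes", never seen as a whole, still decomposes into known pieces; only a word with no matching pieces at all falls back to [UNK].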
SentencePiece
Treats input text as a raw stream of symbols without language-specific pre-tokenization, handling whitespace as an ordinary symbol. Useful for languages without clear word boundaries, such as Japanese or Chinese.
Practical implications
- Cost: billing is per token, not per word. "Tokenization" can be 1-3 tokens depending on the model
- Context: the context window is measured in tokens
- Languages: Spanish and other non-English languages typically consume more tokens for the same content, since most vocabularies are trained on English-heavy corpora
- Code: symbols and syntax can tokenize in unexpected ways
- Numbers: models tokenize numbers in ways that make arithmetic difficult
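Because billing is per token, cost estimates follow directly from token counts. A minimal estimator; the prices below are hypothetical placeholders, not any provider's real rates:

```python
def estimate_cost(n_input_tokens, n_output_tokens,
                  price_in_per_1m, price_out_per_1m):
    """Dollar cost of one request, given per-million-token prices.
    Check your provider's pricing page for real input/output rates."""
    return (n_input_tokens * price_in_per_1m
            + n_output_tokens * price_out_per_1m) / 1_000_000

# Example: 1,500 prompt tokens, 500 completion tokens,
# at hypothetical rates of $3/M input and $15/M output.
print(estimate_cost(1500, 500, 3.00, 15.00))  # → 0.012
```

The asymmetry matters: output tokens are usually priced several times higher than input tokens, so verbose completions dominate the bill even when prompts are long.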
Tools
- tiktoken — OpenAI tokenizer
- tokenizers — Hugging Face library
- Each provider's playground typically shows token counts
Example
Text: "Language models are fascinating"
GPT-4 tokens: ["Language", " models", " are", " fascinating"]
= 4 tokens
Text: "Los modelos de lenguaje son fascinantes"
GPT-4 tokens: ["Los", " modelos", " de", " lenguaje", " son", " fascin", "antes"]
= 7 tokens
The same concept uses ~75% more tokens in Spanish.
Why it matters
Tokenization determines how a model "sees" text. An inefficient tokenizer wastes context on redundant tokens, increases costs, and degrades quality in non-English languages. Understanding how it works enables prompt optimization, accurate cost estimation, and diagnosing unexpected model behavior.
References
- Neural Machine Translation of Rare Words with Subword Units — Sennrich et al., 2015. Original BPE paper for NLP.
- SentencePiece — Kudo & Richardson, 2018.