Concepts

Tokenization

Process of splitting text into discrete units (tokens) that language models can process numerically, fundamental to how LLMs understand and generate text.

seed#tokenization#bpe#tokens#nlp#llm#preprocessing

What it is

Tokenization is the process of converting text into a sequence of tokens — discrete units that the model can process. LLMs don't see characters or words directly; they see numeric IDs representing tokens from their vocabulary.

Why not use words?

  • Infinite vocabulary: new words, proper nouns, typos
  • Languages: each language would need a different vocabulary
  • Efficiency: rare words would waste vocabulary space

Main algorithms

BPE (Byte Pair Encoding)

The most common algorithm. It starts with individual characters (or raw bytes) and iteratively merges the most frequent adjacent pair into a new vocabulary entry:

"lower" → ["low", "er"]
"lowest" → ["low", "est"]

Used by: GPT, Llama, Mistral.
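The merge loop can be sketched in a few lines of plain Python. This is a toy, not any library's API: the tiny corpus and its frequencies are made up, and real implementations work over much larger data with many more merge steps.

```python
from collections import Counter

def get_pairs(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of the winning pair with its merged symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in words.items()}

# Words start as space-separated characters; frequencies are illustrative.
corpus = {"l o w": 5, "l o w e r": 2, "l o w e s t": 2}
for _ in range(2):  # two merge steps: first 'l'+'o', then 'lo'+'w'
    best = max(get_pairs(corpus), key=get_pairs(corpus).get)
    corpus = merge_pair(best, corpus)

print(corpus)  # → {'low': 5, 'low e r': 2, 'low e s t': 2}
```

After two merges, "low" has become a single vocabulary symbol, which is exactly how frequent stems end up as standalone tokens.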

WordPiece

Similar to BPE, but it picks merges that maximize training-data likelihood under a language model rather than raw pair frequency. Uses a ## prefix to mark subwords that continue a word:

"tokenization" → ["token", "##ization"]

Used by: BERT, Google models.
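The ## convention can be illustrated with greedy longest-match-first lookup at inference time, assuming a small hand-made vocabulary (real WordPiece vocabularies are learned from data, and this sketch skips details like per-word length limits):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first split; non-initial pieces carry the ## prefix."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-of-word marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate until it matches
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary fragment
vocab = {"token", "##ization", "##ize", "##s"}
print(wordpiece_tokenize("tokenization", vocab))  # → ['token', '##ization']
```

The same vocabulary also handles "tokens" as ["token", "##s"], while a word with no matching pieces falls back to the unknown token.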

SentencePiece

Operates on raw text without language-specific pre-tokenization, treating whitespace as an ordinary symbol, and can train either BPE or Unigram models under the hood. Useful for languages without clear word boundaries, such as Japanese or Chinese.
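Working at the byte level is what eliminates out-of-vocabulary failures: any string reduces to UTF-8 bytes, so 256 base symbols cover every language, and merges are learned on top of that. A minimal illustration of the idea (not a SentencePiece API call):

```python
text = "días"  # the accented 'í' has no single dedicated byte
byte_ids = list(text.encode("utf-8"))

print(byte_ids)                   # 'í' expands to two bytes (195, 173)
print(len(text), len(byte_ids))   # 4 characters, 5 bytes
```

Nothing is ever "unknown" at this level; rare characters simply cost more base symbols before any merges apply.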

Practical implications

  • Cost: billing is per token, not per word. "Tokenization" can be 1-3 tokens depending on the model
  • Context: the context window is measured in tokens
  • Languages: Spanish and other non-English languages typically need more tokens for the same content, since tokenizer vocabularies are trained mostly on English text
  • Code: symbols and syntax can tokenize in unexpected ways
  • Numbers: models tokenize numbers in ways that make arithmetic difficult
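Because billing is per token, a rough cost estimate needs only a token count and a rate. A sketch with placeholder prices (the per-1K rates below are made up; check your provider's current pricing):

```python
def estimate_cost(n_input_tokens, n_output_tokens,
                  price_in_per_1k=0.01, price_out_per_1k=0.03):
    """Hypothetical per-1K-token rates; real prices vary by model and provider."""
    return (n_input_tokens / 1000) * price_in_per_1k \
         + (n_output_tokens / 1000) * price_out_per_1k

# e.g. a 2,000-token prompt with a 500-token reply
print(f"${estimate_cost(2000, 500):.4f}")  # → $0.0350
```

Output tokens are usually priced higher than input tokens, which is why long generations dominate the bill even for short prompts.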

Tools

  • tiktoken — OpenAI tokenizer
  • tokenizers — Hugging Face library
  • Each provider's playground typically shows token counts

Example

Text: "Language models are fascinating"
GPT-4 tokens: ["Language", " models", " are", " fascinating"]
= 4 tokens

Text: "Los modelos de lenguaje son fascinantes"
GPT-4 tokens: ["Los", " modelos", " de", " lenguaje", " son", " fascin", "antes"]
= 7 tokens

The same sentence uses ~75% more tokens (7 vs. 4) in Spanish.

Why it matters

Tokenization determines how a model "sees" text. An inefficient tokenizer wastes context on redundant tokens, increases costs, and degrades quality in non-English languages. Understanding how it works enables prompt optimization, accurate cost estimation, and diagnosing unexpected model behavior.

References

Concepts