The maximum number of tokens an LLM can process in a single interaction, determining how much information it can consider simultaneously to generate responses.
The context window is the maximum number of tokens (subword units of text) that an LLM can process in a single interaction. It includes both the input (prompt, context, conversation history) and the output the model generates, so the two share a single budget. It's essentially the model's "working memory" during a session.
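Because input and output draw from the same budget, every token the prompt consumes reduces how long the response can be. A minimal sketch (the 8K window here is just an example value; the real limit is model-dependent):

```python
CONTEXT_WINDOW = 8_192  # example limit; the real value depends on the model

def max_output_tokens(prompt_tokens: int, window: int = CONTEXT_WINDOW) -> int:
    # Input and output share the same window: whatever the prompt
    # consumes is no longer available for the response.
    return max(window - prompt_tokens, 0)

print(max_output_tokens(6_000))  # 2192 tokens left for generation
```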
A token doesn't exactly equal a word. In English, 1,000 tokens correspond to approximately 750 words, while in Spanish they correspond to about 600 words, due to morphological differences between the languages. This window determines how much context the model can maintain to generate coherent and contextually relevant responses.
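To see the ratio in practice, token counts can be measured with the model's own tokenizer. A sketch using OpenAI's `tiktoken` library (the `cl100k_base` encoding is an assumption; use the encoding that matches your model):

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4 and GPT-3.5-turbo
encoding = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The context window limits how much text the model can see.",
    "Spanish": "La ventana de contexto limita cuánto texto puede ver el modelo.",
}

for language, text in samples.items():
    tokens = encoding.encode(text)
    print(f"{language}: {len(text.split())} words -> {len(tokens)} tokens")
```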
The context window is a hard constraint of the model, fixed by its architecture and training. It cannot be expanded dynamically at inference time: it is a fixed limit that defines the system's fundamental capabilities.
| Year | Model | Window | Approximate equivalent |
|---|---|---|---|
| 2022 | GPT-3.5 | 4K tokens | ~8 pages |
| 2023 | GPT-4 | 8K–32K tokens | ~16-64 pages |
| 2023 | Claude 2 | 100K tokens | ~200 pages |
| 2024 | Claude 3 | 200K tokens | ~400 pages |
| 2024 | Gemini 1.5 | 1M–2M tokens | ~2,000-4,000 pages |
This evolution reflects advances in efficient attention architectures, training techniques, and computational capacity. Recent models can process complete documents, extensive codebases, or very long conversations in a single interaction.
"Needle in a haystack" tests evaluate how well models can retrieve specific information within long contexts. Results show consistent patterns:
```python
# Example needle-in-haystack test
def needle_in_haystack_test(model, context_length, needle_position):
    # Generate distractor context (generate_distractor_text is an assumed
    # helper that returns roughly the requested number of tokens of filler)
    haystack = generate_distractor_text(context_length - 100)

    # Insert the "needle" (a specific fact) at the chosen position
    needle = "The secret key is: ALPHA-7829"
    if needle_position == "start":
        context = needle + "\n\n" + haystack
    elif needle_position == "middle":
        mid = len(haystack) // 2
        context = haystack[:mid] + "\n\n" + needle + "\n\n" + haystack[mid:]
    else:  # end
        context = haystack + "\n\n" + needle

    # Ask for the specific information
    prompt = context + "\n\nWhat is the secret key mentioned in the text?"
    response = model.generate(prompt)
    return "ALPHA-7829" in response
```
Dividing long documents requires preserving semantic coherence:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def smart_chunking(document, max_chunk_size=1000, overlap=200):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=max_chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    # Split the document's text (document is an object with
    # title and content attributes)
    chunks = splitter.split_text(document.content)

    # Add document context to each chunk
    enriched_chunks = []
    for i, chunk in enumerate(chunks):
        context_header = f"Document: {document.title}\nSection {i+1}/{len(chunks)}\n\n"
        enriched_chunks.append(context_header + chunk)
    return enriched_chunks
```
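As a usage sketch, `smart_chunking` expects a document object with `title` and `content` attributes; the `Document` class and file path below are hypothetical stand-ins for whatever document type your pipeline uses:

```python
from dataclasses import dataclass

@dataclass
class Document:
    title: str
    content: str

doc = Document(title="Annual Report 2024", content=open("report.txt").read())
chunks = smart_chunking(doc, max_chunk_size=1000, overlap=200)
print(f"{len(chunks)} chunks, first header:\n{chunks[0][:80]}")
```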
For long conversations or extensive document analysis, older messages can be compressed into a summary while recent ones are kept verbatim:

```python
def progressive_summarization(messages, max_context_tokens=8000):
    current_tokens = count_tokens(messages)
    if current_tokens <= max_context_tokens:
        return messages

    # Summarize older messages, preserve recent ones
    recent_messages = messages[-10:]  # Last 10 messages
    old_messages = messages[:-10]

    # llm and format_messages are assumed helpers: an LLM client
    # and a function that renders messages as plain text
    summary_prompt = f"""
    Summarize this conversation maintaining:
    - Important technical decisions
    - Problem context
    - Conclusions reached

    Conversation:
    {format_messages(old_messages)}
    """
    summary = llm.generate(summary_prompt)
    return [{"role": "system", "content": f"Previous conversation summary: {summary}"}] + recent_messages
```
In RAG systems, retrieved information is ordered by relevance so that the most useful documents fill the window first:

```python
def prioritize_context(query, retrieved_docs, max_tokens=6000):
    # Calculate semantic relevance of each document
    # (calculate_semantic_similarity is an assumed helper,
    # e.g. cosine similarity over embeddings)
    relevance_scores = []
    for doc in retrieved_docs:
        score = calculate_semantic_similarity(query, doc.content)
        relevance_scores.append((doc, score))

    # Sort by relevance, most relevant first
    sorted_docs = sorted(relevance_scores, key=lambda x: x[1], reverse=True)

    # Select documents until the token budget is exhausted
    selected_docs = []
    current_tokens = 0
    for doc, score in sorted_docs:
        doc_tokens = count_tokens(doc.content)
        if current_tokens + doc_tokens <= max_tokens:
            selected_docs.append(doc)
            current_tokens += doc_tokens
        else:
            break
    return selected_docs
```

Attention in Transformer architectures scales quadratically with sequence length (O(n²)): doubling the context window quadruples the required computation. Providers reflect this cost in their pricing.
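A back-of-the-envelope sketch of that scaling; the price constant is a hypothetical placeholder, not any provider's actual rate:

```python
# Attention compute grows ~n^2 with context length n, while input
# token cost grows linearly; providers price long contexts accordingly.
PRICE_PER_1K_INPUT = 0.01  # hypothetical $/1K input tokens

for n in [4_000, 8_000, 128_000]:
    attention_ops = n ** 2                       # relative attention cost
    input_cost = n / 1_000 * PRICE_PER_1K_INPUT  # linear token cost
    print(f"n={n:>7}: attention ~{attention_ops:.1e} (relative), input ~${input_cost:.2f}")
```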
Techniques like sparse attention, sliding window attention, and ring attention partially mitigate these costs, but the fundamental trade-off persists.
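As an illustration of the idea behind sliding window attention, each token attends only to a fixed-size window of preceding tokens, cutting attention cost from O(n²) to O(n·w); a minimal sketch of the mask:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    # mask[i, j] is True where token i may attend to token j:
    # causal (j <= i) and within the last `window` positions
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Each row shows which earlier tokens a given token can see
print(sliding_window_mask(6, 3).astype(int))
```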
The context window is the most important architectural constraint in LLM-based systems. It defines which usage patterns are viable: from RAG that must carefully select which documents to include, to agents that need to maintain state across multiple iterations.
For staff+ engineers, understanding these limitations is crucial for designing scalable systems. It's not just about "making it work," but optimizing for cost, latency, and accuracy simultaneously. Decisions about chunking, context caching, and summarization strategies directly impact user experience and operational costs.
Efficient context management distinguishes amateur implementations from robust production systems. It's the difference between a prototype that works with small documents and a platform that scales to enterprise knowledge bases.
Related terms:

- **Large language models (LLMs):** Massive neural networks based on the Transformer architecture, trained on enormous text corpora to understand and generate natural language, with emergent capabilities such as reasoning, translation, and code generation.
- **Prompt engineering:** The discipline of designing effective instructions for language models, combining clarity, structure, and examples to obtain consistent, high-quality responses.
- **Retrieval-augmented generation (RAG):** An architectural pattern that combines information retrieval from external sources with LLM text generation, reducing hallucinations and keeping knowledge current without retraining the model.
- **Prompt caching:** A technique that stores the internal computation of reused prompt prefixes across LLM calls, reducing costs by up to 90% and latency by up to 85% in applications with repetitive context.