The maximum number of tokens an LLM can process in a single interaction, determining how much information it can consider simultaneously to generate responses.
The context window is the maximum number of tokens (subword units of text) that an LLM can process in a single interaction. It includes both the input (prompt, context, conversation history) and the output the model generates, so the two share a single budget. It's essentially the model's "working memory" during a session.
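Because input and output draw from the same budget, every token the prompt consumes reduces how long the response can be. A minimal sketch (the 8K window here is just an example value; the real limit is model-dependent):

```python
CONTEXT_WINDOW = 8_192  # example limit; the real value depends on the model

def max_output_tokens(prompt_tokens: int, window: int = CONTEXT_WINDOW) -> int:
    # Input and output share the same window: whatever the prompt
    # consumes is no longer available for the response.
    return max(window - prompt_tokens, 0)

print(max_output_tokens(6_000))  # 2192 tokens left for generation
```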
A token doesn't exactly equal a word. In English, 1,000 tokens correspond to approximately 750 words, while in Spanish they correspond to about 600 words, due to morphological differences between the languages. This window determines how much context the model can maintain to generate coherent and contextually relevant responses.
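To see the ratio in practice, token counts can be measured with the model's own tokenizer. A sketch using OpenAI's `tiktoken` library (the `cl100k_base` encoding is an assumption; use the encoding that matches your model):

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4 and GPT-3.5-turbo
encoding = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The context window limits how much text the model can see.",
    "Spanish": "La ventana de contexto limita cuánto texto puede ver el modelo.",
}

for language, text in samples.items():
    tokens = encoding.encode(text)
    print(f"{language}: {len(text.split())} words -> {len(tokens)} tokens")
```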
The context window is a hard constraint of the model, fixed by its architecture and training. It cannot be expanded dynamically at inference time: it is a fixed limit that defines the system's fundamental capabilities.
| Year | Model | Window | Approximate equivalent |
|---|---|---|---|
| 2022 | GPT-3.5 | 4K tokens | ~8 pages |
| 2023 | GPT-4 | 8K–32K tokens | ~16-64 pages |
| 2023 | Claude 2 | 100K tokens | ~200 pages |
| 2024 | Claude 3 | 200K tokens | ~400 pages |
| 2024 | Gemini 1.5 | 1M–2M tokens | ~2,000-4,000 pages |
This evolution reflects advances in efficient attention architectures, training techniques, and computational capacity. Recent models can process complete documents, extensive codebases, or very long conversations in a single interaction.
"Needle in a haystack" tests evaluate how well models can retrieve specific information within long contexts. Results show consistent patterns:
```python
# Example needle-in-haystack test
def needle_in_haystack_test(model, context_length, needle_position):
    # Generate distractor context (generate_distractor_text is an assumed
    # helper that returns roughly the requested number of tokens of filler)
    haystack = generate_distractor_text(context_length - 100)

    # Insert the "needle" (a specific fact) at the chosen position
    needle = "The secret key is: ALPHA-7829"
    if needle_position == "start":
        context = needle + "\n\n" + haystack
    elif needle_position == "middle":
        mid = len(haystack) // 2
        context = haystack[:mid] + "\n\n" + needle + "\n\n" + haystack[mid:]
    else:  # end
        context = haystack + "\n\n" + needle

    # Ask for the specific information
    prompt = context + "\n\nWhat is the secret key mentioned in the text?"
    response = model.generate(prompt)
    return "ALPHA-7829" in response
```
Dividing long documents requires preserving semantic coherence:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def smart_chunking(document, max_chunk_size=1000, overlap=200):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=max_chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    # Split the document's text (document is an object with
    # title and content attributes)
    chunks = splitter.split_text(document.content)

    # Add document context to each chunk
    enriched_chunks = []
    for i, chunk in enumerate(chunks):
        context_header = f"Document: {document.title}\nSection {i+1}/{len(chunks)}\n\n"
        enriched_chunks.append(context_header + chunk)
    return enriched_chunks
```
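As a usage sketch, `smart_chunking` expects a document object with `title` and `content` attributes; the `Document` class and file path below are hypothetical stand-ins for whatever document type your pipeline uses:

```python
from dataclasses import dataclass

@dataclass
class Document:
    title: str
    content: str

doc = Document(title="Annual Report 2024", content=open("report.txt").read())
chunks = smart_chunking(doc, max_chunk_size=1000, overlap=200)
print(f"{len(chunks)} chunks, first header:\n{chunks[0][:80]}")
```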
For long conversations or extensive document analysis, older messages can be compressed into a summary while recent ones are kept verbatim:

```python
def progressive_summarization(messages, max_context_tokens=8000):
    current_tokens = count_tokens(messages)
    if current_tokens <= max_context_tokens:
        return messages

    # Summarize older messages, preserve recent ones
    recent_messages = messages[-10:]  # Last 10 messages
    old_messages = messages[:-10]

    # llm and format_messages are assumed helpers: an LLM client
    # and a function that renders messages as plain text
    summary_prompt = f"""
    Summarize this conversation maintaining:
    - Important technical decisions
    - Problem context
    - Conclusions reached

    Conversation:
    {format_messages(old_messages)}
    """
    summary = llm.generate(summary_prompt)
    return [{"role": "system", "content": f"Previous conversation summary: {summary}"}] + recent_messages
```
In RAG systems, retrieved information is ordered by relevance so that the most useful documents fill the window first:

```python
def prioritize_context(query, retrieved_docs, max_tokens=6000):
    # Calculate semantic relevance of each document
    # (calculate_semantic_similarity is an assumed helper,
    # e.g. cosine similarity over embeddings)
    relevance_scores = []
    for doc in retrieved_docs:
        score = calculate_semantic_similarity(query, doc.content)
        relevance_scores.append((doc, score))

    # Sort by relevance, most relevant first
    sorted_docs = sorted(relevance_scores, key=lambda x: x[1], reverse=True)

    # Select documents until the token budget is exhausted
    selected_docs = []
    current_tokens = 0
    for doc, score in sorted_docs:
        doc_tokens = count_tokens(doc.content)
        if current_tokens + doc_tokens <= max_tokens:
            selected_docs.append(doc)
            current_tokens += doc_tokens
        else:
            break
    return selected_docs
```

Attention in Transformer architectures scales quadratically with sequence length (O(n²)): doubling the context window quadruples the required computation. Providers reflect this cost in their pricing.
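A back-of-the-envelope sketch of that scaling; the price constant is a hypothetical placeholder, not any provider's actual rate:

```python
# Attention compute grows ~n^2 with context length n, while input
# token cost grows linearly; providers price long contexts accordingly.
PRICE_PER_1K_INPUT = 0.01  # hypothetical $/1K input tokens

for n in [4_000, 8_000, 128_000]:
    attention_ops = n ** 2                       # relative attention cost
    input_cost = n / 1_000 * PRICE_PER_1K_INPUT  # linear token cost
    print(f"n={n:>7}: attention ~{attention_ops:.1e} (relative), input ~${input_cost:.2f}")
```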
Techniques like sparse attention, sliding window attention, and ring attention partially mitigate these costs, but the fundamental trade-off persists.
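As an illustration of the idea behind sliding window attention, each token attends only to a fixed-size window of preceding tokens, cutting attention cost from O(n²) to O(n·w); a minimal sketch of the mask:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    # mask[i, j] is True where token i may attend to token j:
    # causal (j <= i) and within the last `window` positions
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Each row shows which earlier tokens a given token can see
print(sliding_window_mask(6, 3).astype(int))
```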
The context window is the most important architectural constraint in LLM-based systems. It defines which usage patterns are viable: from RAG that must carefully select which documents to include, to agents that need to maintain state across multiple iterations.
For staff+ engineers, understanding these limitations is crucial for designing scalable systems. It's not just about "making it work," but optimizing for cost, latency, and accuracy simultaneously. Decisions about chunking, context caching, and summarization strategies directly impact user experience and operational costs.
Efficient context management distinguishes amateur implementations from robust production systems. It's the difference between a prototype that works with small documents and a platform that scales to enterprise knowledge bases.
Related terms:

- **Large language models (LLMs):** Massive neural networks based on the Transformer architecture, trained on enormous text corpora to understand and generate natural language, with emergent capabilities such as reasoning, translation, and code generation.
- **Prompt engineering:** The discipline of designing effective instructions for language models, combining clarity, structure, and examples to obtain consistent, high-quality responses.
- **Retrieval-augmented generation (RAG):** An architectural pattern that combines information retrieval from external sources with LLM text generation, reducing hallucinations and keeping knowledge current without retraining the model.
- **Prompt caching:** A technique that stores the internal computation of reused prompt prefixes across LLM calls, reducing costs by up to 90% and latency by up to 85% in applications with repetitive context.