
Context Windows

The maximum number of tokens an LLM can process in a single interaction, determining how much information it can consider simultaneously to generate responses.

evergreen · #context-window #tokens #llm #memory #attention #scaling

What it is

The context window is the maximum limit of tokens (words and subwords) that an LLM can process in a single interaction. It includes both input (prompt, context, conversation history) and output generated by the model. It's essentially the model's "working memory" during a session.

A token doesn't exactly equal a word. In English, 1,000 tokens correspond to roughly 750 words, while in Spanish they correspond to about 600 words because of morphological differences between the languages. The window size determines how much context the model can draw on to generate coherent, contextually relevant responses.
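
To get a feel for the token-to-word ratio, you can count tokens directly. A minimal sketch using the tiktoken library and its cl100k_base encoding (the tokenizer for GPT-3.5/GPT-4-era OpenAI models; other providers ship their own tokenizers):

import tiktoken

# cl100k_base is the encoding used by GPT-3.5/GPT-4-era OpenAI models
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text):
    # Each element of the encoded list is one token ID
    return len(encoding.encode(text))

print(count_tokens("The context window is the model's working memory."))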

The context window is a hard architectural constraint of the model, determined by its design and training. It cannot be expanded dynamically — it's a fixed limit that defines the system's fundamental capabilities.
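
Because input and output share the same budget, a common guard is to reserve space for the response before sending a request. A minimal sketch, assuming a hypothetical 128K-token model limit:

MODEL_CONTEXT_WINDOW = 128_000  # hypothetical limit; check your model's documentation

def fits_in_window(prompt_tokens, max_output_tokens):
    # Prompt and reserved output must fit inside the same fixed budget
    return prompt_tokens + max_output_tokens <= MODEL_CONTEXT_WINDOW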

Historical size evolution

Year   Model        Window           Approximate equivalent
2022   GPT-3.5      4K tokens        ~8 pages
2023   GPT-4        8K–32K tokens    ~16–64 pages
2023   Claude 2     100K tokens      ~200 pages
2024   Claude 3     200K tokens      ~400 pages
2024   Gemini 1.5   1M–2M tokens     ~2,000–4,000 pages

This evolution reflects advances in efficient attention architectures, training techniques, and computational capacity. Recent models can process complete documents, extensive codebases, or very long conversations in a single interaction.

Needle in a Haystack tests

"Needle in a haystack" tests evaluate how well models can retrieve specific information within long contexts. Results show consistent patterns:

  • Initial and final position: Models retrieve information better when it's at the beginning or end of context
  • Middle degradation: Accuracy significantly decreases for information located in the middle of context
  • Size dependency: Larger context windows show greater degradation in intermediate positions
# Example needle-in-haystack test
def needle_in_haystack_test(model, context_length, needle_position):
    # Generate distractor context
    haystack = generate_distractor_text(context_length - 100)
    
    # Insert "needle" (specific information) at determined position
    needle = "The secret key is: ALPHA-7829"
    
    if needle_position == "start":
        context = needle + "\n\n" + haystack
    elif needle_position == "middle":
        mid = len(haystack) // 2
        context = haystack[:mid] + "\n\n" + needle + "\n\n" + haystack[mid:]
    else:  # end
        context = haystack + "\n\n" + needle
    
    # Ask for the specific information
    prompt = context + "\n\nWhat is the secret key mentioned in the text?"
    
    response = model.generate(prompt)
    return "ALPHA-7829" in response

Context management strategies

Smart chunking

Dividing long documents requires preserving semantic coherence:

from langchain.text_splitter import RecursiveCharacterTextSplitter
 
def smart_chunking(text, title, max_chunk_size=1000, overlap=200):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=max_chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""]
    )

    chunks = splitter.split_text(text)

    # Add document context to each chunk so it stays identifiable after retrieval
    enriched_chunks = []
    for i, chunk in enumerate(chunks):
        context_header = f"Document: {title}\nSection {i+1}/{len(chunks)}\n\n"
        enriched_chunks.append(context_header + chunk)
    
    return enriched_chunks

Progressive summarization

For long conversations or extensive document analysis:

def progressive_summarization(messages, max_context_tokens=8000):
    # count_tokens here is assumed to sum tokens across all messages
    current_tokens = count_tokens(messages)
    
    if current_tokens <= max_context_tokens:
        return messages
    
    # Summarize older messages, preserve recent ones
    recent_messages = messages[-10:]  # Last 10 messages
    old_messages = messages[:-10]
    
    summary_prompt = f"""
    Summarize this conversation maintaining:
    - Important technical decisions
    - Problem context
    - Reached conclusions
    
    Conversation:
    {format_messages(old_messages)}
    """
    
    summary = llm.generate(summary_prompt)
    
    return [{"role": "system", "content": f"Previous conversation summary: {summary}"}] + recent_messages

Context prioritization

In RAG systems, ordering information by relevance:

def prioritize_context(query, retrieved_docs, max_tokens=6000):
    # Calculate semantic relevance
    relevance_scores = []
    for doc in retrieved_docs:
        score = calculate_semantic_similarity(query, doc.content)
        relevance_scores.append((doc, score))
    
    # Sort by relevance
    sorted_docs = sorted(relevance_scores, key=lambda x: x[1], reverse=True)
    
    # Select documents that fit in the window
    selected_docs = []
    current_tokens = 0
    
    for doc, score in sorted_docs:
        doc_tokens = count_tokens(doc.content)
        if current_tokens + doc_tokens <= max_tokens:
            selected_docs.append(doc)
            current_tokens += doc_tokens
        else:
            break
    
    return selected_docs
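
The selected documents are then assembled into the final prompt, with the question placed after the context. A sketch, where retriever.search is a hypothetical retrieval call:

docs = retriever.search(query, k=20)  # hypothetical retrieval step
context_docs = prioritize_context(query, docs, max_tokens=6000)

prompt = "\n\n".join(doc.content for doc in context_docs)
prompt += f"\n\nAnswer using only the context above.\n\nQuestion: {query}"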

Performance and cost implications

Attention in Transformer architectures scales quadratically with sequence length (O(n²)). This means doubling the context window quadruples the required computation. Providers reflect this in their pricing models:

  • Cost per token: Increases with larger windows
  • Latency: Grows significantly with long contexts
  • Throughput: Decreases when processing extensive contexts

Techniques like sparse attention, sliding window attention, and ring attention partially mitigate these costs, but the fundamental trade-off persists.
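
A back-of-envelope calculation makes the quadratic term concrete. The sketch below counts only the QK^T and attention-weighted-value multiplications, ignoring projections, heads, and other costs:

def attention_matmul_flops(seq_len, d_model):
    # QK^T is (seq_len x d_model) @ (d_model x seq_len): seq_len^2 * d_model multiply-adds
    # The attention-weighted sum over V costs roughly the same again
    return 2 * seq_len * seq_len * d_model

base = attention_matmul_flops(8_000, 4096)
doubled = attention_matmul_flops(16_000, 4096)
print(doubled / base)  # -> 4.0: doubling the context quadruples the attention cost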

Why it matters

The context window is the most important architectural constraint in LLM-based systems. It defines which usage patterns are viable: from RAG pipelines that must carefully select which documents to include, to agents that must maintain state across many iterations.

For staff+ engineers, understanding these limitations is crucial for designing scalable systems. It's not just about "making it work," but optimizing for cost, latency, and accuracy simultaneously. Decisions about chunking, context caching, and summarization strategies directly impact user experience and operational costs.

Efficient context management distinguishes amateur implementations from robust production systems. It's the difference between a prototype that works with small documents and a platform that scales to enterprise knowledge bases.

References

  • Lost in the Middle: How Language Models Use Long Contexts — Liu et al., 2023. Fundamental research on attention degradation in long contexts.
  • Effective Long-Context Scaling of Foundation Models — Xiong et al. (Meta AI), 2023. Techniques for efficiently scaling the context windows of foundation models.
  • Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention — Google, 2024. Architecture for virtually infinite contexts.
  • Long context — Gemini API — Google AI, 2024. Official documentation on long context capabilities.
  • Prompt caching - Claude API Docs — Anthropic, 2024. Context window usage optimization through caching.
  • Introducing the next generation of Claude — Anthropic, 2024. Announcement of context capability improvements.

Related content

  • Large Language Models

    Massive neural networks based on the Transformer architecture, trained on enormous text corpora to understand and generate natural language with emergent capabilities like reasoning, translation, and code generation.

  • Prompt Engineering

    The discipline of designing effective instructions for language models, combining clarity, structure, and examples to obtain consistent, high-quality responses.

  • Retrieval-Augmented Generation

    Architectural pattern that combines information retrieval from external sources with LLM text generation, reducing hallucinations and keeping knowledge current without retraining the model.

  • Prompt Caching

    Technique that stores the internal computation of reused prompt prefixes across LLM calls, reducing costs by up to 90% and latency by up to 85% in applications with repetitive context.
