Jonatan Mata (jonmatum.com)
© 2026 Jonatan Mata. All rights reserved. v2.1.1
Concepts

Retrieval-Augmented Generation

Architectural pattern that combines information retrieval from external sources with LLM text generation, reducing hallucinations and keeping knowledge current without retraining the model.

evergreen #rag #llm #embeddings #vector-search #information-retrieval #ai-architecture

What it is

RAG (Retrieval-Augmented Generation) is a pattern that improves LLM responses by injecting relevant information retrieved from external sources directly into the prompt context. Instead of relying solely on knowledge stored in the model's weights, the system searches for relevant documents and includes them as context before generating the response.

The concept was formalized by Lewis et al. in 2020 and has since become the dominant pattern for enterprise generative AI applications.

How it works

The typical RAG flow has three stages:

1. Indexing (offline)

Source documents are processed and stored for efficient search:

  • Documents are split into manageable chunks
  • Each chunk is converted into an embedding — a numerical vector capturing its semantic meaning
  • Vectors are stored in a vector database or search index
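The indexing stage can be sketched end to end. The sketch below uses a deterministic hash-based stand-in for a real embedding model (an assumption for illustration; production systems use a learned model such as OpenAI's or sentence-transformers), so the vectors carry no real semantics, only the pipeline shape:

```python
import hashlib

def toy_embed(text: str, dim: int = 8) -> list[float]:
    # Stand-in for a real embedding model: deterministic, fixed-dimension,
    # but NOT semantically meaningful (assumption for illustration only).
    digest = hashlib.sha256(text.lower().encode()).digest()
    return [b / 255 for b in digest[:dim]]

def index_documents(docs: list[str], chunk_size: int = 100) -> list[dict]:
    # Split each document into fixed-size character chunks and embed each one.
    index = []
    for doc in docs:
        for start in range(0, len(doc), chunk_size):
            chunk = doc[start:start + chunk_size]
            index.append({"text": chunk, "vector": toy_embed(chunk)})
    return index

index = index_documents(["Returns are accepted within 30 days of purchase."])
```

In a real system this loop runs offline, and the resulting vectors land in a vector database rather than an in-memory list.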

2. Retrieval (runtime)

When a user query arrives:

  • The query is converted to an embedding using the same model
  • The most similar chunks are found by vector distance (cosine, dot product)
  • The top-K most relevant chunks are selected
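The similarity search at the heart of retrieval is small enough to write out. A minimal top-K sketch using cosine similarity over toy 2-dimensional vectors (the vectors and chunk names are invented for illustration):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product normalized by both vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    # Rank every chunk by similarity to the query and keep the k best.
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

index = [
    ("returns policy", [1.0, 0.0]),
    ("shipping times", [0.0, 1.0]),
    ("refund window", [0.9, 0.1]),
]
results = top_k([1.0, 0.0], index, k=2)  # → ['returns policy', 'refund window']
```

Vector databases implement the same idea with approximate nearest-neighbor indexes so the sort does not have to touch every vector.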

3. Generation (runtime)

  • Retrieved chunks are injected into the prompt as context
  • The LLM generates a response based on the query AND the provided context
  • Optionally, citations to original sources are included
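The injection step in generation is essentially prompt assembly. A minimal sketch (the prompt wording and citation format are illustrative choices, not a standard):

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    # Number each retrieved chunk so the model can cite sources as [n].
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using ONLY the context below. Cite sources as [n].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is the return policy?",
    ["Returns accepted within 30 days.", "Refunds issued to original payment method."],
)
```

The assembled string is what actually reaches the LLM; frameworks like LangChain generate an equivalent prompt behind the scenes.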

Minimal example with LangChain

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# `documents` is assumed to be a list of Document objects already loaded
# (e.g., via a loader such as TextLoader or PyPDFLoader).

# 1. Chunking
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(documents)

# 2. Indexing
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 3. Retrieval + Generation
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)
answer = qa.invoke({"query": "What is the return policy?"})
```

Chunking strategies

Chunking quality determines retrieval quality:

| Strategy | Typical size | Best for |
|---|---|---|
| Fixed size | 256-512 tokens | Homogeneous documents |
| Recursive by separators | 512-1,024 tokens | Structured text (Markdown, HTML) |
| Semantic | Variable | Documents where meaning crosses paragraphs |
| Per document | Full document | Short documents (FAQs, cards) |

An overlap of 10-20% between chunks helps preserve context at boundaries.
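Fixed-size chunking with overlap is simple enough to show directly. A sketch over a pre-tokenized list (the token list and sizes are illustrative; real splitters work on characters or model tokens):

```python
def chunk_with_overlap(tokens: list[str], size: int = 8, overlap: int = 2) -> list[list[str]]:
    # Each chunk repeats the last `overlap` tokens of the previous chunk,
    # so context at chunk boundaries is not lost.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = [f"t{i}" for i in range(20)]
chunks = chunk_with_overlap(tokens, size=8, overlap=2)
```

With size 8 and overlap 2 this yields chunks starting at tokens 0, 6, and 12, so every boundary token appears in two chunks.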

Advanced patterns

  • Hybrid RAG: combines vector search with keyword search (BM25) for better coverage
  • Iterative RAG: the agent performs multiple retrieval rounds, refining the search based on intermediate results
  • RAG with reranking: a secondary model reorders search results by relevance before passing them to the LLM
  • GraphRAG: uses knowledge graphs instead of (or in addition to) vector search
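Hybrid RAG needs a way to merge the vector and keyword result lists. One common choice (an assumption here, not the only option) is Reciprocal Rank Fusion, which scores each document by its rank in every list:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each document scores sum(1 / (k + rank))
    # over the ranked lists it appears in; k=60 is the value from the
    # original RRF paper and damps the influence of top ranks.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # ranked by embedding similarity
bm25_hits = ["doc_b", "doc_d", "doc_a"]     # ranked by keyword score
fused = rrf([vector_hits, bm25_hits])
```

Because RRF only uses ranks, it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.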

Why not just fine-tuning?

| Aspect | RAG | Fine-tuning |
|---|---|---|
| Data updates | Immediate (change documents) | Requires retraining |
| Cost | Low (search infrastructure) | High (GPU, labeled data) |
| Traceability | High (source citations) | Low (knowledge in weights) |
| Hallucinations | Reduced (factual context) | Persist |
| Specialized knowledge | Good with good documents | Better for style/format |

In practice, many systems combine both: fine-tuning for style and format, RAG for factual knowledge.

Connection with llms.txt

The proposed llms.txt standard is a practical application of the RAG idea: it publishes a structured Markdown document that agents can retrieve and use as context to answer questions about a site or project.

Limitations

  • Chunking quality: poorly split fragments produce irrelevant context
  • Context limit: you can't inject infinite documents — the LLM's context window is finite
  • Latency: the retrieval stage adds time to each query
  • Garbage in, garbage out: if source documents have errors, the LLM will propagate them confidently

Why it matters

RAG is the most practical technique for giving LLMs access to up-to-date, domain-specific information without fine-tuning. It combines the model's generative capability with data retrieved in real time, reducing hallucinations and keeping responses grounded in verifiable sources.

References

  • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Lewis et al., 2020. The original paper that formalized RAG.
  • From Local to Global: A Graph RAG Approach — Microsoft Research, 2024. GraphRAG for queries over complete corpora.
  • RAGAS: Automated Evaluation of RAG — Es et al., 2023. Evaluation framework for RAG systems.
  • RAG Options for Foundation Models — AWS, 2024. Prescriptive guide for RAG patterns in production.
  • What is Retrieval-Augmented Generation? — IBM Research, 2023. Explanation of the concept and its enterprise applications.

Related content

  • Semantic Search

    Information retrieval technique that uses vector embeddings to find results by meaning, not just exact keyword matching.

  • llms.txt

    Proposed standard for publishing a Markdown file at a website's root that enables language models to efficiently understand and use the site's content at inference time.

  • AI Agents

    Autonomous systems that combine language models with reasoning, memory, and tool use to execute complex multi-step tasks with minimal human intervention.

  • Embeddings

    Dense vector representations that capture the semantic meaning of text, images, or other data in a numerical space where proximity reflects conceptual similarity.

  • Vector Databases

    Storage systems specialized in indexing and searching high-dimensional vectors efficiently, enabling semantic search and RAG applications at scale.

  • Hallucination Mitigation

    Techniques to reduce LLMs generating false but plausible information, from RAG to factual verification and prompt design.

  • AWS Bedrock

    AWS serverless service providing access to foundation models from multiple providers (Anthropic, Meta, Mistral, Amazon) via unified API, without managing ML infrastructure.

  • Context Windows

    The maximum number of tokens an LLM can process in a single interaction, determining how much information it can consider simultaneously to generate responses.

  • Building a Second Brain in Public

    Chronicle of building a second brain with a knowledge graph, bilingual pipeline, and agent endpoints — in days, not weeks, and what that teaches about the gap between theory and working systems.
