Jonatan Mata (jonmatum.com)
© 2026 Jonatan Mata. All rights reserved. v2.1.1
Concepts

Retrieval-Augmented Generation

Architectural pattern that combines information retrieval from external sources with LLM text generation, reducing hallucinations and keeping knowledge current without retraining the model.

evergreen #rag #llm #embeddings #vector-search #information-retrieval #ai-architecture

What it is

RAG (Retrieval-Augmented Generation) is a pattern that improves LLM responses by injecting relevant information retrieved from external sources directly into the prompt context. Instead of relying solely on knowledge stored in the model's weights, the system searches for relevant documents and includes them as context before generating the response.

The concept was formalized by Lewis et al. in 2020 and has since become the dominant pattern for enterprise generative AI applications.

How it works

The typical RAG flow has three stages:

1. Indexing (offline)

Source documents are processed and stored for efficient search:

  • Documents are split into manageable chunks
  • Each chunk is converted into an embedding — a numerical vector capturing its semantic meaning
  • Vectors are stored in a vector database or search index
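The indexing stage can be sketched end to end. The sketch below uses a deterministic hash-based stand-in for a real embedding model (an assumption for illustration; production systems use a learned model such as OpenAI's or sentence-transformers), so the vectors carry no real semantics, only the pipeline shape:

```python
import hashlib

def toy_embed(text: str, dim: int = 8) -> list[float]:
    # Stand-in for a real embedding model: deterministic, fixed-dimension,
    # but NOT semantically meaningful (assumption for illustration only).
    digest = hashlib.sha256(text.lower().encode()).digest()
    return [b / 255 for b in digest[:dim]]

def index_documents(docs: list[str], chunk_size: int = 100) -> list[dict]:
    # Split each document into fixed-size character chunks and embed each one.
    index = []
    for doc in docs:
        for start in range(0, len(doc), chunk_size):
            chunk = doc[start:start + chunk_size]
            index.append({"text": chunk, "vector": toy_embed(chunk)})
    return index

index = index_documents(["Returns are accepted within 30 days of purchase."])
```

In a real system this loop runs offline, and the resulting vectors land in a vector database rather than an in-memory list.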

2. Retrieval (runtime)

When a user query arrives:

  • The query is converted to an embedding using the same model
  • The most similar chunks are found by vector distance (cosine, dot product)
  • The top-K most relevant chunks are selected
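The similarity search at the heart of retrieval is small enough to write out. A minimal top-K sketch using cosine similarity over toy 2-dimensional vectors (the vectors and chunk names are invented for illustration):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product normalized by both vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    # Rank every chunk by similarity to the query and keep the k best.
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

index = [
    ("returns policy", [1.0, 0.0]),
    ("shipping times", [0.0, 1.0]),
    ("refund window", [0.9, 0.1]),
]
results = top_k([1.0, 0.0], index, k=2)  # → ['returns policy', 'refund window']
```

Vector databases implement the same idea with approximate nearest-neighbor indexes so the sort does not have to touch every vector.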

3. Generation (runtime)

  • Retrieved chunks are injected into the prompt as context
  • The LLM generates a response based on the query AND the provided context
  • Optionally, citations to original sources are included
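The injection step in generation is essentially prompt assembly. A minimal sketch (the prompt wording and citation format are illustrative choices, not a standard):

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    # Number each retrieved chunk so the model can cite sources as [n].
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using ONLY the context below. Cite sources as [n].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is the return policy?",
    ["Returns accepted within 30 days.", "Refunds issued to original payment method."],
)
```

The assembled string is what actually reaches the LLM; frameworks like LangChain generate an equivalent prompt behind the scenes.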

Minimal example with LangChain

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# `documents` is assumed to be a list of Document objects already loaded
# (e.g., via a loader such as TextLoader or PyPDFLoader).

# 1. Chunking
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(documents)

# 2. Indexing
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 3. Retrieval + Generation
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)
answer = qa.invoke({"query": "What is the return policy?"})
```

Chunking strategies

Chunking quality determines retrieval quality:

| Strategy | Typical size | Best for |
|---|---|---|
| Fixed size | 256-512 tokens | Homogeneous documents |
| Recursive by separators | 512-1,024 tokens | Structured text (Markdown, HTML) |
| Semantic | Variable | Documents where meaning crosses paragraphs |
| Per document | Full document | Short documents (FAQs, cards) |

An overlap of 10-20% between chunks helps preserve context at boundaries.
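Fixed-size chunking with overlap is simple enough to show directly. A sketch over a pre-tokenized list (the token list and sizes are illustrative; real splitters work on characters or model tokens):

```python
def chunk_with_overlap(tokens: list[str], size: int = 8, overlap: int = 2) -> list[list[str]]:
    # Each chunk repeats the last `overlap` tokens of the previous chunk,
    # so context at chunk boundaries is not lost.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = [f"t{i}" for i in range(20)]
chunks = chunk_with_overlap(tokens, size=8, overlap=2)
```

With size 8 and overlap 2 this yields chunks starting at tokens 0, 6, and 12, so every boundary token appears in two chunks.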

Advanced patterns

  • Hybrid RAG: combines vector search with keyword search (BM25) for better coverage
  • Iterative RAG: the agent performs multiple retrieval rounds, refining the search based on intermediate results
  • RAG with reranking: a secondary model reorders search results by relevance before passing them to the LLM
  • GraphRAG: uses knowledge graphs instead of (or in addition to) vector search
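Hybrid RAG needs a way to merge the vector and keyword result lists. One common choice (an assumption here, not the only option) is Reciprocal Rank Fusion, which scores each document by its rank in every list:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each document scores sum(1 / (k + rank))
    # over the ranked lists it appears in; k=60 is the value from the
    # original RRF paper and damps the influence of top ranks.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # ranked by embedding similarity
bm25_hits = ["doc_b", "doc_d", "doc_a"]     # ranked by keyword score
fused = rrf([vector_hits, bm25_hits])
```

Because RRF only uses ranks, it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.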

Why not just fine-tuning?

| Aspect | RAG | Fine-tuning |
|---|---|---|
| Data updates | Immediate (change documents) | Requires retraining |
| Cost | Low (search infrastructure) | High (GPU, labeled data) |
| Traceability | High (source citations) | Low (knowledge in weights) |
| Hallucinations | Reduced (factual context) | Persist |
| Specialized knowledge | Good with good documents | Better for style/format |

In practice, many systems combine both: fine-tuning for style and format, RAG for factual knowledge.

Connection with llms.txt

The proposed llms.txt standard is a practical application of the RAG idea: it publishes a structured Markdown document that agents can retrieve and use as context to answer questions about a site or project.

Limitations

  • Chunking quality: poorly split fragments produce irrelevant context
  • Context limit: you can't inject infinite documents — the LLM's context window is finite
  • Latency: the retrieval stage adds time to each query
  • Garbage in, garbage out: if source documents have errors, the LLM will propagate them confidently

Why it matters

RAG is the most practical technique for giving LLMs access to up-to-date, domain-specific information without fine-tuning. It combines the model's generative capability with data retrieved in real time, reducing hallucinations and keeping responses grounded in verifiable sources.

References

  • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Lewis et al., 2020. The original paper that formalized RAG.
  • From Local to Global: A Graph RAG Approach — Microsoft Research, 2024. GraphRAG for queries over complete corpora.
  • RAGAS: Automated Evaluation of RAG — Es et al., 2023. Evaluation framework for RAG systems.
  • RAG Options for Foundation Models — AWS, 2024. Prescriptive guide for RAG patterns in production.
  • What is Retrieval-Augmented Generation? — IBM Research, 2023. Explanation of the concept and its enterprise applications.

Related content

  • Semantic Search

    Information retrieval technique that uses vector embeddings to find results by meaning, not just exact keyword matching.

  • llms.txt

    Proposed standard for publishing a Markdown file at a website's root that enables language models to efficiently understand and use the site's content at inference time.

  • AI Agents

    Autonomous systems that combine language models with reasoning, memory, and tool use to execute complex multi-step tasks with minimal human intervention.

  • Embeddings

    Dense vector representations that capture the semantic meaning of text, images, or other data in a numerical space where proximity reflects conceptual similarity.

  • Vector Databases

    Storage systems specialized in indexing and searching high-dimensional vectors efficiently, enabling semantic search and RAG applications at scale.

  • Hallucination Mitigation

    Techniques to reduce LLMs generating false but plausible information, from RAG to factual verification and prompt design.

  • AWS Bedrock

    AWS serverless service providing access to foundation models from multiple providers (Anthropic, Meta, Mistral, Amazon) via unified API, without managing ML infrastructure.

  • Context Windows

    The maximum number of tokens an LLM can process in a single interaction, determining how much information it can consider simultaneously to generate responses.

  • Building a Second Brain in Public

    Chronicle of building a second brain with a knowledge graph, bilingual pipeline, and agent endpoints — in days, not weeks, and what that teaches about the gap between theory and working systems.
