Retrieval-Augmented Generation
Architectural pattern that combines information retrieval from external sources with LLM text generation, reducing hallucinations and keeping knowledge current without retraining the model.
What it is
RAG (Retrieval-Augmented Generation) is a pattern that improves LLM responses by injecting relevant information retrieved from external sources directly into the prompt context. Instead of relying solely on knowledge stored in the model's weights, the system searches for relevant documents and includes them as context before generating the response.
The concept was formalized by Lewis et al. in 2020 and has since become the dominant pattern for enterprise generative AI applications.
How it works
The typical RAG flow has three stages:
1. Indexing (offline)
Source documents are processed and stored for efficient search:
- Documents are split into manageable chunks
- Each chunk is converted into an embedding — a numerical vector capturing its semantic meaning
- Vectors are stored in a vector database or search index
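The indexing stage above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the chunker splits on raw characters (real systems usually split on tokens, sentences, or document structure), and `toy_embed` is a hypothetical stand-in for a real embedding model such as a sentence-transformer or an embeddings API.

```python
import hashlib
import math

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks. Overlap keeps
    sentences that straddle a boundary retrievable from both sides."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hypothetical stand-in for an embedding model: hash each word
    into a fixed-size vector, then L2-normalize. A real system would
    call a trained model here; only the vector shape is realistic."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Offline indexing: embed every chunk and keep (chunk, vector) pairs.
index = [(c, toy_embed(c)) for c in chunk_text("some long document ... " * 20)]
```

In practice the `(chunk, vector)` pairs would be written to a vector database rather than held in a Python list.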
2. Retrieval (runtime)
When a user query arrives:
- The query is converted to an embedding using the same model
- The most similar chunks are found by comparing vectors (cosine similarity, dot product)
- The top-K most relevant chunks are selected
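A brute-force version of this retrieval step, assuming an in-memory index of `(chunk, vector)` pairs like the one built during indexing; a production system would query a vector database with an approximate-nearest-neighbor index instead of scanning everything:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float],
             index: list[tuple[str, list[float]]],
             top_k: int = 3) -> list[str]:
    """Rank every indexed chunk by similarity to the query vector
    and return the texts of the top-K matches."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk for chunk, _ in scored[:top_k]]
```

Note that when all vectors are L2-normalized, the dot product alone gives the same ranking as cosine similarity, which is why many vector stores default to it.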
3. Generation (runtime)
- Retrieved chunks are injected into the prompt as context
- The LLM generates a response based on the query AND the provided context
- Optionally, citations to original sources are included
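The generation stage is mostly prompt assembly. A minimal sketch of how retrieved chunks might be injected as numbered context so the model can cite them (the exact wording and format are illustrative, not a standard):

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Inject retrieved chunks into the prompt as numbered context
    and instruct the model to ground its answer in those sources."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        "Cite sources as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# The resulting string is what gets sent to the LLM's completion API.
prompt = build_prompt("What is RAG?", ["chunk about retrieval", "chunk about generation"])
```

The "say so if insufficient" instruction is a common guard against the model falling back on its parametric knowledge when retrieval misses.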
Advanced patterns
- Hybrid RAG: combines vector search with keyword search (BM25) for better coverage
- Iterative RAG: the agent performs multiple retrieval rounds, refining the search based on intermediate results
- RAG with reranking: a secondary model reorders search results by relevance before passing them to the LLM
- GraphRAG: uses knowledge graphs instead of (or in addition to) vector search
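For hybrid RAG, one common way to merge the vector ranking with the BM25 ranking is Reciprocal Rank Fusion (RRF), which needs only the rank positions, not the raw scores. A small sketch (the document IDs are made up for illustration):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge several ranked lists (e.g. one
    from vector search, one from BM25). Each document contributes
    1 / (k + rank) per list it appears in; k=60 is the constant used
    in the original RRF paper (Cormack et al., 2009)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Doc B places well in both lists, so it fuses to the top.
fused = rrf_fuse([["A", "B", "C"], ["B", "D", "A"]])
```

RRF is popular precisely because vector similarities and BM25 scores live on incompatible scales; fusing ranks sidesteps score normalization entirely.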
Why not just fine-tuning?
| Aspect | RAG | Fine-tuning |
|---|---|---|
| Data updates | Immediate (change documents) | Requires retraining |
| Cost | Low (search infrastructure) | High (GPU, labeled data) |
| Traceability | High (source citations) | Low (knowledge in weights) |
| Hallucinations | Reduced (factual context) | Persist |
| Specialized knowledge | Good with good documents | Better for style/format |
In practice, many systems combine both: fine-tuning for style and format, RAG for factual knowledge.
Connection with llms.txt
The llms.txt standard is a natural fit for RAG: it gives agents a structured document they can retrieve and use as context when answering questions about a site or project.
Limitations
- Chunking quality: poorly split fragments produce irrelevant context
- Context limit: you can't inject infinite documents — the LLM's context window is finite
- Latency: the retrieval stage adds time to each query
- Garbage in, garbage out: if source documents have errors, the LLM will propagate them confidently
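The context-limit issue is usually handled by packing chunks greedily until a token budget is exhausted. A sketch, using the rough heuristic of ~4 characters per token; production code would count with the model's real tokenizer instead:

```python
def fit_to_budget(chunks: list[str], max_tokens: int = 3000) -> list[str]:
    """Greedy context packing: keep chunks in relevance order until
    the token budget runs out. Token cost is approximated as
    len(text) / 4, a rough rule of thumb for English text."""
    kept, used = [], 0
    for chunk in chunks:
        cost = max(1, len(chunk) // 4)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Because chunks arrive sorted by relevance, truncation drops the least relevant material first.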
Why it matters
RAG is the most practical technique for giving LLMs access to up-to-date, domain-specific information without fine-tuning. It combines the model's generative capability with data retrieved in real time, reducing hallucinations and keeping responses grounded in verifiable sources.
References
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Lewis et al., 2020. The original paper that formalized RAG.
- From Local to Global: A Graph RAG Approach — Microsoft Research, 2024. GraphRAG for queries over complete corpora.
- RAGAS: Automated Evaluation of RAG — Es et al., 2023. Evaluation framework for RAG systems.