Architectural pattern that combines information retrieval from external sources with LLM text generation, reducing hallucinations and keeping knowledge current without retraining the model.
RAG (Retrieval-Augmented Generation) is a pattern that improves LLM responses by injecting relevant information retrieved from external sources directly into the prompt context. Instead of relying solely on knowledge stored in the model's weights, the system searches for relevant documents and includes them as context before generating the response.
The concept was formalized by Lewis et al. in 2020 and has since become the dominant pattern for enterprise generative AI applications.
The typical RAG flow has three stages: indexing, retrieval, and generation.

1. **Indexing (offline).** Source documents are processed and stored for efficient search: they are split into chunks, embedded, and written to a vector index.
2. **Retrieval.** When a user query arrives, it is embedded and the most similar chunks are fetched from the index.
3. **Generation.** The retrieved chunks are injected into the prompt so the model can answer grounded in them.

The snippet below walks through all three stages with LangChain and FAISS (it assumes `documents` has already been loaded):
```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# 1. Chunking: split the loaded documents into overlapping pieces
# (sizes are in characters by default; 64/512 gives ~12.5% overlap)
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(documents)

# 2. Indexing: embed each chunk and store it in a FAISS vector index
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 3. Retrieval + Generation: fetch the 4 most similar chunks per query
# and hand them to the model as context
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)
answer = qa.invoke({"query": "What is the return policy?"})["result"]
```

Chunking quality determines retrieval quality:
| Strategy | Typical size | Best for |
|---|---|---|
| Fixed size | 256-512 tokens | Homogeneous documents |
| Recursive by separators | 512-1,024 tokens | Structured text (Markdown, HTML) |
| Semantic | Variable | Documents where meaning crosses paragraphs |
| Per document | Full document | Short documents (FAQs, cards) |
An overlap of 10-20% between chunks helps preserve context at boundaries.
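Because the table's sizes are in tokens, a splitter that measures characters (the default) will not match them exactly. Here is a minimal sketch of token-based recursive splitting, assuming `markdown_text` holds the document and using an illustrative separator list:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Measure chunk length in tokens (via tiktoken) to match the table's units
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=64,  # 64/512 = 12.5%, inside the 10-20% band
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],  # prefer Markdown heading boundaries
)
chunks = splitter.split_text(markdown_text)
```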
RAG is often weighed against fine-tuning, the other common way to specialize an LLM:

| Aspect | RAG | Fine-tuning |
|---|---|---|
| Data updates | Immediate (change documents) | Requires retraining |
| Cost | Low (search infrastructure) | High (GPU, labeled data) |
| Traceability | High (source citations) | Low (knowledge in weights) |
| Hallucinations | Reduced (factual context) | Persist |
| Specialized knowledge | Good with good documents | Better for style/format |
In practice, many systems combine both: fine-tuning for style and format, RAG for factual knowledge.
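A minimal sketch of that combination, reusing `vectorstore` from the earlier example (the fine-tuned model id below is hypothetical):

```python
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

# The fine-tuned model supplies tone and format;
# the retriever supplies current, citable facts
ft_llm = ChatOpenAI(model="ft:gpt-4o-mini-2024-07-18:acme::abc123")  # hypothetical id

qa = RetrievalQA.from_chain_type(
    llm=ft_llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)
```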
The llms.txt standard is a practical example of RAG: it provides agents with a structured document they can retrieve and use as context to answer questions about a site or project.
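A minimal sketch of that use, assuming a hypothetical site that publishes the file at its root:

```python
import requests
from langchain_openai import ChatOpenAI

# Fetch the site's llms.txt and use it as retrieved context
context = requests.get("https://example.com/llms.txt", timeout=10).text  # hypothetical URL

llm = ChatOpenAI(model="gpt-4o")
answer = llm.invoke(
    f"Answer using only this context:\n\n{context}\n\nQuestion: What does this project do?"
)
print(answer.content)
```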
RAG is the most practical technique for giving LLMs access to up-to-date, domain-specific information without fine-tuning. It combines the model's generative capability with data retrieved in real time, reducing hallucinations and keeping responses grounded in verifiable sources.
Related:

- **Semantic search**: information retrieval technique that uses vector embeddings to find results by meaning, not just exact keyword matching.
- **llms.txt**: proposed standard for publishing a Markdown file at a website's root that lets language models efficiently understand and use the site's content at inference time.
- **AI agents**: autonomous systems that combine language models with reasoning, memory, and tool use to execute complex multi-step tasks with minimal human intervention.
- **Embeddings**: dense vector representations that capture the semantic meaning of text, images, or other data in a numerical space where proximity reflects conceptual similarity.
- **Vector databases**: storage systems specialized in indexing and searching high-dimensional vectors efficiently, enabling semantic search and RAG at scale.
- **Hallucination mitigation**: techniques for reducing an LLM's tendency to generate false but plausible information, from RAG to factual verification and prompt design.
- **Amazon Bedrock**: AWS serverless service providing access to foundation models from multiple providers (Anthropic, Meta, Mistral, Amazon) through a unified API, without managing ML infrastructure.
- **Context window**: the maximum number of tokens an LLM can process in a single interaction, determining how much information it can consider at once when generating responses.
- A chronicle of building a second brain with a knowledge graph, bilingual pipeline, and agent endpoints, in days rather than weeks, and what that teaches about the gap between theory and working systems.