Concepts

Retrieval-Augmented Generation

Architectural pattern that combines information retrieval from external sources with LLM text generation, reducing hallucinations and keeping knowledge current without retraining the model.

seed #rag #llm #embeddings #vector-search #information-retrieval #ai-architecture

What it is

RAG (Retrieval-Augmented Generation) is a pattern that improves LLM responses by injecting relevant information retrieved from external sources directly into the prompt context. Instead of relying solely on knowledge stored in the model's weights, the system searches for relevant documents and includes them as context before generating the response.

The concept was formalized by Lewis et al. in 2020 and has since become the dominant pattern for enterprise generative AI applications.

How it works

The typical RAG flow has three stages:

1. Indexing (offline)

Source documents are processed and stored for efficient search:

  • Documents are split into manageable chunks
  • Each chunk is converted into an embedding — a numerical vector capturing its semantic meaning
  • Vectors are stored in a vector database or search index
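The indexing stage above can be sketched in a few lines. This is a toy illustration, not a production recipe: the `embed` function here is just a bag-of-words count vector standing in for a real embedding model, and a plain Python list stands in for a vector database.

```python
from collections import Counter

def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping character windows (one simple chunking strategy)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text: str) -> Counter:
    """Stand-in 'embedding': a bag-of-words count vector. A real system would
    call an embedding model here and store dense float vectors instead."""
    return Counter(text.lower().split())

# In-memory stand-in for a vector database: (chunk, vector) pairs.
docs = "RAG retrieves relevant chunks and injects them into the prompt. " * 10
index = [(c, embed(c)) for c in chunk(docs)]
```

The overlap between consecutive chunks helps avoid cutting a relevant sentence in half at a chunk boundary.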

2. Retrieval (runtime)

When a user query arrives:

  • The query is converted to an embedding using the same model
  • The most similar chunks are found by vector similarity (cosine similarity, dot product)
  • The top-K most relevant chunks are selected
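The retrieval stage can be sketched with the same toy bag-of-words vectors; a real system would reuse the embedding model from indexing time and let the vector database do the similarity search.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; must be the same scheme used at indexing time.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the top-k chunks ranked by similarity to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

index = [(c, embed(c)) for c in [
    "the capital of france is paris",
    "embeddings map text to vectors",
    "paris hosts the louvre museum",
]]
```

Note that query and documents must go through the same embedding model; mixing models makes the similarity scores meaningless.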

3. Generation (runtime)

  • Retrieved chunks are injected into the prompt as context
  • The LLM generates a response based on the query AND the provided context
  • Optionally, citations to original sources are included
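The generation stage is mostly prompt assembly. A minimal sketch (the exact wording of the instructions is an illustrative choice, not a standard):

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Inject retrieved chunks as numbered sources so the model can cite [n]."""
    sources = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using ONLY the sources below. "
        "Cite sources as [n]; say so if the answer is not present.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Numbering the sources is what makes the optional citation step possible: the model can refer back to [1], [2], etc., and the application can map those markers to original documents.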

Advanced patterns

  • Hybrid RAG: combines vector search with keyword search (BM25) for better coverage
  • Iterative RAG: the agent performs multiple retrieval rounds, refining the search based on intermediate results
  • RAG with reranking: a secondary model reorders search results by relevance before passing them to the LLM
  • GraphRAG: uses knowledge graphs instead of (or in addition to) vector search
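For hybrid RAG, a common way to merge the vector and keyword result lists is Reciprocal Rank Fusion (RRF), sketched below; the constant k=60 is the value commonly used in practice.

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank) per
    document, so items ranked highly by several retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

RRF works on ranks rather than raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.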

Why not just fine-tuning?

Aspect                  RAG                             Fine-tuning
Data updates            Immediate (change documents)    Requires retraining
Cost                    Low (search infrastructure)     High (GPU, labeled data)
Traceability            High (source citations)         Low (knowledge in weights)
Hallucinations          Reduced (factual context)       Persist
Specialized knowledge   Good with good documents        Better for style/format

In practice, many systems combine both: fine-tuning for style and format, RAG for factual knowledge.

Connection with llms.txt

The llms.txt standard is a practical example of RAG: it provides agents with a structured document they can retrieve and use as context to answer questions about a site or project.

Limitations

  • Chunking quality: poorly split fragments produce irrelevant context
  • Context limit: you can't inject infinite documents — the LLM's context window is finite
  • Latency: the retrieval stage adds time to each query
  • Garbage in, garbage out: if source documents have errors, the LLM will propagate them confidently

Why it matters

RAG is the most practical technique for giving LLMs access to up-to-date, domain-specific information without fine-tuning. It combines the model's generative capability with data retrieved in real time, reducing hallucinations and keeping responses grounded in verifiable sources.
