Large Language Models

Retrieval-Augmented Generation (RAG)

The end-to-end RAG pipeline from chunking through retrieval, reranking, and grounded generation.

intermediate · 9 min read

RAG augments an LLM with external knowledge by retrieving relevant text at query time and injecting it into the prompt. The basic loop is simple; the production version has more knobs than you might expect.

The pipeline

  1. Ingestion. Split documents into chunks (typically 200-800 tokens with 50-100 token overlap).
  2. Embedding. Encode each chunk with a sentence-embedding model and store the vectors in an index.
  3. Retrieval. At query time, embed the user query and find the top-k nearest chunks.
  4. Reranking. Optionally rescore the top-k with a more expensive cross-encoder.
  5. Generation. Build a prompt that includes the retrieved chunks as context and ask the LLM to answer.
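The retrieval and prompt-building steps above can be sketched in a few lines. This is a minimal illustration, not a production setup: `embed` is a hypothetical bag-of-words stand-in for a real sentence-embedding model, the "index" is a plain Python list rather than a vector store, and the prompt wording is an assumption. Reranking and the LLM call itself are left out.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Hypothetical stand-in for an embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    # Embed the query and return the k most similar chunks, best first.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Number the chunks so the model can cite them; the wording is an assumption.
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

chunks = [
    "The index is rebuilt every night at 02:00 UTC.",
    "Cross-encoders score a query and a passage jointly.",
    "BM25 ranks documents by term frequency and rarity.",
]
index = [(c, embed(c)) for c in chunks]  # ingestion + embedding
query = "when is the index rebuilt"
top = retrieve(query, index, k=2)        # retrieval
prompt = build_prompt(query, top)        # input to the generation step
```

With a real embedding model only `embed` and the index lookup would change; the shape of the loop stays the same.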

Chunking strategy matters

Bad chunking destroys retrieval. Common pitfalls:
- Fixed-token chunks that split sentences mid-clause.
- No overlap, so context that straddles a chunk boundary is lost.
- Chunks that are too small lose context; chunks that are too large dilute the signal.

Recursive splitting, which tries paragraph boundaries first, then sentence boundaries, and only then falls back to token boundaries, is a safe default.
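One way to sketch recursive splitting is shown below. It assumes whitespace-separated words are a good enough proxy for tokens, uses a simple regex for sentence boundaries, and omits chunk overlap to stay short; `split_recursive` and `merge` are illustrative names, not any library's API.

```python
import re

def n_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())

def split_recursive(text: str, max_tokens: int, level: int = 0) -> list[str]:
    # Try the coarsest boundary first; recurse only into pieces still too long.
    if n_tokens(text) <= max_tokens:
        return [text.strip()] if text.strip() else []
    if level == 0:
        parts = re.split(r"\n\s*\n", text)        # paragraph boundaries
    elif level == 1:
        parts = re.split(r"(?<=[.!?])\s+", text)  # sentence boundaries
    else:
        words = text.split()                      # last resort: fixed windows
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
    pieces: list[str] = []
    for part in parts:
        pieces.extend(split_recursive(part, max_tokens, level + 1))
    return pieces

def merge(pieces: list[str], max_tokens: int) -> list[str]:
    # Greedily pack adjacent pieces so chunks are not needlessly small.
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and n_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = ("One two three. Four five six.\n\n"
        "Seven eight nine ten eleven twelve thirteen.")
pieces = split_recursive(text, max_tokens=4)
chunks = merge(pieces, max_tokens=4)
```

In this example the first paragraph splits cleanly at its sentence boundary, while the second has no usable boundary and falls back to fixed word windows. Note that `merge` will pack pieces across paragraph boundaries when they fit; whether that is desirable depends on the corpus.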

Hybrid retrieval

Pure vector search misses exact-keyword matches. Pure BM25 misses semantic matches. Combining the two rankings with Reciprocal Rank Fusion (RRF) typically beats either alone.
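RRF only needs the rank positions from each retriever, not their raw scores: each document receives the sum of 1 / (k + rank) over every list it appears in. A minimal sketch follows; the document IDs are made up, and k = 60 is the commonly used constant, which you may want to tune.

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Sum 1 / (k + rank) for each document across all ranked lists (rank starts at 1).
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

vector_hits = ["d3", "d1", "d7"]  # hypothetical vector-search ranking
bm25_hits = ["d1", "d9", "d3"]    # hypothetical BM25 ranking
fused = rrf([vector_hits, bm25_hits])
# fused == ["d1", "d3", "d9", "d7"]
```

Documents that appear in both lists (`d1`, `d3`) rise above documents that only one retriever found, which is the behavior hybrid retrieval is after.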

Common failure modes

  • Hallucinated citations. The model invents quotes that look like they came from the retrieved chunks. Mitigate by asking the model to quote verbatim, then checking after generation that each quote actually appears in a retrieved chunk.
  • Lost-in-the-middle. Models attend best to the start and end of long contexts. Re-order retrieved chunks so the most relevant are at the prompt edges.
  • Stale index. If index updates lag behind your knowledge base, expect answers that contradict the current source documents.
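The first two mitigations are easy to prototype. The sketch below is illustrative only: it assumes chunks arrive sorted most-relevant first, that the model wraps quotes in straight double quotes, and that a case- and whitespace-insensitive substring match counts as "verbatim". Both function names are hypothetical.

```python
import re

def reorder_for_edges(chunks_by_relevance: list[str]) -> list[str]:
    # Alternate chunks between the front and the back so the most relevant
    # land at the prompt edges and the least relevant end up in the middle.
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace before comparing.
    return " ".join(text.lower().split())

def unverified_quotes(answer: str, chunks: list[str]) -> list[str]:
    # Return quoted spans from the answer that appear in none of the chunks.
    quotes = re.findall(r'"([^"]+)"', answer)
    sources = [_normalize(c) for c in chunks]
    return [q for q in quotes if not any(_normalize(q) in s for s in sources)]

ordered = reorder_for_edges(["a", "b", "c", "d", "e"])
# ordered == ["a", "c", "e", "d", "b"]

retrieved = ["The index is rebuilt every night.", "Backups run weekly."]
answer = 'The docs say "rebuilt every night" and "backups run hourly".'
bad = unverified_quotes(answer, retrieved)
# bad == ["backups run hourly"]
```

A non-empty result from `unverified_quotes` is a signal to retry, strip the quote, or flag the answer for review; which of those is appropriate depends on the application.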