Large Language Models
Retrieval-Augmented Generation
The end-to-end RAG pipeline from chunking through retrieval, reranking, and grounded generation.
intermediate · 9 min read
RAG augments an LLM with external knowledge by retrieving relevant text at query time and injecting it into the prompt. The basic loop is simple; the production version has more knobs than you might expect.
The pipeline
- Ingestion. Split documents into chunks (typically 200-800 tokens with 50-100 token overlap).
- Embedding. Encode each chunk with a sentence-embedding model and store the vectors in an index.
- Retrieval. At query time, embed the user query and find the top-k nearest chunks.
- Reranking. Optionally rescore the top-k with a cross-encoder, which reads each query-chunk pair jointly and is more accurate but slower than comparing precomputed embeddings.
- Generation. Build a prompt that includes the retrieved chunks as context and ask the LLM to answer.
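The steps above can be sketched end to end in a few lines. This is a minimal illustration under loud assumptions, not a production design: `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, the "index" is a plain Python list searched by brute force, the optional reranking step is omitted, and the sample chunks and function names are invented for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a sentence-embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, index, k=2):
    """Embed the query and return the k chunks with the highest similarity."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, chunks):
    """Number the retrieved chunks and place them ahead of the question."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below. Cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# Ingestion and embedding: hypothetical, already-chunked documents.
chunks = [
    "The refund window is 30 days from the delivery date.",
    "Shipping to Canada takes 5 to 7 business days.",
    "Support is available by email on weekdays.",
]
index = [(c, embed(c)) for c in chunks]

# Retrieval and prompt construction; the prompt would then be sent to the LLM.
question = "How long is the refund window?"
top = retrieve(question, index, k=2)
prompt = build_prompt(question, top)
```

In a real system the list scan would likely be replaced by an approximate nearest-neighbor index, and `embed` by a trained model; the control flow stays the same.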
Chunking strategy matters
Bad chunking destroys retrieval. Common pitfalls:
- Fixed-token chunks that split sentences mid-clause.
- No overlap, so context that straddles a chunk boundary is lost.
- Chunks that are too small lose context; chunks that are too large dilute the signal.
Recursive splitting on paragraph -> sentence -> token boundaries is the safe default.
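One way to picture recursive splitting: try the coarsest boundary first, descend to a finer one only when a piece is still too long, then pack adjacent pieces back together up to the size limit. The sketch below makes several simplifying assumptions: whitespace-separated words stand in for model tokens, the overlap is taken as the trailing words of the previous chunk, paragraph breaks are flattened to spaces when pieces are joined, and the names `split_recursive` and `chunk_text` are illustrative rather than any library's API.

```python
import re

# Separators tried in order: paragraph break, sentence end, then any whitespace.
SEPARATORS = [r"\n\s*\n", r"(?<=[.!?])\s+", r"\s+"]

def n_tokens(text):
    """Whitespace word count, used here as a rough proxy for model tokens."""
    return len(text.split())

def split_recursive(text, max_tokens, level=0):
    """Split at the coarsest boundary that yields pieces of at most max_tokens."""
    text = text.strip()
    if not text:
        return []
    if n_tokens(text) <= max_tokens or level >= len(SEPARATORS):
        return [text]
    pieces = []
    for part in re.split(SEPARATORS[level], text):
        pieces.extend(split_recursive(part, max_tokens, level + 1))
    return pieces

def chunk_text(text, max_tokens=200, overlap=50):
    """Pack pieces into chunks of at most max_tokens, including the overlap prefix."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    budget = max_tokens - overlap  # room left for new text after the overlap
    groups, current, size = [], [], 0
    for piece in split_recursive(text, budget):
        n = n_tokens(piece)
        if current and size + n > budget:
            groups.append(" ".join(current))
            current, size = [], 0
        current.append(piece)
        size += n
    if current:
        groups.append(" ".join(current))
    chunks = []
    for i, body in enumerate(groups):
        # Prefix each chunk (except the first) with the tail of the previous one.
        tail = groups[i - 1].split()[-overlap:] if i > 0 and overlap > 0 else []
        chunks.append(" ".join(tail + [body]))
    return chunks

sample = "One two three. Four five six.\n\nSeven eight nine. Ten eleven twelve."
sample_chunks = chunk_text(sample, max_tokens=8, overlap=2)
```

With these tiny limits each paragraph fits the budget, so the split happens at the paragraph break and the second chunk starts with the last two words of the first. A production splitter would count tokens with the embedding model's own tokenizer.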
Hybrid retrieval
Pure vector search misses exact-keyword matches. Pure BM25 misses semantic matches. Fusing the two ranked lists with Reciprocal Rank Fusion (RRF) typically beats either alone.
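RRF needs only the rank positions, not the raw scores, which is why it can combine BM25 and vector results that live on different scales. Each chunk's fused score is the sum over result lists of 1 / (k + rank), with rank starting at 1. The sketch below assumes k = 60, a commonly used constant, and uses made-up chunk ids; ties are broken by id purely to keep the output deterministic.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse ranked lists of chunk ids; each list adds 1 / (k + rank) per id."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

vector_hits = ["c3", "c1", "c7"]  # hypothetical vector-search ranking
bm25_hits = ["c1", "c9", "c3"]    # hypothetical BM25 ranking
fused = rrf([vector_hits, bm25_hits])
# fused == ["c1", "c3", "c9", "c7"]
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks that only one retriever found, even when that retriever ranked them well.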
Common failure modes
- Hallucinated citations. The model invents quotes that look like they came from the retrieved chunks. Mitigate by asking the model to quote verbatim, then check after generation that every quote actually appears in a retrieved chunk.
- Lost-in-the-middle. Models attend best to the start and end of long contexts. Reorder retrieved chunks so the most relevant sit at the start and end of the context and the weakest in the middle.
- Stale index. If index updates lag behind your knowledge base, the model is grounded on outdated text, so expect answers that contradict the current source.
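The first two mitigations are easy to prototype. The sketch below is one possible approach, with hypothetical helper names: `unsupported_quotes` assumes the model marks quotes with straight double quotes and checks them by normalized substring match, and `edges_first` assumes its input is already sorted from most to least relevant.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so line wrapping does not break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()

def unsupported_quotes(answer, chunks):
    """Return double-quoted spans in the answer that appear in no retrieved chunk."""
    haystack = [normalize(c) for c in chunks]
    quotes = re.findall(r'"([^"]+)"', answer)
    return [q for q in quotes if not any(normalize(q) in h for h in haystack)]

def edges_first(chunks_by_relevance):
    """Place the strongest chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

retrieved = ["The refund window is 30 days from the delivery date."]
answer = 'The policy says "the refund window is 30 days" and "refunds are instant".'
bad = unsupported_quotes(answer, retrieved)
# bad == ["refunds are instant"]

order = edges_first(["a", "b", "c", "d", "e"])  # "a" is the most relevant
# order == ["a", "c", "e", "d", "b"]
```

A non-empty result from `unsupported_quotes` could trigger a retry or a warning to the user. Exact substring matching is strict by design; it will also flag quotes the model lightly paraphrased.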