Retrieval-Augmented Generation (RAG)

RAG augments an LLM with external knowledge by retrieving relevant text at query time and injecting it into the prompt. The basic loop is simple; the production version has more knobs than you might expect.

The pipeline

Ingestion. Split documents into chunks (typically 200-800 tokens with 50-100 token overlap).
Embedding. Encode each chunk with a sentence-embedding model and store the vectors in an index.
Retrieval. At query time, embed the user query and find the top-k nearest chunks.
Reranking. Optionally rescore the top-k with a more expensive cross-encoder.
Generation. Build a prompt that includes the retrieved chunks as context and ask the LLM to answer.
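
The steps above can be sketched end to end in a few lines of Python. This is a toy under loud assumptions: words stand in for tokens, a bag-of-words counter stands in for a real sentence-embedding model, a plain list stands in for a vector index, and the reranking step and the LLM call are left out. The names chunk_words, embed, retrieve, and build_prompt are illustrative and do not come from any library.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=40, overlap=10):
    """Ingestion: overlapping windows of words (words are a stand-in for tokens)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(text):
    """Embedding: toy bag-of-words counts. ASSUMPTION: swap in a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, index, k=2):
    """Retrieval: brute-force top-k by cosine similarity over (chunk, vector) pairs."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, chunks):
    """Generation: number the retrieved chunks and place them ahead of the question."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"


docs = [
    "The billing service retries failed payments three times.",
    "Password resets expire after one hour.",
    "Invoices are generated on the first day of each month.",
]
index = [(chunk, embed(chunk)) for doc in docs for chunk in chunk_words(doc)]
question = "When do password resets expire?"
top = retrieve(question, index, k=1)   # -> ["Password resets expire after one hour."]
prompt = build_prompt(question, top)   # this string would be sent to the LLM
```

In a real system each function would be replaced by a heavier component, but the data flow between the stages stays the same.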

Chunking strategy matters

Bad chunking destroys retrieval. Common pitfalls:
- Fixed-token chunks that split sentences mid-clause.
- No overlap, so context that straddles a chunk boundary is lost.
- Chunks that are too small lose context; chunks that are too large dilute the signal.

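Recursive splitting is the safe default: split on paragraph boundaries first, fall back to sentence boundaries for paragraphs that are still too long, and only then to token boundaries.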
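
A minimal sketch of that recursive fallback, assuming word counts as a rough proxy for token counts and simple regular expressions for paragraph and sentence boundaries. The function name recursive_split and the max_words parameter are illustrative, and overlap is omitted to keep the example short.

```python
import re

# Coarsest to finest: paragraph break, sentence end, any whitespace (word level).
SEPARATORS = [r"\n\s*\n", r"(?<=[.!?])\s+", r"\s+"]


def recursive_split(text, max_words=50, level=0):
    """Split text into chunks of at most max_words words, preferring coarse boundaries."""
    text = text.strip()
    if not text:
        return []
    if len(text.split()) <= max_words or level >= len(SEPARATORS):
        return [text]

    pieces = [p for p in re.split(SEPARATORS[level], text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if len(candidate.split()) <= max_words:
            current = candidate          # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)       # current chunk is full, flush it
            current = ""
        if len(piece.split()) <= max_words:
            current = piece              # piece fits on its own, start a new chunk
        else:
            # Piece is still too long: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, max_words, level + 1))
    if current:
        chunks.append(current)
    return chunks


text = "One two three. Four five six.\n\nSeven eight nine ten eleven twelve thirteen."
chunks = recursive_split(text, max_words=4)
# -> ["One two three.", "Four five six.", "Seven eight nine ten", "eleven twelve thirteen."]
```

Only the last paragraph, which has no sentence boundary to fall back on, ends up split mid-sentence. A production splitter would count real tokens with the model's tokenizer and add overlap between neighboring chunks.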

Hybrid retrieval

Pure vector search misses exact-keyword matches. Pure BM25 misses semantic matches. Fusing the two ranked lists with Reciprocal Rank Fusion (RRF) typically beats either alone.
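
RRF needs only the rank positions from each retriever, not their raw scores: each document gets the sum of 1 / (k + rank) over the lists it appears in. A minimal sketch, assuming each retriever returns an ordered list of chunk ids; the constant k = 60 is a commonly used default, and the ids below are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["c3", "c1", "c7"]   # hypothetical ids from vector search
bm25_hits = ["c1", "c9", "c3"]     # hypothetical ids from BM25
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# -> ["c1", "c3", "c9", "c7"]
```

Chunks found by both retrievers (c1, c3) rise above chunks found by only one, which is the behavior hybrid retrieval is after. Because only ranks are used, the cosine and BM25 scores never need to be put on a common scale.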

Common failure modes

Hallucinated citations. The model invents quotes that look like they came from the retrieved chunks. Mitigate by asking the model to quote verbatim, then verify post hoc that each quote actually appears in a retrieved chunk.
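
The post-hoc check can be as simple as a substring test. A minimal sketch, assuming the prompt told the model to wrap every quote in straight double quotes; the function names and sample strings are hypothetical.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unverified_quotes(answer, chunks):
    """Return quoted spans in the answer that appear in none of the retrieved chunks."""
    quotes = re.findall(r'"([^"]+)"', answer)
    haystacks = [normalize(c) for c in chunks]
    return [q for q in quotes if not any(normalize(q) in h for h in haystacks)]


chunks = ["Password resets expire after one hour.", "Invoices are generated monthly."]
answer = 'The docs say "resets expire after one hour" and "invoices are emailed weekly".'
bad = unverified_quotes(answer, chunks)
# -> ["invoices are emailed weekly"]
```

Any non-empty result is a signal to reject or regenerate the answer. Exact matching is deliberately strict: a paraphrased quote is flagged too.
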
Lost-in-the-middle. Models attend best to the start and end of long contexts. Reorder retrieved chunks so the most relevant sit at the start and end of the context, with the least relevant in the middle.
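
One simple way to do the reordering, assuming the chunks arrive sorted from most to least relevant, is to deal them alternately to the front and the back of the context. The function name reorder_for_edges is illustrative.

```python
def reorder_for_edges(chunks_by_relevance):
    """Place the most relevant chunks at both ends and the least relevant in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Even positions fill the front in order; odd positions fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ranked = ["r1", "r2", "r3", "r4", "r5"]   # r1 is the most relevant
ordered = reorder_for_edges(ranked)
# -> ["r1", "r3", "r5", "r4", "r2"]
```

The top hit opens the context, the second-best closes it, and the weakest chunk lands in the middle.
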
Stale index. If index updates lag behind the knowledge base, retrieved chunks can contradict the current source and the model will answer from outdated text.
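
One way to notice the lag is to store a content hash per document at indexing time and compare it against the knowledge base later. This is a sketch under assumptions, not a prescribed method: the document ids, the in-memory dictionaries, and the function names are all hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(indexed_hashes, current_docs):
    """Compare hashes saved at indexing time with the current knowledge base.

    Returns (to_reindex, to_delete): ids whose text is new or changed,
    and ids that are still in the index but gone from the knowledge base.
    """
    to_reindex = [doc_id for doc_id, text in current_docs.items()
                  if indexed_hashes.get(doc_id) != content_hash(text)]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return sorted(to_reindex), sorted(to_delete)


indexed = {"a": content_hash("refunds take 5 days"), "b": content_hash("old page")}
current = {"a": "refunds take 3 days", "c": "new page"}
to_reindex, to_delete = find_stale(indexed, current)
# -> (["a", "c"], ["b"])
```

Running a check like this on a schedule bounds how long a contradiction between the index and the source can persist.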