
LLM Systems

Hybrid Retrieval - BM25 + Vector + Reranking

Why pure vector search misses exact-match queries, how RRF combines lexical and semantic results, and where a cross-encoder reranker buys back the precision you lost.

intermediate · 8 min read

A pure dense-vector retriever is excellent at "find me passages that mean roughly the same thing" and surprisingly poor at "find me the passage that contains the exact string INV-2024-7831." A pure BM25 retriever is the opposite. Production RAG that actually works ends up running both, fusing the results, and then reranking the top candidates with a slower but more accurate model. Each stage exists because the previous one has a known failure mode.

What each stage contributes

| Stage | Tool | Strength | Latency budget |
| --- | --- | --- | --- |
| Lexical retrieval | BM25 in OpenSearch / Elasticsearch / Tantivy | Exact terms, named entities, codes, rare tokens | 5-20 ms for top 100 |
| Dense retrieval | HNSW over embeddings | Semantic match, paraphrases, multilingual | 10-50 ms for top 100 |
| Fusion | Reciprocal Rank Fusion (RRF) | Combine two ranked lists without score calibration | <1 ms |
| Reranking | Cross-encoder (bge-reranker, Cohere Rerank) | Final precision on top 20-50 | 50-200 ms for top 50 |

Total budget for a well-tuned pipeline: 100-300 ms before the LLM call. If you are seeing 1 s+ retrieval, one of the stages is misconfigured (almost always the reranker batch size).

Why BM25 still matters

BM25 (Robertson & Zaragoza 2009) scores each query term by its term frequency weighted by inverse document frequency, with term-frequency saturation and document-length normalisation. It is the default similarity in Elasticsearch and OpenSearch for a reason: it handles exact-match queries that dense embeddings handle poorly. An embedding model's subword tokeniser splits INV-2024-7831 into pieces, and the pooled vector blurs the identifier together with near-identical ones. BM25 indexes the identifier's literal tokens and retrieves the document that contains it on the first try.

Named entities, error codes, function names, SKUs, dates, version numbers: whenever the user is asking about a specific string, BM25 wins. For a query like "what does our refund policy say about damaged items", vector search wins. You need both.
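To make the exact-match behaviour concrete, here is a minimal sketch of BM25 scoring in pure Python. It is not how Elasticsearch or OpenSearch implement it internally. The whitespace tokeniser (which keeps hyphenated identifiers whole), the sample documents, and the function names are assumptions made for illustration; k1 = 1.2 and b = 0.75 are the usual defaults.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase, split on whitespace, strip edge punctuation.
    # Hyphenated identifiers such as "INV-2024-7831" stay as one token here;
    # a real analyzer may split them differently.
    stripped = (raw.lower().strip(".,;:!?\"'()") for raw in text.split())
    return [tok for tok in stripped if tok]

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document in `docs` for `query`."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Lucene-style IDF, which is never negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 saturates term frequency; b scales the length penalty.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Hypothetical corpus: two near-identical invoice numbers and one policy page.
docs = [
    "Refund issued for invoice INV-2024-7831 after the customer reported damage.",
    "Invoice INV-2024-7813 was paid in full on receipt.",
    "Our refund policy covers damaged items returned within 30 days.",
]
scores = bm25_scores("INV-2024-7831", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only the first document receives a non-zero score: the near-miss identifier INV-2024-7813 shares no token with the query, which is exactly the distinction a dense vector tends to smear.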

Reciprocal Rank Fusion

RRF (Cormack, Clarke, Buettcher SIGIR 2009) is the simplest combiner that works. For each document d appearing in any of the input ranked lists, compute:

RRF_score(d) = sum over lists L of  1 / (k + rank_L(d))

k = 60 is the standard value. The appeal is that you never need to calibrate the BM25 and cosine scores against each other: only ranks matter. Rank 1 in BM25 and rank 1 in dense contribute the same amount to the fused score. One strong rank outweighs two mediocre ones: with k = 60, a document ranked 1 in one list and 100 in the other scores 1/61 + 1/160 ≈ 0.0226, ahead of a document ranked 50 in both (2/110 ≈ 0.0182). The method has no parameters to tune beyond k, and tuning k rarely moves results above the noise floor.

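# Bare-bones RRF over any number of ranked lists of doc_ids (best match first)
def rrf(*ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        # Ranks are 1-based: the top hit in each list contributes 1 / (k + 1)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; a doc absent from a list adds nothing for that list
    return sorted(scores, key=scores.get, reverse=True)

# bm25_top100 and dense_top100 are the doc_id lists returned by each retriever
fused = rrf(bm25_top100, dense_top100)[:50]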

That is it. Linear combinations of normalised scores (alpha * bm25 + (1-alpha) * cosine) are tempting, but they need calibration and break the moment one retriever's score distribution shifts. RRF has a single parameter, k, that rarely needs tuning, and because it uses only ranks it is robust to such shifts.
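The fragility of score-based fusion can be shown with a toy example. The sketch below compares min-max normalised linear fusion against RRF when a single BM25 score becomes an outlier while the ranks stay the same. All scores, document IDs, and helper names are made up for illustration.

```python
def rrf(*ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def ranking(scores):
    # Turn a {doc_id: score} dict into a best-first list of doc_ids.
    return sorted(scores, key=scores.get, reverse=True)

def min_max(scores):
    # Assumes at least two distinct scores, otherwise hi - lo would be zero.
    lo, hi = min(scores.values()), max(scores.values())
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def linear_fusion(bm25, dense, alpha=0.5):
    # Assumes both dicts cover the same doc_ids.
    b, d = min_max(bm25), min_max(dense)
    fused = {doc: alpha * b[doc] + (1 - alpha) * d[doc] for doc in bm25}
    return ranking(fused)

dense = {"a": 0.5, "b": 0.9, "c": 0.3, "d": 0.8}

# Same BM25 ranking (a, b, c, d) in both cases; only the top score changes.
bm25_before = {"a": 10.0, "b": 9.0, "c": 5.0, "d": 1.0}
bm25_after = {"a": 40.0, "b": 9.0, "c": 5.0, "d": 1.0}

linear_before = linear_fusion(bm25_before, dense)  # ['b', 'a', 'd', 'c']
linear_after = linear_fusion(bm25_after, dense)    # ['a', 'b', 'd', 'c']

rrf_before = rrf(ranking(bm25_before), ranking(dense))  # ['b', 'a', 'd', 'c']
rrf_after = rrf(ranking(bm25_after), ranking(dense))    # ['b', 'a', 'd', 'c']
```

The outlier compresses every other normalised BM25 score, so the linear fusion swaps its top two results even though neither retriever changed its mind about the order. RRF sees identical ranks and returns an identical list.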

Cross-encoder reranking

A bi-encoder (the embedding model) encodes query and document independently, then takes a dot product. Fast, but it never sees the query and the document together; it cannot capture token-level interactions. A cross-encoder takes the concatenated (query, document) as input and runs a full forward pass - quadratic in length but vastly more accurate.

You cannot afford a cross-encoder over 1M documents. You can afford it over the top 50-100 from the fused stage. Standard picks:

  • bge-reranker-large / bge-reranker-v2-m3 (BAAI). Open-source, permissively licensed (check each model card), ~0.6B parameters. Strong English and Chinese; v2-m3 is multilingual. The default if you are self-hosting.
  • Cohere Rerank v3. Hosted API, very strong, supports 100+ languages and long documents. The default if you do not want to host a reranker.
  • Jina Reranker v2. Open weights (check the licence terms before commercial use), multilingual, smaller than bge.
  • Voyage rerank-2. Hosted, English-focused, strong on technical content.

Typical recall-at-10 lifts from adding a cross-encoder over an already-fused list: 5-15 points. Larger than most prompt-engineering changes.
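A reranking stage can be sketched as a function that builds every (query, document) pair and scores them in a single batched call. In the sketch below, `rerank`, `overlap_scorer`, and `load_cross_encoder_scorer` are hypothetical names. The toy scorer exists only so the example runs without a model, and the loader assumes the sentence-transformers `CrossEncoder` class with its `predict(pairs, batch_size=...)` method.

```python
def rerank(query, candidates, score_pairs, top_n=5):
    """Reorder `candidates`, a list of (doc_id, text), by relevance to `query`.

    `score_pairs` takes a list of (query, text) pairs and returns one
    score per pair, higher meaning more relevant.
    """
    pairs = [(query, text) for _, text in candidates]
    scores = score_pairs(pairs)  # one batched call, not one call per candidate
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [doc_id for (doc_id, _), _ in ranked[:top_n]]

def load_cross_encoder_scorer(model_name="BAAI/bge-reranker-base", batch_size=64):
    # Assumption: sentence-transformers is installed and the model can be
    # downloaded. Imported lazily so the rest of the sketch runs without it.
    from sentence_transformers import CrossEncoder
    model = CrossEncoder(model_name)
    return lambda pairs: model.predict(pairs, batch_size=batch_size)

def overlap_scorer(pairs):
    # Toy stand-in for a cross-encoder: fraction of query tokens found in the text.
    scores = []
    for query, text in pairs:
        q_tokens = set(query.lower().split())
        t_tokens = set(text.lower().split())
        scores.append(len(q_tokens & t_tokens) / len(q_tokens))
    return scores

candidates = [
    ("d1", "shipping times for international orders"),
    ("d2", "refund policy for damaged items"),
    ("d3", "how to reset your password"),
]
top = rerank("refund for damaged items", candidates, overlap_scorer, top_n=2)
```

Swapping `overlap_scorer` for `load_cross_encoder_scorer()` leaves the rest unchanged. Passing the scorer in as a callable keeps the batching decision in one place and makes the stage easy to test.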

The latency budget breakdown

A realistic budget for a 200 ms retrieval pipeline:

| Stage | Budget | Notes |
| --- | --- | --- |
| Query embed | 15 ms | sentence-transformers all-MiniLM on GPU, or cached |
| BM25 top 100 | 15 ms | Single-node Elasticsearch is fine until you have billions of docs |
| Dense top 100 | 30 ms | HNSW with ef_search ~100 |
| RRF | 1 ms | Pure Python, nothing to optimise |
| Rerank top 50 | 120 ms | bge-reranker-large fp16, batch size 50, on A10/L4 |
| Final top 5 | - | Pass to LLM |

The reranker dominates and is also where the biggest accuracy gains live. If you cannot afford 120 ms, run a smaller reranker (bge-reranker-base, ~280M params) or a smaller candidate set (top 20 instead of 50).
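To see where the budget actually goes, it helps to time each stage separately. The sketch below wires the stages together with per-stage timings and runs the two retrievers concurrently, since neither depends on the other. The `bm25_search`, `dense_search`, and `rerank_fn` callables are hypothetical stand-ins for your own clients, and query embedding is assumed to happen inside `dense_search`.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def rrf(*ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def timed(timings, stage, fn):
    # Run fn() and record its wall-clock duration in milliseconds.
    start = time.perf_counter()
    result = fn()
    timings[stage] = (time.perf_counter() - start) * 1000.0
    return result

def retrieve(query, bm25_search, dense_search, rerank_fn,
             n_candidates=100, n_rerank=50, n_final=5):
    timings = {}

    def both_retrievers():
        # BM25 and dense retrieval are independent, so run them in parallel.
        with ThreadPoolExecutor(max_workers=2) as pool:
            bm25_future = pool.submit(bm25_search, query, n_candidates)
            dense_future = pool.submit(dense_search, query, n_candidates)
            return bm25_future.result(), dense_future.result()

    bm25_ids, dense_ids = timed(timings, "retrieve_ms", both_retrievers)
    fused = timed(timings, "fuse_ms", lambda: rrf(bm25_ids, dense_ids)[:n_rerank])
    final = timed(timings, "rerank_ms", lambda: rerank_fn(query, fused)[:n_final])
    return final, timings

# Stub retrievers and a toy reranker (alphabetical order) so the sketch runs.
final, timings = retrieve(
    "refund policy",
    bm25_search=lambda q, n: ["a", "b", "c"][:n],
    dense_search=lambda q, n: ["b", "d", "a"][:n],
    rerank_fn=lambda q, ids: sorted(ids),
    n_final=2,
)
```

With real clients plugged in, logging `timings` per request is usually enough to tell whether a slow pipeline is spending its time in retrieval or in the reranker.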

When it falls down

  • Reranker batch size is wrong. Calling the reranker 50 times for 50 candidates is 10-20x slower than one batched call. Always batch.
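  • Different chunking between BM25 and dense. RRF assumes the documents are the same in both lists. If you chunked differently for the two indexes, the same passage carries different IDs in each list, so RRF never sums its two contributions and the fused list fills up with near-duplicates.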
  • Reranker not aware of your domain. General rerankers do not know your product taxonomy. For high-value verticals, fine-tune the reranker on a few thousand of your own (query, relevant, irrelevant) triples. The lift is large.
  • RRF can be beaten by learned fusion. With enough training data, a learned-to-rank model over (bm25_score, cosine_score, recency, ...) outperforms RRF. Most teams do not have that data. RRF is the right default.
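The chunking mismatch in the list above is cheap to detect before it reaches production. A minimal sketch, assuming you can export the chunk IDs from both indexes; the function name and the `doc#chunk` ID scheme are illustrative, not a convention of any particular engine.

```python
def id_overlap_report(bm25_ids, dense_ids):
    """Compare the chunk IDs stored in the lexical and dense indexes."""
    bm25_set, dense_set = set(bm25_ids), set(dense_ids)
    shared = bm25_set & dense_set
    union = bm25_set | dense_set
    return {
        "shared": len(shared),
        "only_bm25": len(bm25_set - dense_set),
        "only_dense": len(dense_set - bm25_set),
        # 1.0 means both indexes hold exactly the same chunks.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }

# Hypothetical exports: the dense index split doc2 into two chunks, the
# BM25 index split doc1 instead, so the ID spaces only partly agree.
bm25_index_ids = ["doc1#0", "doc1#1", "doc2#0"]
dense_index_ids = ["doc1#0", "doc2#0", "doc2#1"]

report = id_overlap_report(bm25_index_ids, dense_index_ids)
```

Anything below a Jaccard of 1.0 means some chunks can only ever appear in one list, so RRF cannot reward agreement on them. Note that matching IDs are necessary but not sufficient: the chunk boundaries behind each ID need to match as well.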

Further reading