Hybrid Retrieval - BM25 + Vector + Reranking
Why pure vector search misses exact-match queries, how RRF combines lexical and semantic results, and where a cross-encoder reranker buys back the precision you lost.
A pure dense-vector retriever is excellent at "find me passages that mean roughly the same thing" and surprisingly poor at "find me the passage that contains the exact string INV-2024-7831." A pure BM25 retriever is the opposite. Production RAG that actually works ends up running both, fusing the results, and then reranking the top candidates with a slower but more accurate model. Each stage exists because the previous one has a known failure mode.
What each stage contributes
| Stage | Tool | Strength | Latency budget |
|---|---|---|---|
| Lexical retrieval | BM25 in OpenSearch / Elasticsearch / Tantivy | Exact terms, named entities, codes, rare tokens | 5-20 ms for top 100 |
| Dense retrieval | HNSW over embeddings | Semantic match, paraphrases, multilingual | 10-50 ms for top 100 |
| Fusion | Reciprocal Rank Fusion (RRF) | Combine two ranked lists without score calibration | <1 ms |
| Reranking | Cross-encoder (bge-reranker, Cohere Rerank) | Final precision on top 20-50 | 50-200 ms for top 50 |
Total budget for a well-tuned pipeline: 100-300 ms before the LLM call. If you are seeing 1 s+ retrieval, one of the stages is misconfigured (almost always the reranker batch size).
Why BM25 still matters
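BM25 (Robertson & Zaragoza 2009) scores a document by summing, over the query terms, an inverse-document-frequency weight multiplied by a term-frequency component that saturates and is normalised by document length. It is the default similarity in Elasticsearch and OpenSearch for a reason: it handles exact-match queries that embeddings handle poorly. An embedding model's subword tokeniser splits INV-2024-7831 into pieces, and the pooled vector rarely separates it from near-identical identifiers. BM25 matches the indexed tokens directly (as one token or a few rare ones, depending on the analyzer) and typically retrieves the document that contains the identifier at rank 1.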
Named entities, error codes, function names, SKUs, dates, version numbers - anything where the user is asking about a specific string - BM25 wins. For "what does our refund policy say about damaged items" - vector wins. You need both.
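To make the exact-match behaviour concrete, here is a minimal BM25 sketch in plain Python. It is a textbook variant with a Lucene-style IDF, not the code Elasticsearch runs. The toy corpus, the whitespace tokenisation, and the function name `bm25_scores` are assumptions for illustration. The defaults k1 = 1.2 and b = 0.75 match Elasticsearch's documented defaults.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query with a textbook BM25 variant."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Lucene-style IDF, always non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Whitespace tokenisation keeps the identifier as a single token (an assumption;
# a real analyzer may split on hyphens).
corpus = [
    "invoice INV-2024-7831 was refunded in full".split(),
    "invoice INV-2024-7813 is still pending".split(),
    "our refund policy covers damaged items".split(),
]
scores = bm25_scores(["INV-2024-7831"], corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
```

Only the first document gets a non-zero score. The near-identical INV-2024-7813 scores zero because it is a different token, and that is the distinction a dense retriever tends to blur.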
Reciprocal Rank Fusion
RRF (Cormack, Clarke, Buettcher SIGIR 2009) is the simplest combiner that works. For each document d appearing in any of the input ranked lists, compute:
RRF_score(d) = sum over lists L of 1 / (k + rank_L(d))
k = 60 is the value used in the original paper and the usual default. The brilliance is that you never need to calibrate the BM25 and cosine scores against each other - only ranks matter. Rank 1 in BM25 and rank 1 in dense contribute the same to the fused score. Rank 1 in one list and rank 100 in the other still scores higher than rank 50 in both, so one confident retriever outweighs two lukewarm ones. The only parameter is k, and tuning it rarely matters above the noise floor.
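```python
# Bare-bones RRF over any number of ranked lists of doc_ids
def rrf(*ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        # Ranks are 1-based, so the top hit contributes 1 / (k + 1)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# bm25_top100 and dense_top100 are the two retrievers' ranked doc_id lists
fused = rrf(bm25_top100, dense_top100)[:50]
```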
That is it. Linear combinations of normalised scores (alpha * bm25 + (1-alpha) * cosine) are tempting, but they require calibrating two unrelated score scales and break the moment one retriever's score distribution shifts. RRF has a single parameter that rarely needs tuning, and it is robust to those shifts because it only looks at ranks.
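The rank claims above are easy to check by hand. The sketch below evaluates the RRF formula for a few rank combinations at k = 60. The helper name `rrf_contribution` is made up for this example.

```python
def rrf_contribution(ranks, k=60):
    """Fused RRF score for one document, given its rank in each list it appears in."""
    return sum(1.0 / (k + r) for r in ranks)

top_in_both = rrf_contribution([1, 1])      # 2/61
split = rrf_contribution([1, 100])          # 1/61 + 1/160
middling = rrf_contribution([50, 50])       # 2/110
one_list_only = rrf_contribution([1])       # 1/61

print(f"{top_in_both:.4f} {split:.4f} {middling:.4f} {one_list_only:.4f}")
# prints: 0.0328 0.0226 0.0182 0.0164
```

One consequence worth noticing: a document ranked 50th by both retrievers (0.0182) still edges out a document ranked 1st by one retriever and missing from the other (0.0164). Agreement between the lists is rewarded even when neither list is enthusiastic.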
Cross-encoder reranking
A bi-encoder (the embedding model) encodes query and document independently, then takes a dot product. Fast, but it never sees the query and the document together, so it cannot capture token-level interactions. A cross-encoder takes the concatenated (query, document) pair as input and runs a full forward pass for every candidate. That is far more expensive per document (attention cost grows quadratically with the combined length, and nothing can be precomputed at index time) but markedly more accurate.
You cannot afford a cross-encoder over 1M documents. You can afford it over the top 50-100 from the fused stage. Standard picks:
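- bge-reranker-large / bge-reranker-v2-m3 (BAAI). Open weights under permissive licences (check each model card for the exact terms), ~0.6B parameters. bge-reranker-large is strong on English and Chinese; v2-m3 is multilingual. The default if you are self-hosting.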
- Cohere Rerank v3. Hosted API, very strong, supports 100+ languages and long documents. The default if you do not want to host a reranker.
- Jina Reranker v2. Open weights (check the licence before commercial use), multilingual, smaller than bge-reranker-large.
- Voyage rerank-2. Hosted, English-focused, strong on technical content.
Typical recall-at-10 lifts from adding a cross-encoder over an already-fused list: 5-15 points. Larger than most prompt-engineering changes.
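A minimal sketch of the rerank step, assuming the scorer is any callable that takes a list of (query, document) pairs and returns one score per pair. The names `rerank` and `toy_overlap_scorer` are hypothetical. The toy scorer is a word-overlap stand-in so the example runs without a model, and it is not a cross-encoder. With sentence-transformers installed, the `predict` method of a `CrossEncoder` should fit the same slot.

```python
from typing import Callable, Sequence

def rerank(query: str, docs: Sequence[str],
           score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
           top_k: int = 5) -> list[tuple[str, float]]:
    """Score all (query, doc) pairs in one batched call and keep the best top_k."""
    pairs = [(query, doc) for doc in docs]
    scores = score_pairs(pairs)  # one call for the whole candidate set
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_k]]

# Stand-in scorer (NOT a cross-encoder): fraction of query words found in the doc.
# With sentence-transformers, something like
#   CrossEncoder("BAAI/bge-reranker-base").predict
# could be passed instead.
def toy_overlap_scorer(pairs):
    out = []
    for query, doc in pairs:
        q_words = set(query.lower().split())
        d_words = set(doc.lower().split())
        out.append(len(q_words & d_words) / len(q_words))
    return out

candidates = [
    "Shipping times vary by region.",
    "Damaged items can be returned for a full refund within 30 days.",
    "Refund requests are processed weekly.",
]
top = rerank("refund for damaged items", candidates, toy_overlap_scorer, top_k=2)
```

Passing the scorer in as a callable keeps the pipeline code independent of which reranker you pick, and it makes the single batched call explicit.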
The latency budget breakdown
A realistic budget for a 200 ms retrieval pipeline:
| Stage | Budget | Notes |
|---|---|---|
| Query embed | 15 ms | sentence-transformers all-MiniLM on GPU, or cached |
| BM25 top 100 | 15 ms | Single-node Elasticsearch is fine until you have billions of docs |
| Dense top 100 | 30 ms | HNSW with ef_search ~100 |
| RRF | 1 ms | Pure Python, nothing to optimise |
| Rerank top 50 | 120 ms | bge-reranker-large fp16, batch size 50, on A10/L4 |
| Final top 5 | - | Pass to LLM |
The reranker dominates and is also where the biggest accuracy gains live. If you cannot afford 120 ms, run a smaller reranker (bge-reranker-base, ~110M params) or a smaller candidate set (top 20 instead of 50).
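The arithmetic behind that budget, as a quick check. The stage numbers are taken from the table. The overlap model is an assumption: BM25 does not need the query embedding, so it can run concurrently with the embed and dense stages.

```python
# Stage budgets in milliseconds, copied from the table above.
budget_ms = {"embed": 15, "bm25": 15, "dense": 30, "rrf": 1, "rerank": 120}

# Fully sequential: every stage waits for the previous one.
sequential = sum(budget_ms.values())

# BM25 does not need the query embedding, so it can overlap with embed + dense.
retrieval = max(budget_ms["bm25"], budget_ms["embed"] + budget_ms["dense"])
overlapped = retrieval + budget_ms["rrf"] + budget_ms["rerank"]

rerank_share = budget_ms["rerank"] / sequential
print(sequential, overlapped, f"{rerank_share:.0%}")
# prints: 181 166 66%
```

Running the two retrievers in parallel saves about 15 ms under these numbers. Shrinking the reranker or its candidate set is the only change that moves the total substantially.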
When it falls down
- Reranker batch size is wrong. Calling the reranker 50 times for 50 candidates is 10-20x slower than one batched call. Always batch.
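- Different chunking between BM25 and dense. RRF merges the lists by document ID, so it assumes both indexes contain the same units. If you chunked differently for the two indexes, the same passage shows up under different IDs, never accumulates score from both lists, and the fusion degrades into interleaving two unrelated rankings.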
- Reranker not aware of your domain. General rerankers do not know your product taxonomy. For high-value verticals, fine-tune the reranker on a few thousand of your own (query, relevant, irrelevant) triples. The lift is large.
- RRF can be beaten by learned fusion. With enough training data, a learning-to-rank model over (bm25_score, cosine_score, recency, ...) outperforms RRF. Most teams do not have that data. RRF is the right default.
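The chunking pitfall in the list above can be shown in a few lines. The sketch assumes a hypothetical ID scheme, `<parent_doc>#<chunk>`, and a helper named `to_parent_ids` that collapses chunk hits to parent documents before fusion. Sharing one chunking across both indexes is the cleaner fix when you can afford to re-index.

```python
def to_parent_ids(chunk_ids):
    """Map chunk IDs like 'doc7#p2' to parent doc IDs, keeping first-seen order."""
    seen, parents = set(), []
    for chunk_id in chunk_ids:
        parent = chunk_id.split("#", 1)[0]
        if parent not in seen:
            seen.add(parent)
            parents.append(parent)
    return parents

bm25_hits = ["doc7#p2", "doc3#p1", "doc7#p5"]    # paragraph-level chunks
dense_hits = ["doc3#w0", "doc9#w4", "doc7#w1"]   # sliding-window chunks

shared_raw = set(bm25_hits) & set(dense_hits)    # empty: chunk IDs never collide
bm25_docs = to_parent_ids(bm25_hits)             # ['doc7', 'doc3']
dense_docs = to_parent_ids(dense_hits)           # ['doc3', 'doc9', 'doc7']
shared_docs = set(bm25_docs) & set(dense_docs)   # {'doc3', 'doc7'}
```

Fusing at the parent level restores the overlap RRF depends on, at the cost of chunk granularity. You then need a second step to choose which chunk of each winning document goes to the reranker.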
Further reading
- Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods - Cormack, Clarke, Buettcher, SIGIR 2009 - the two-page paper that defines the method.
- BAAI/bge-reranker-large on Hugging Face - the open-source default for English/Chinese reranking.
- Elasticsearch similarity (BM25) documentation - the canonical implementation with its three tuning knobs (k1, b, discount_overlaps).
- Weaviate hybrid search docs - one of the cleanest end-to-end implementations of BM25 + vector + RRF in a single query.