Hybrid Retrieval - BM25 + Vector + Reranking

A pure dense-vector retriever is excellent at "find me passages that mean roughly the same thing" and surprisingly poor at "find me the passage that contains the exact string INV-2024-7831." A pure BM25 retriever is the opposite. Production RAG that actually works ends up running both, fusing the results, and then reranking the top candidates with a slower, more accurate model. Each stage exists because the previous one has a known failure mode.

What each stage contributes

Stage	Tool	Strength	Latency budget
Lexical retrieval	BM25 in OpenSearch / Elasticsearch / Tantivy	Exact terms, named entities, codes, rare tokens	5-20 ms for top 100
Dense retrieval	HNSW over embeddings	Semantic match, paraphrases, multilingual	10-50 ms for top 100
Fusion	Reciprocal Rank Fusion (RRF)	Combine two ranked lists without score calibration	<1 ms
Reranking	Cross-encoder (bge-reranker, Cohere Rerank)	Final precision on top 20-50	50-200 ms for top 50

Total budget for a well-tuned pipeline: 100-300 ms before the LLM call. If you are seeing 1 s+ retrieval, one of the stages is misconfigured (almost always the reranker batch size).

Why BM25 still matters

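BM25 (Robertson & Zaragoza 2009) scores a document by summing, over the query terms, an inverse-document-frequency weight multiplied by a term-frequency component that saturates and is normalised by document length. It is the default similarity in Elasticsearch and OpenSearch for a reason: it handles exact-match queries that embeddings handle badly. An embedding model's subword tokeniser splits INV-2024-7831 into pieces, and the pooled vector rarely preserves the identifier. BM25, given an analyzer that keeps the identifier searchable (as a single token or as a phrase over its parts), retrieves the document that contains it on the first try.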

Named entities, error codes, function names, SKUs, dates, version numbers - anything where the user is asking about a specific string - BM25 wins. For "what does our refund policy say about damaged items" - vector wins. You need both.
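
To make the exact-match behaviour concrete, here is a minimal BM25 scorer in pure Python. It is a sketch, not what Elasticsearch or Tantivy run internally: the whitespace tokeniser, the toy corpus, and the parameter values k1 = 1.2 and b = 0.75 (common defaults) are assumptions chosen for illustration, and the IDF uses the Lucene-style log(1 + (N - n + 0.5) / (n + 0.5)) variant.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split, so "INV-2024-7831" stays one token.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 makes term frequency saturate; b controls length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Invoice INV-2024-7831 was refunded after the item arrived damaged.",
    "Our refund policy covers damaged items within 30 days.",
    "Invoice INV-2023-1120 is still outstanding.",
]
scores = bm25_scores("INV-2024-7831", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only the first document contains the identifier as a token, so it is the only one with a non-zero score. The near-miss INV-2023-1120 scores zero, which is exactly the distinction a dense retriever tends to blur.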

Reciprocal Rank Fusion

RRF (Cormack, Clarke, Buettcher SIGIR 2009) is the simplest combiner that works. For each document d appearing in any of the input ranked lists, compute:

RRF_score(d) = sum over lists L of  1 / (k + rank_L(d))

k = 60 is the standard. The brilliance is that you never need to calibrate the BM25 and cosine scores against each other - only ranks matter. Rank 1 in BM25 and rank 1 in dense contribute the same to the fused score. A top rank in either list counts for a lot: with k = 60, rank 1 in one list and rank 100 in the other scores 1/61 + 1/160, about 0.0226, which beats rank 50 in both (2/110, about 0.0182). The method has no parameters to tune beyond k, and tuning k rarely matters above the noise floor.

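# Bare-bones RRF over any number of ranked lists of doc_ids (best first)
def rrf(*ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        # Ranks are 1-based, matching the formula above.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# bm25_top100 and dense_top100 are the doc_id lists returned by the two retrievers.
fused = rrf(bm25_top100, dense_top100)[:50]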

That is it. Linear combinations of normalised scores (alpha * bm25 + (1-alpha) * cosine) are tempting, but they need a normalisation scheme and a tuned alpha, and they break the moment one retriever's score distribution shifts. RRF has a single parameter that rarely needs tuning and is robust to such shifts.
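
The contrast between score-based and rank-based fusion can be shown on a toy example. In the sketch below, the two BM25 score sets produce the same ranking (a, b, c) but different gaps between scores. Min-max normalisation, alpha = 0.7, the helper names, and the toy scores are all assumptions for illustration, and both retrievers are assumed to return the same set of documents.

```python
def ranks_from_scores(scores):
    """Return doc_ids ordered best-first by score."""
    return sorted(scores, key=scores.get, reverse=True)


def rrf(*ranked_lists, k=60):
    """Reciprocal Rank Fusion: only ranks are used, never raw scores."""
    fused = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


def min_max(scores):
    """Rescale scores to [0, 1]; assumes at least two distinct values."""
    lo, hi = min(scores.values()), max(scores.values())
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def linear_fusion(bm25, dense, alpha=0.7):
    """alpha * bm25 + (1 - alpha) * cosine over min-max normalised scores."""
    b, d = min_max(bm25), min_max(dense)
    combined = {doc: alpha * b[doc] + (1 - alpha) * d[doc] for doc in b}
    return sorted(combined, key=combined.get, reverse=True)


dense = {"b": 0.90, "c": 0.85, "a": 0.50}
bm25_tight = {"a": 10.0, "b": 9.0, "c": 1.0}   # b is close behind a
bm25_spread = {"a": 10.0, "b": 2.0, "c": 1.0}  # same ranking, b far behind a

linear_tight = linear_fusion(bm25_tight, dense)
linear_spread = linear_fusion(bm25_spread, dense)
rrf_tight = rrf(ranks_from_scores(bm25_tight), ranks_from_scores(dense))
rrf_spread = rrf(ranks_from_scores(bm25_spread), ranks_from_scores(dense))
```

Here linear fusion puts b first with the tight scores and a first with the spread scores, while RRF returns b, a, c both times because the ranks never changed. Whether sensitivity to score gaps helps or hurts depends on how trustworthy the raw scores are; RRF simply opts out of the question.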

Cross-encoder reranking

A bi-encoder (the embedding model) encodes query and document independently, then takes a dot product. Fast, but it never sees the query and the document together; it cannot capture token-level interactions. A cross-encoder takes the concatenated (query, document) as input and runs a full forward pass - quadratic in length but vastly more accurate.

You cannot afford a cross-encoder over 1M documents. You can afford it over the top 50-100 from the fused stage. Standard picks:

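bge-reranker-large / bge-reranker-v2-m3 (BAAI). Open weights under permissive licenses (MIT for bge-reranker-large; check the model card for v2-m3), ~0.6B parameters. bge-reranker-large targets English and Chinese; v2-m3 is multilingual. The default if you are self-hosting.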
Cohere Rerank v3. Hosted API, very strong, supports 100+ languages and long documents. The default if you do not want to host a reranker.
Jina Reranker v2. Open weights, multilingual, smaller than bge-reranker-large. Check the license terms before commercial use.
Voyage rerank-2. Hosted, English-focused, strong on technical content.

Typical recall-at-10 lifts from adding a cross-encoder over an already-fused list: 5-15 points. Larger than most prompt-engineering changes.
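
A reranking step over the fused candidates might look like the sketch below. To keep it runnable without a model download, the scorer is injected: rerank, PairScorer and overlap_scorer are hypothetical names, and overlap_scorer is a toy stand-in, not a cross-encoder. With sentence-transformers, a CrossEncoder's predict method takes a list of (query, passage) pairs plus a batch_size argument and returns one score per pair, so it should fit the score_pairs slot; treat that wiring as an assumption to verify against the library version you use.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer type: takes (query, passage) pairs, returns one score per pair.
PairScorer = Callable[[List[Tuple[str, str]]], Sequence[float]]


def rerank(
    query: str,
    candidates: List[Tuple[str, str]],
    score_pairs: PairScorer,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Rerank (doc_id, text) candidates using a single batched scorer call."""
    if not candidates:
        return []
    pairs = [(query, text) for _, text in candidates]
    scores = score_pairs(pairs)  # one call for the whole batch, not one per candidate
    doc_ids = [doc_id for doc_id, _ in candidates]
    ranked = sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc_id, float(score)) for doc_id, score in ranked[:top_k]]


def overlap_scorer(pairs):
    # Stand-in for the example only: fraction of query tokens found in the passage.
    out = []
    for query, passage in pairs:
        q_tokens = set(query.lower().split())
        p_tokens = set(passage.lower().split())
        out.append(len(q_tokens & p_tokens) / len(q_tokens))
    return out


candidates = [
    ("doc-1", "shipping times for international orders"),
    ("doc-2", "refund policy for damaged items"),
    ("doc-3", "how to return damaged goods"),
]
top = rerank("refund policy damaged items", candidates, overlap_scorer, top_k=2)
```

The design point is that rerank hands every pair to the scorer in one call, which lets a real model batch them on the GPU instead of paying per-call overhead for each candidate.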

The latency budget breakdown

A realistic budget for a 200 ms retrieval pipeline:

Stage	Budget	Notes
Query embed	15 ms	sentence-transformers all-MiniLM on GPU, or cached
BM25 top 100	15 ms	Single-node Elasticsearch is fine until you have billions of docs
Dense top 100	30 ms	HNSW with ef_search ~100
RRF	1 ms	Pure Python, nothing to optimise
Rerank top 50	120 ms	bge-reranker-large fp16, batch size 50, on A10/L4
Final top 5	-	Pass to LLM

The reranker dominates and is also where the biggest accuracy gains live. If you cannot afford 120 ms, run a smaller reranker (bge-reranker-base, ~110M params) or a smaller candidate set (top 20 instead of 50).
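
To find which stage is blowing the budget, per-stage wall-clock timing is usually enough. The sketch below is one way to do it with the standard library only; stage_timer, over_budget and BUDGET_MS are hypothetical names, the budget numbers are copied from the table above, and the work inside each with-block is a placeholder for the real stage.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(timings, name):
    """Record wall-clock milliseconds for one pipeline stage into `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000.0


def over_budget(timings, budget_ms):
    """Return {stage: (measured_ms, budget_ms)} for stages that exceeded budget."""
    return {
        stage: (measured, budget_ms[stage])
        for stage, measured in timings.items()
        if stage in budget_ms and measured > budget_ms[stage]
    }


# Per-stage budgets in milliseconds, taken from the table above.
BUDGET_MS = {"embed": 15, "bm25": 15, "dense": 30, "rrf": 1, "rerank": 120}

timings = {}
with stage_timer(timings, "rrf"):
    sum(range(1000))   # placeholder for the fusion step
with stage_timer(timings, "rerank"):
    time.sleep(0.01)   # placeholder for the reranker call

slow_stages = over_budget(timings, BUDGET_MS)
```

Logging the timings dict per request and alerting on over_budget makes a misconfigured stage visible quickly. The budgets in the table sum to 181 ms if the stages run back to back, which leaves some headroom inside the 200 ms target.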

When it falls down

Reranker batch size is wrong. Calling the reranker 50 times for 50 candidates is 10-20x slower than one batched call. Always batch.
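Different chunking between BM25 and dense. RRF merges the lists by document ID, so it assumes both indexes contain the same chunks under the same IDs. If you chunked differently for the two indexes, the same passage shows up under different IDs, never accumulates score from both lists, and can appear twice in the fused output.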
Reranker not aware of your domain. General rerankers do not know your product taxonomy. For high-value verticals, fine-tune the reranker on a few thousand of your own (query, relevant, irrelevant) triples. The lift is large.
RRF can be beaten by learned fusion. With enough training data, a learning-to-rank model over (bm25_score, cosine_score, recency, ...) outperforms RRF. Most teams do not have that data. RRF is the right default.
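
The chunking pitfall above can be caught before it reaches production by comparing the chunk IDs held in the two indexes. This is a minimal sketch: id_overlap_report is a hypothetical helper, the ID format is made up, and how you export the ID lists from each index is left to your stack.

```python
def id_overlap_report(bm25_ids, dense_ids):
    """Compare the chunk-ID sets of two indexes and summarise the mismatch."""
    bm25_ids, dense_ids = set(bm25_ids), set(dense_ids)
    shared = bm25_ids & dense_ids
    union = bm25_ids | dense_ids
    return {
        "shared": len(shared),
        "only_bm25": sorted(bm25_ids - dense_ids),
        "only_dense": sorted(dense_ids - bm25_ids),
        # 1.0 means the two indexes hold exactly the same chunk IDs.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


report = id_overlap_report(
    ["doc1#0", "doc1#1", "doc2#0"],
    ["doc1#0", "doc1#1", "doc2#0-a", "doc2#0-b"],
)
```

In this toy case doc2 was split into one chunk for BM25 and two for the dense index, so the report lists three IDs that RRF can never match across lists. A Jaccard value well below 1.0 is a sign the two indexes were built from different chunkers.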