Vector search delivers semantic similarity for retrieval-augmented generation (RAG), but it does poorly with short keyword queries and out-of-domain search terms. Supplementing vector retrieval with keyword search such as BM25, then combining the results with a reranker, is becoming the standard way to get the best of both worlds.
Rerankers are ML models that take a set of search results and reorder them to improve relevance. They examine the query paired with each candidate result in detail, which is computationally expensive but produces more accurate results than simple retrieval methods alone. This can be done either as a second stage on top of a single search (pull 100 results out of vector search, then ask the reranker to identify the top 10) or, more often, to combine results from different kinds of search; in this case, vector search and keyword search.
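The hybrid case can be sketched in a few lines of Python. This is a minimal illustration, not the code used in these tests: `rerank_hybrid` is a hypothetical helper, and `toy_score` is a hypothetical stand-in for a real reranker model, counting query-term overlap only so the example runs without any external service.

```python
from typing import Callable


def rerank_hybrid(
    query: str,
    vector_hits: list[str],
    keyword_hits: list[str],
    score: Callable[[str, str], float],
    top_k: int = 10,
) -> list[str]:
    """Pool candidates from two retrievers and reorder them with a reranker."""
    # Union the two result lists, keeping first-seen order and dropping duplicates.
    candidates = list(dict.fromkeys(vector_hits + keyword_hits))
    # The reranker scores every (query, document) pair; this is the expensive step.
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_k]


def toy_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for a reranker: fraction of query terms found in doc."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    words = set(doc.lower().split())
    return sum(t in words for t in terms) / len(terms)
```

In a real pipeline, `score` would call a hosted reranking API or a local cross-encoder, and most services accept the whole candidate list in one request rather than one pair at a time.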
But how good are off-the-shelf rerankers? To find out, I tested six rerankers on the text from the
I tested these rerankers:
Each reranker was fed the top 20 results from each of DPR and BM25, and its reranked output was scored with NDCG@5.
In the results, raw vector search (with embeddings from the bge-m3 model) is labeled dpr (dense passage retrieval). I chose bge-m3 to compute embeddings because that’s what the ColPali authors used as a baseline.
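For reference, NDCG@5 compares the discounted gain of the top five results against the best possible ordering of the judged documents. Here is a minimal sketch, assuming linear gain and a log2 position discount (one common formulation; the evaluation code behind these numbers may differ). The function and argument names are illustrative.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    # Position i (0-based) is divided by log2(i + 2), so rank 1 is undiscounted.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked: list[float], judged: list[float], k: int = 5) -> float:
    """
    ranked: relevance grade of each returned result, in the order returned.
    judged: relevance grades of every judged document for the query,
            used to build the ideal ordering.
    """
    ideal = dcg_at_k(sorted(judged, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal if ideal > 0 else 0.0
```

A score of 1.0 means the top five are in the best possible order; pushing a relevant document further down the list lowers the score.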
Here’s the data on relevance (NDCG@5):
And here’s how fast they are at reranking searches in the arxiv dataset; latency is proportional to document length. The graph shows latency, so lower is better. The self-hosted bge model was run on an NVIDIA RTX 3090 using the simplest possible code lifted straight from
Finally, here’s how much it cost to rerank the almost 3,000 searches from all six datasets with each model. Cohere prices per search (with additional fees for long documents), while the others price per token.
Reciprocal rank fusion (RRF) adds little to no value in hybrid search scenarios; on half of the datasets, it performed worse than either BM25 or DPR alone. In contrast, all ML-based rerankers tested delivered meaningful improvements over pure vector or keyword search, with Voyage rerank-2 setting the bar for relevance.
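Part of the reason is that RRF only looks at rank positions and never reads the query or the documents. A minimal sketch of the standard formula, where each document scores the sum of 1 / (k + rank) across the input rankings; k = 60 is the conventional default and is an assumption here, not necessarily the setting used in these tests:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; a document missing from a list contributes nothing.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative ids only: one list from keyword search, one from vector search.
bm25_ranking = ["a", "b", "c"]
dpr_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([bm25_ranking, dpr_ranking])
```

Documents that appear in both lists float to the top, but nothing in the formula can tell a relevant passage from an irrelevant one that merely ranked well in one retriever.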
Tradeoffs are still present: superior accuracy from Voyage rerank-2, faster processing from Cohere, or solid middle-ground performance from Jina or Voyage's lite model. Even the open-source BGE reranker, while trailing commercial options, adds significant value for teams choosing to self-host.
As foundation models continue advancing, we can expect even better performance. But today's ML rerankers are already mature enough to deploy with confidence across multilingual content.
By Jonathan Ellis, DataStax