Fast vs. accurate: Measuring the recall of quantized vector search

Explaining how to measure recall for vector search in Elasticsearch with minimal setup.

From vector search to powerful REST APIs, Elasticsearch offers developers the most extensive search toolkit. Dive into our sample notebooks in the Elasticsearch Labs repo to try something new. You can also start your free trial or run Elasticsearch locally today.

Everyone wants vector search to be instant. But high-dimensional vectors are heavy. A single 1,024-dimension float32 vector takes 4 KB of memory, and comparing it against millions of others is computationally expensive.

To solve this, search engines like Elasticsearch use two main optimization strategies:

  1. Approximate search (hierarchical navigable small world [HNSW]): Instead of scanning every document, we build a navigation graph to jump quickly to the likely neighborhood of the answer.
  2. Quantization: We compress the vectors (for example, from 32-bit floats to 8-bit integers or even 1-bit binary values) to reduce memory usage and speed up calculations.
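To get a feel for the scale, here's a quick back-of-the-envelope calculation of the memory cost at each precision (the one-million-vector corpus size is an assumption for illustration):

```python
# Rough memory footprint of one million 1,024-dimension vectors
# at three precision levels.
DIMS = 1024
NUM_VECTORS = 1_000_000

float32_bytes = DIMS * 4 * NUM_VECTORS   # 32-bit floats: 4 bytes per dim
int8_bytes = DIMS * 1 * NUM_VECTORS      # 8-bit ints: 1 byte per dim
binary_bytes = DIMS // 8 * NUM_VECTORS   # 1-bit values: 8 dims per byte

for label, size in [("float32", float32_bytes),
                    ("int8", int8_bytes),
                    ("binary", binary_bytes)]:
    print(f"{label:>7}: {size / 1024**3:.2f} GiB")
```

Going from float32 to binary is a 32x reduction, which is exactly why the recall question matters so much.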

But optimization often comes with a tax: accuracy.

The fear is valid: "If I compress my data and take shortcuts during the search, will I miss the best results?" "Does this optimization degrade the relevance of my search engine?"

To prove that Elastic’s quantization doesn’t degrade results, we built a repeatable test harness using the DBPedia-14 dataset to calculate exactly how much accuracy (specifically, recall) we trade for speed when using default optimizations in Elasticsearch.

TL;DR: It’s likely much less than you think. Check out the notebook here and try it yourself.

The definitions (for the non-experts)

Before we look at the code, let’s level-set on some terms.

  • Relevance versus recall: Relevance is subjective (did I find good stuff?). Recall is mathematical. If there are 10 documents in the database that are the perfect mathematical matches for your query, and the search engine finds nine of them, your recall is 90% (or 0.9).
  • Exact search (flat): Sometimes called the "brute force" method. The search engine scans every single document in an index and calculates the distance.
    • Pros: 100% perfect recall.
    • Cons: Computationally expensive and slow at scale.
  • Approximate search (HNSW): The "shortcut" method. The search engine builds an HNSW graph. It traverses the graph to find the nearest neighbors.
    • Pros: Extremely fast and scalable.
    • Cons: You might miss a neighbor if the graph traversal stops too early.
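The brute-force approach described above fits in a few lines of code. A toy sketch with NumPy (random vectors for illustration; a real index delegates this scan to Lucene):

```python
import numpy as np

def exact_knn(query, vectors, k):
    """Score every vector against the query (cosine) and return top-k indices."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                     # one dot product per document
    return np.argsort(-scores)[:k]     # highest similarity first

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 8))
# Querying with doc 42 itself: it should rank first at similarity 1.0.
print(exact_knn(docs[42], docs, k=3))
```

Every document is scored on every query, which is why this method guarantees perfect recall but scales poorly.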

The experiment: Exact versus approximate

To test recall, we used the DBPedia-14 dataset, a large dataset of titles and abstracts across 14 ontology classes, commonly used for training and evaluating text categorization models. Specifically, we focus on the "Film" category. We wanted to compare the optimized production settings against a mathematically perfect ground truth.

For this experiment, we are using the jina-embeddings-v5-text-small model, a strong multilingual model for text representation. By pairing a high-quality embedding model with Elasticsearch’s native quantization, we can demonstrate a search architecture that is both computationally efficient and uncompromising on retrieval quality.

We set up an index with dual mapping. We ingested the same text into two different fields simultaneously:

  1. content.raw with type flat: This forces Elasticsearch to perform a brute-force scan of the full float32 vectors, returning exact results that serve as our baseline.
  2. content with type semantic_text: By default, this uses HNSW + Better Binary Quantization (BBQ), the standard, optimized production setting for approximate search.

The Recall@10 test

For our metric, we used Recall@10.

We picked 50 random movies and ran the same query against both fields.

  • If the exact (flat) search says the top 10 neighbors are IDs [1, 2, 3... 10].
  • And the approximate (HNSW) search returns IDs [1, 2, 3... 9, 99].
  • We found nine out of the top 10 correctly. The score is 0.9.
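Recall@10 is just the overlap between two top-10 lists. A minimal helper (the function name is our own) could look like:

```python
def recall_at_k(exact_ids, approx_ids, k=10):
    """Fraction of the exact top-k that the approximate search recovered."""
    truth = set(exact_ids[:k])
    found = set(approx_ids[:k])
    return len(truth & found) / len(truth)

# The example from above: HNSW returns 9 of the exact top 10, plus ID 99.
exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
approx = [1, 2, 3, 4, 5, 6, 7, 8, 9, 99]
print(recall_at_k(exact, approx))  # 0.9
```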

The full index mapping is available in the notebook linked above.
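A sketch of what such a dual mapping can look like, written as a Python dict (the dimension count and inference endpoint name are assumptions, and the exact-search field is written as a sibling content_raw field here for brevity; see the notebook for the real mapping):

```python
# Dual-mapping sketch: one approximate field, one exact baseline field.
# "jina-embeddings" and dims=1024 are assumed placeholder values.
mappings = {
    "properties": {
        # Approximate path: semantic_text defaults to HNSW + BBQ.
        "content": {
            "type": "semantic_text",
            "inference_id": "jina-embeddings",
        },
        # Exact baseline: full float32 vectors, brute-force "flat" scan.
        "content_raw": {
            "type": "dense_vector",
            "dims": 1024,
            "index_options": {"type": "flat"},
        },
    }
}
```

Ingesting the same text into both fields lets a single index serve both the ground-truth query and the optimized query.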

The results: The "flat line" of success

We ran a scale test, reloading the full dataset and testing against index sizes of 1,000 to 40,000 documents.

Here’s what happened to the recall score:

| Documents | Recall@10 score |
| --- | --- |
| 1,000 | 1.000 (100%) |
| 5,000 | 0.998 (99.8%) |
| 10,000 | 0.992 (99.2%) |
| 20,000 | 0.999 (99.9%) |
| 40,000 | 0.992 (99.2%) |

The results were incredibly stable. Even as we scaled up, the approximate search matched the brute-force exact search >99% of the time.

Why did it work so well?

You might expect that compressing vectors to binary values would hurt accuracy more than this. The reason it doesn't lies in how Elasticsearch handles the retrieval.

Most embedding models today output float32 vectors, which are large. To make search efficient, Elasticsearch applies quantization to high-dimensional vectors. Specifically, since version 9.2, it uses BBQ by default.

BBQ uses a rescoring mechanism:

  1. Traversal: The search engine uses the compressed (quantized) vectors to traverse the HNSW graph quickly. Because the vectors are small, it can efficiently over-sample, gathering a larger list of candidates (for example, the top 100 roughly similar docs) without a performance penalty.
  2. Rescore: Once it has those candidates, it retrieves the full-precision values for just those few documents to calculate the final, precise ranking.

This gives you the best of both worlds: the speed of quantization for the heavy lifting, and the precision of floats for the final sort.
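You can simulate the same two-phase idea outside Elasticsearch. Here is a toy sketch with sign-bit binary quantization, an over-sampled candidate pass, and a full-precision rescore (corpus size, dimensions, and the candidate count are arbitrary; this is not Elastic's actual BBQ implementation, which uses a more sophisticated quantization scheme):

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.normal(size=(5000, 64)).astype(np.float32)
query = rng.normal(size=64).astype(np.float32)

# Phase 1: score cheap 1-bit versions and over-sample candidates.
binary_docs = docs > 0                    # naive sign-bit "quantization"
binary_query = query > 0
hamming = (binary_docs != binary_query).sum(axis=1)
candidates = np.argsort(hamming)[:100]    # gather top 100, not just top 10

# Phase 2: rescore only those candidates with full-precision floats.
scores = docs[candidates] @ query
top10 = candidates[np.argsort(-scores)[:10]]
print(top10)
```

The expensive float math runs on 100 documents instead of 5,000, while the over-sampling gives the rescore phase a wide enough net to recover neighbors the coarse pass mis-ranked.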

Can we do better?

It’s worth noting that the results we’re seeing here are using default settings and a random sampling of data. Think of this as a high-performance starting point. While Jina v5 is a beast, these recall scores aren't a "one size fits all" guarantee for every dataset. Every data collection has its own quirks, and while you can definitely tune things further to squeeze out even more performance, you should always benchmark against your own specific data to see where your ceiling is.

Conclusion

This is a very small-scale test. But the point of the exercise is not to measure the embedding model or BBQ specifically; it’s to demonstrate how easily you can measure the recall of your own dataset with minimal setup.

If you want to run this test on your own data, you can check out the notebook here and try it yourself.
