Elasticsearch DiskBBQ: 5x faster query quantization


Asymmetric quantization cuts the time Elasticsearch DiskBBQ spends quantizing queries by 5x. We discovered that too much time was spent quantizing queries. DiskBBQ started off quantizing queries with the same centroids as the indexed documents. However, we can make this cheaper by quantizing the queries with coarser-grained centroids. This improves query latency with very little observed recall impact in our tests.

How DiskBBQ uses two centroid tiers for asymmetric quantization

DiskBBQ now uses two centroid tiers (fine-grained document centroids and coarser query centroids) so queries are quantized once per parent centroid instead of once per document centroid.

The old mental model is "one centroid does everything for a posting list." The new model splits responsibilities:

Document centroids (fine-grained): Still used for posting-list structure and document centering.
Query centroids (coarser): A parent centroid reused across multiple document centroids.

So instead of quantizing the query independently for every document centroid we visit, we quantize per parent centroid and reuse that work across all of its children. Since we were already using two-tier clustering logic as the index size grew, it was a natural fit. We can reuse the work we already do during querying.

Diagram labeled “Asymmetric Quantization,” showing a blue circle marked “q” connected by arrows to two dashed green circles labeled “Parent P0” and “Parent P1.” Each parent contains four smaller white circles, and a green oval at the bottom reads “2 quantizations.”

Diagram labeled “Symmetric quantization,” showing a blue circle marked “q” at the top with arrows pointing to eight white circles arranged in two rows. An orange oval at the bottom reads “8 quantizations.”

These images are a simple representation of our goal: Quantizing per centroid gives us overhead per centroid. Let’s get rid of it!

The goal is to significantly reduce the number of times we actually need to quantize a given query.
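As a rough illustration of the counts in the diagrams, the sketch below compares how many query quantizations each scheme needs. The `parent_of` mapping and the `count_quantizations` helper are hypothetical names chosen for this example, not part of Elasticsearch.

```python
# Hypothetical mapping from document centroid ordinal to parent (query) centroid
# ordinal, mirroring the diagrams: 8 document centroids, 2 parents of 4 children.
parent_of = [0, 0, 0, 0, 1, 1, 1, 1]


def count_quantizations(visited, parent_of=None):
    """Count the query quantizations needed to score the visited document centroids."""
    if parent_of is None:
        # Symmetric: the query is centered and quantized once per visited centroid.
        return len(visited)
    # Asymmetric: one quantization per distinct parent among the visited centroids.
    return len({parent_of[c] for c in visited})


visited = list(range(8))
symmetric = count_quantizations(visited)
asymmetric = count_quantizations(visited, parent_of)
print(symmetric, asymmetric)  # 8 2
```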

The math behind asymmetric BBQ in Elasticsearch

To center the data prior to computing the quantized query and document vectors, $q$ and $d$, we rewrite the dot product $q^td$ as $(m+q-m)^t(m+d-m)$ and expand. We can perform exactly the same operation using different centroids for the query vector $q$ and the document vector $d$, which we denote $m_q$ and $m_d$, respectively. Specifically,

$$q^td = (q-m_q)^t(d-m_d) + m_q^t(d-m_d) + m_d^tq \tag{1}$$

As for standard Better Binary Quantization (BBQ), we quantize $q-m_q$ and $d-m_d$ in order to estimate the component of the dot product that is specific to each (document, query) pair. The quantities $m_d^tq$ and $m_q^t(d-m_d)$ are scalars, so they cost just two extra additions per dot product we compute. We get $m_d^tq$ naturally when finding the nearest centroids. The term $m_q^t(d-m_d)$ can be stored with each quantized document vector at a cost of just 4 bytes. Below, we’ll discuss how to manage the other term on the fly.
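The decomposition $q^td = (q-m_q)^t(d-m_d) + m_q^t(d-m_d) + m_d^tq$ is exact before any quantization is applied, which is easy to check numerically. The snippet below is a minimal sketch using random vectors; the helper names `dot` and `sub` are just for this example.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


rng = random.Random(0)
dim = 8
# Random query, document, query centroid, and document centroid.
q, d, m_q, m_d = ([rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4))

exact = dot(q, d)
centered = dot(sub(q, m_q), sub(d, m_d))  # estimated from quantized vectors in practice
doc_correction = dot(m_q, sub(d, m_d))    # scalar that can be stored with the document
query_correction = dot(m_d, q)            # scalar available from the centroid search
reconstructed = centered + doc_correction + query_correction

assert abs(exact - reconstructed) < 1e-9
```

Only the `centered` term is approximated in practice; the two correction scalars are computed from unquantized values.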

Asymmetric BBQ in DiskBBQ

We cluster the document centroids (using k-means, for example) into $k_q<k_d$ clusters, where $k_q$ and $k_d$ are the query and document centroid counts, respectively. This means there’s a many-to-one mapping from document centroids to query centroids. We’ll denote the document centroids by their index $i \in [k_d]$ and define this mapping to the query centroids as $j : [k_d] \to [k_q]$.
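To make the mapping $j$ concrete, here is a minimal sketch that assigns each document centroid to its nearest query centroid, which is the assignment step of k-means. It assumes the query centroids are already known; the function name `assign_parents` and the toy coordinates are made up for illustration and are not taken from the Elasticsearch implementation.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_parents(doc_centroids, query_centroids):
    """Return j as a list: j[i] is the ordinal of the query centroid nearest to document centroid i."""
    return [
        min(range(len(query_centroids)),
            key=lambda p: squared_distance(c, query_centroids[p]))
        for c in doc_centroids
    ]


# Toy data: k_d = 6 document centroids forming two well-separated groups, k_q = 2.
doc_centroids = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
                 [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
query_centroids = [[0.1, 0.13], [5.0, 5.0]]

j = assign_parents(doc_centroids, query_centroids)
print(j)  # [0, 0, 0, 1, 1, 1]
```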

Since there’s a unique query centroid for each document centroid, we only need to cache one value for $m_q^t(d-m_d)$ per quantized document vector. That is, for each document vector $d_k$ in posting list $i$, we need to cache $m_{j(i)}^t(d_k-m_i)$ with the quantized document vector.

When we come to compute the dot products between a query and the document vectors in a cluster, we look up the quantized query vector corresponding to $q-m_{j(i)}$, and we compute $m_i^tq$ once and use it to process the whole posting list. The quantization process is significantly more expensive than computing the dot product, so this is a big net win.

The $(q-m_q)^t(d-m_d)$ term is estimated using the usual BBQ machinery, that is, these vectors will be quantized and the dot product value estimated from the quantized vectors. Then we can use (1) to compute the final dot product estimate. Notice that this means we only need to quantize the query at most $k_q$ times. Furthermore, we typically visit many centroids from the same parent centroid in a search because they’re close to one another.
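The scoring flow described above can be sketched as follows. This is an illustration under stated assumptions, not the Elasticsearch implementation: `quantize` is a crude rounding stand-in for the real BBQ quantizer, and `index_document` and `QueryScorer` are hypothetical names.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def quantize(v):
    # Stand-in quantizer (rounds to steps of 1/16); NOT the real BBQ scheme.
    return [round(x * 16) / 16 for x in v]


def index_document(d, m_i, m_parent):
    """Index time: quantize d - m_i and cache the scalar m_parent^t (d - m_i)."""
    centered = sub(d, m_i)
    return quantize(centered), dot(m_parent, centered)


class QueryScorer:
    def __init__(self, q, query_centroids):
        self.q = q
        self.query_centroids = query_centroids
        self.quantized = {}  # query centroid ordinal -> quantized (q - m_q)
        self.quantize_calls = 0

    def quantized_query(self, parent):
        # Quantize the query at most once per query (parent) centroid.
        if parent not in self.quantized:
            self.quantize_calls += 1
            self.quantized[parent] = quantize(sub(self.q, self.query_centroids[parent]))
        return self.quantized[parent]

    def score_posting_list(self, parent, m_i, docs):
        q_hat = self.quantized_query(parent)
        centroid_term = dot(m_i, self.q)  # m_i^t q, computed once per posting list
        return [dot(q_hat, d_hat) + corr + centroid_term for d_hat, corr in docs]


query_centroids = [[1.0, 1.0]]
# Each entry: (parent ordinal, document centroid m_i, raw document vectors).
posting_lists = [
    (0, [0.8, 1.1], [[0.7, 1.3], [0.9, 1.0]]),
    (0, [1.2, 0.9], [[1.3, 0.8]]),
]
q = [1.1, 0.6]
scorer = QueryScorer(q, query_centroids)

scores, exact = [], []
for parent, m_i, raw_docs in posting_lists:
    indexed = [index_document(d, m_i, query_centroids[parent]) for d in raw_docs]
    scores += scorer.score_posting_list(parent, m_i, indexed)
    exact += [dot(q, d) for d in raw_docs]

print(scorer.quantize_calls)  # 1
```

Both posting lists share one parent, so the query is quantized once even though two document centroids are visited. The estimated scores differ from the exact dot products only by the error of the stand-in quantizer.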

Euclidean distance corrections for asymmetric quantization

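For Euclidean, we can write $||q-d||^2=||q||^2+||d||^2-2q^td$ and treat the $q^td$ term exactly as above. In fact, there’s a slightly nicer form. Substituting, we have that:

$$||q-d||^2 = ||q||^2+||d||^2-2\left((q-m_q)^t(d-m_d)+m_q^t(d-m_d)+m_d^tq\right)$$

We can rewrite this as follows:

$$||q-d||^2 = ||q-m_d||^2+||d-m_q||^2-||m_d-m_q||^2-2(q-m_q)^t(d-m_d)$$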

The corrective terms are the squared norm of the query vector $q$ minus the document centroid $m_d$, the squared norm of the document vector $d$ minus the query centroid $m_q$, and the squared norm of the difference of the query and document centroids. As before, $||d-m_q||^2-||m_d-m_q||^2$ can be stored as a single float with each document.
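The Euclidean form can also be checked numerically. The sketch below verifies, on random vectors, that $||q-d||^2$ equals $||q-m_d||^2+||d-m_q||^2-||m_d-m_q||^2-2(q-m_q)^t(d-m_d)$ before any quantization is applied; the helper names are just for this example.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def sqdist(a, b):
    diff = sub(a, b)
    return dot(diff, diff)


rng = random.Random(1)
dim = 8
# Random query, document, query centroid, and document centroid.
q, d, m_q, m_d = ([rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4))

exact = sqdist(q, d)
corrections = sqdist(q, m_d) + sqdist(d, m_q) - sqdist(m_d, m_q)
centered = dot(sub(q, m_q), sub(d, m_d))  # the part estimated from quantized vectors
reconstructed = corrections - 2 * centered

assert abs(exact - reconstructed) < 1e-9
```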

What changed in DiskBBQ indexing and scoring

At indexing/merge time, centroids can be clustered into parent groups when the centroid count is large enough. Posting metadata moved from "centroid ordinal + centroid score" to a shape that explicitly carries the query-centroid ordinal and the document-centroid score. That decoupling is what lets scoring read document centering and query centering from different places. For Euclidean, let’s break it down further using our mathematics above:

$||q-d||^2=||q-m_d||^2+||d-m_q||^2-||m_d-m_q||^2-2(q-m_q)^t(d-m_d)$

$||q-m_d||^2$ <- This is the distance from a “query vector $q$ ” to “document centroid $m_d$ ”. We already gather this when we find the nearest centroids during querying. No new work.

$||d-m_q||^2$ <- This is the distance from “document vector $d$ ” to “query centroid $m_q$ ”. However, recalling our original quantization work, this can simply replace a previously stored float value. No new storage is required.

$||m_d-m_q||^2$ <- This is just the distance between query centroid $m_q$ and document centroid $m_d$ . This is just a single extra floating point value per postings list.

The practical change for dot product spaces is even simpler: the only change to the stored correction value is that $m_{j(i)}^t(d_k-m_i)$ is stored instead of $d_k^tm_i$.

These changes don’t introduce new computation costs and marginally reduce storage costs because we no longer quantize queries with document centroids. Those raw centroids don’t need to be present with the posting lists.

One cost we did add is a small cache of quantized query values. This is to account for clustering edge cases. For example, it's possible that query $q$ is very close to query centroid $qc_0 =\{dc_0, dc_1, dc_2\}$ but not quite as close as it is to $qc_1 = \{dc_3, dc_4, dc_5\}$. That said, the actual nearest three document centroids could have the relative order $[dc_0, dc_3, dc_1]$. So, to prevent the query from being quantized twice against the same query centroid, we keep a limited cache of the most recent quantized values for a given query.

Diagram showing a blue circle labeled “q” connected by colored arrows to two dashed oval regions. The green oval contains orange circles labeled dc_0–dc_2 and a green diamond labeled qc_0; and the purple oval contains pink circles labeled dc_3–dc_5 and a purple diamond labeled qc_1. Arrows illustrate relationships between q and the cluster components.

Here’s a visualization of the situation described above. In the typical iteration scenario, we don’t want to risk unnecessarily quantizing the query against the same query centroid multiple times.
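A minimal sketch of such a cache is shown below. The text only says the cache is limited and holds the most recent quantized values, so the least-recently-used eviction policy, the class name `QuantizedQueryCache`, and the string placeholder used in place of a real quantized vector are all assumptions made for illustration.

```python
from collections import OrderedDict


class QuantizedQueryCache:
    """Small LRU cache keyed by query centroid (hypothetical sketch)."""

    def __init__(self, capacity, quantize):
        self.capacity = capacity
        self.quantize = quantize
        self.entries = OrderedDict()
        self.misses = 0  # number of times the query actually had to be quantized

    def get(self, query_centroid):
        if query_centroid in self.entries:
            self.entries.move_to_end(query_centroid)  # mark as most recently used
            return self.entries[query_centroid]
        self.misses += 1
        value = self.quantize(query_centroid)
        self.entries[query_centroid] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return value


parent_of = {"dc_0": "qc_0", "dc_1": "qc_0", "dc_2": "qc_0",
             "dc_3": "qc_1", "dc_4": "qc_1", "dc_5": "qc_1"}
visit_order = ["dc_0", "dc_3", "dc_1"]  # the interleaved order from the example above


def quantizations(capacity):
    # The lambda stands in for the real (expensive) query quantization.
    cache = QuantizedQueryCache(capacity, quantize=lambda qc: f"quantized(q - {qc})")
    for dc in visit_order:
        cache.get(parent_of[dc])
    return cache.misses


print(quantizations(1), quantizations(2))  # 3 2
```

Remembering only the last quantized query means the interleaved visit order forces a third quantization, while a cache of two entries avoids it.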

DiskBBQ asymmetric quantization: performance results

The flame graphs below show a before and after comparison. Before, about 20% of the time was spent quantizing queries when we visited each cluster. After our adjustment, it dropped to about 4%.

Flame graph showing the computational costs using symmetric quantization, with stacked colored blocks labeled for Elasticsearch and JDK vectorization functions. Each block’s width represents relative processing time, and the tooltip highlights quantization activity within Elasticsearch query code.

Flame graph showing the reduction in computational time spent on quantization when asymmetric quantization is introduced, with stacked colored blocks labeled for Elasticsearch and JDK vectorization functions. Each block’s width represents relative processing time, and a tooltip highlights quantization activity within Elasticsearch query code.

Of course, the bulk of the cost is still just scoring the vectors in each cluster. But every little bit helps.

Here’s a better view of the full end-to-end performance and recall. The data set was 1 million DBpedia docs encoded with the GTE-Base model. Here, “sec” indicates the number of clusters per secondary (parent) cluster. Note that symmetric quantization is still impacted by the secondary cluster size as it also impacts the two-tier clustering indexing we do already.

Line chart titled “Latency vs Recall Pareto (sec = 16),” comparing asymmetric and symmetric quantization. The blue asymmetric line shows higher recall at each latency value than the red symmetric line. Axes are labeled “Latency (ms)” and “Recall.”

This shows the benefit of asymmetric quantization: removing that cost improves latency with very little, if any, recall impact.

However, query latency in our current index structure is still dominated by centroid scoring and by scoring the vectors in each cluster. Asymmetric quantization removes a frustratingly expensive part of our scoring overhead, but the overall impact isn’t dramatic given our current structure.

What's next for DiskBBQ quantization

This simple piece of mathematics decouples our query quantization from our document quantization, giving us better storage efficiency and faster queries. This is in Elasticsearch Serverless now and will be in Elastic Stack version 9.4.0.

This now means that query quantization time isn’t a direct concern for future decisions. We can make larger index changes without worrying about the consistent overhead of quantization directly with document centroids.

This was a nerdy one. I hope you survived all the math (and that I copied it all down correctly). It’s always fun to be able to tackle complex problems with simple mathematics, and the results are actually positive in real use cases and data.
