Cutting Elasticsearch DiskBBQ query quantization time by 5x

See how asymmetric quantization cuts DiskBBQ query quantization overhead from about 20% to 4% with little recall impact.

From vector search to powerful REST APIs, Elasticsearch offers developers the most extensive search toolkit. Dive into our sample notebooks in the Elasticsearch Labs repo to try something new. You can also start your free trial or run Elasticsearch locally today.

Asymmetric quantization cuts the time Elasticsearch DiskBBQ spends quantizing queries by 5x. We discovered that too much time was spent quantizing queries. DiskBBQ started off quantizing queries with the same centroids as the indexed documents. However, we can make this cheaper by quantizing the queries with coarser-grained centroids. This improves query latency with very little observed recall impact in our tests.

How DiskBBQ uses two centroid tiers for asymmetric quantization

DiskBBQ now uses two centroid tiers (fine-grained document centroids and coarser query centroids) so queries are quantized once per parent centroid instead of once per document centroid.

The old mental model is "one centroid does everything for a posting list." The new model splits responsibilities:

  • Document centroids (fine-grained): Still used for posting-list structure and document centering.
  • Query centroids (coarser): A parent centroid reused across multiple document centroids.

So instead of quantizing the query independently for every document centroid we visit, we quantize per parent centroid and reuse that work across all of its children. Since we were already using two-tier clustering logic as the index size grew, it was a natural fit. We can reuse the work we already do during querying.

These images are a simple representation of our goal: Quantizing per centroid gives us overhead per centroid. Let’s get rid of it!

The goal is to significantly reduce the number of times we actually need to quantize a given query.

The math behind asymmetric BBQ in Elasticsearch

To center the data prior to computing quantized query and document vectors, qq and dd, we rewrite the dot product qtdq^td as (m+qm)t(m+dm)(m+q-m)^t(m+d-m) and expand. We can perform exactly the same operation but using different centroids for the query vector qq and document vector dd. Specifically,

As for standard Better Binary Quantization (BBQ), we quantize qmqq-m_q and dmdd-m_d in order to estimate the per (document, query) pair component of the dot product. The quantities mdtqm_d ^tq and mqt(dmd)m_q ^t(d-m_d) are scalars so just two extra additions per dot product we compute. For mdtqm_d ^tq, we compute naturally when finding the nearest centroid. For mqt(dmd)m_q ^t(d-m_d), this can be stored with the quantized document vectors, which are just 4 bytes overhead. Below, we’ll discuss how to manage the other term on the fly.

Asymmetric BBQ in DiskBBQ

We cluster the document centroids (using k-means, for example) into kq<kdk_q<k_d clusters, for kqk_q and kdk_d the query and document centroid count, respectively. This means there’s a many-to-one mapping from document centroids to query centroids. We’ll denote the document centroids by their index i[kd]i ∈ [k_d] and define this mapping to the query centroids as j:[kd]j : [k_d][kq][k_q].

Since there’s a unique query centroid for each document centroid, we only need to cache one value for mqt(dmd)m_q^t(d-m_d) per quantized document vector, that is, for each document vector dkd_k in posting list ii, we need to cache mj(i)t(ykmi)m_j(_i)^t(y_k-m_i) with the quantized document vector.

When we come to compute the dot products between a query and the document vectors in a cluster, we look up the quantized query vector corresponding to qmj(i)q-m_j(_i) and we compute mitqm_i^tq once and use it to process the whole posting list. The quantization process is significantly more expensive than computing the dot product, so this is a big net win.

The (qmq)t(dmd)(q-m_q)^t(d-m_d) term is estimated using the usual BBQ machinery, that is, these vectors will be quantized and the dot product value estimated from the quantized vectors. Then we can use (1) to compute the final dot product estimate. Notice that this means we only need to quantize the query at most kqk_q times. Furthermore, we typically visit many centroids from the same parent centroid in a search because they’re close to one another.

Euclidean distance corrections for asymmetric quantization

For Euclidean, we can write qd2=q2+d22qtd||q-d ||^2=||q||^2+||d||^2-2q^td and treat the qtdq^td term exactly as above. In fact, there’s a slightly nicer form. Substituting, we have that:

We can rewrite this as follows:

The corrective terms are the norm of query vector qq minus the document centroid mdm_d, the norm of the document vector dd minus the query centroid mqm_q, and the norm of the difference of query and document centroids. As before, dmq2mdmq2||d-m_q||^2-||m_d-m_q||^2 can be stored as a single float with each document.

What changed in DiskBBQ indexing and scoring

At indexing/merge time, centroids can be clustered into parent groups when centroid count is large enough. Posting metadata moved from "centroid ordinal + centroid score" to a shape that explicitly carries query-centroid ordinal and document-centroid score. That decoupling is what lets scoring read documents and query centering from different places. For Euclidean, let’s break it down further by our mathematics above:

=qmd2+dmq2mdmq2 =||q-m_d||^2+||d-m_q||^2-||m_d-m_q||^2

qmd2||q-m_d||^2 <- This is the distance from a “query vector qq” to “document centroid mdm_d”. We already gather this when we find the nearest centroids during querying. No new work.

dmq2||d-m_q||^2 <- This is the distance from “document vector dd” to “query centroid mqm_q”. However, recalling our original quantization work, this can simply replace a previously stored float value. No new storage is required.

mdmq2||m_d-m_q||^2 <- This is just the distance between query centroid mqm_q and document centroid mdm_d. This is just a single extra floating point value per postings list.

The practical change for dot product spaces is even simpler; the only correction value change is mj(i)t(ykmi)m_j(_i)^t(y_k-m_i) being stored instead of ytmiy^tm_i.

These changes don’t introduce new computation costs and marginally reduce storage costs because we no longer quantize queries with document centroids. Those raw centroids don’t need to be present with the posting lists.

One cost we did add is a small cache of quantized query values. This is to account for clustering edge cases. For example, it's possible that query qq is very close to query centroid qc0={dc0,dc1,dc2}qc_0 =\{dc_0, dc_1, dc_2\} but not quite as close as qc1={dc3,dc4,dc5}qc_1 = \{dc_3, dc_4, dc_5\}. That said, the actual nearest three document centroids could have a relative order: [dc0,dc3,dc1][dc_0, dc_3, dc_1]. So, to prevent the query from being quantized twice, we keep a limited cache of the most recent quantized values for a given query.

Here’s a visualization of the situation described above. In the typical iteration scenario, we don’t want to risk unnecessarily quantizing the query against the same query centroid multiple times.

DiskBBQ asymmetric quantization: performance results

The flame graphs below show a before and after comparison. Before, about 20% of the time was spent quantizing queries when we visited each cluster. After our adjustment, it dropped to about 4%.

Of course, the bulk of the cost is still just scoring the vectors in each cluster. But every little bit helps.

Here’s a better view of the full end-to-end performance and recall. The data set was 1 million DBpedia docs encoded with the GTE-Base model. Here, “sec” indicates the number of clusters per secondary (parent) cluster. Note that symmetric quantization is still impacted by the secondary cluster size as it also impacts the two-tier clustering indexing we do already.

However, the impact on our current index structure is still dominated by centroid scoring and scoring vectors in the cluster. Asymmetric quantization removes a frustratingly expensive part of our scoring overhead, but the impact isn’t dramatic given our current structure.

What's next for DiskBBQ quantization

This simple piece of mathematics decouples our query quantization from our document quantization, giving us better storage efficiency and faster queries. This is in Elasticsearch Serverless now and will be in Elastic Stack version 9.4.0.

This now means that query quantization time isn’t a direct concern for future decisions. We can make larger index changes without worrying about the consistent overhead of quantization directly with document centroids.

This was a nerdy one. I hope you survived all the math (and that I copied it all down correctly). It’s always fun to be able to tackle complex problems with simple mathematics, and the results are actually positive in real use cases and data.

Quão útil foi este conteúdo?

Não útil

Um pouco útil

Muito útil

Conteúdo relacionado

Pronto para criar buscas de última geração?

Uma pesquisa suficientemente avançada não se consegue apenas com o esforço de uma só pessoa. O Elasticsearch é impulsionado por cientistas de dados, especialistas em operações de aprendizado de máquina, engenheiros e muitos outros que são tão apaixonados por buscas quanto você. Vamos nos conectar e trabalhar juntos para construir a experiência de busca mágica que lhe trará os resultados desejados.

Experimente você mesmo(a)