Evaluating scalar quantization in Elasticsearch

Seamlessly connect with leading AI and machine learning platforms. Start a free cloud trial to explore Elastic’s gen AI capabilities or try it on your machine now.

Understanding scalar quantization in Elasticsearch

In 8.13 we introduced scalar quantization to Elasticsearch. By using this feature an end-user can provide float vectors that are internally indexed as byte vectors while retaining the float vectors in the index for optional re-scoring. This means they can reduce their index memory requirement, which is its dominant cost, by a factor of four. At the moment this is an opt-in feature feature, but we believe it constitutes a better trade off than indexing vectors as floats. In 8.14 we will switch to make this our default. However, before doing this we wanted a systematic evaluation of the quality impact.

Experimentation: Evaluating scalar quantization

The multilingual E5-small is a small high quality multilingual passage embedding model that we offer out-of-the-box in Elasticsearch. It has two versions: one cross-platform version which runs on any hardware and one version which is optimized for CPU inference in the Elastic Stack (see here). E5 represents a challenging case for automatic quantization because the vectors it produces have low angular variation and are relatively low dimension compared to state-of-the-art models. If we can achieve little to no damage enabling int8 quantization for this model we can be confident that it will work reliably.

The purpose of this experimentation is to estimate the effects of scalar-quantized kNN search as described here across a broad range of retrieval tasks using this model. More specifically, our aim is to assess the performance degradation (if any) by switching from a full-precision to a quantized index.

Overview of methodology

For the evaluation we relied upon BEIR and for each dataset that we considered we built a full precision and an int8-quantized index using the default hyperparameters (m: 16, ef_construction: 100). First, we experimented with the quantized (weights only) version of the multilingual E5-small model provided by Elastic here with Table 1 presenting a summary of the nDCG@10 scores (k:10, num_candidates:100):

Dataset	Full precision	Int8 quantization	Absolute difference	Relative difference
Arguana	0.37	0.362	-0.008	-2.16%
FiQA-2018	0.309	0.304	-0.005	-1.62%
NFCorpus	0.302	0.297	-0.005	-1.66%
Quora	0.876	0.875	-0.001	-0.11%
SCIDOCS	0.135	0.132	-0.003	-2.22%
Scifact	0.649	0.644	-0.005	-0.77%
TREC-COVID	0.683	0.672	-0.011	-1.61%
		Average	-0.005	-1.05%

Table 1: nDCG@10 scores for the full precision and int8 quantization indices across a selection of BEIR datasets

Overall, it seems that there is a slight relative decrease of 1.05% on average.

Next, we considered repeating the same evaluation process using the unquantized version of multilingual E5-small (see model card here) and Table 2 shows the respective results.

Dataset	Full precision	Int8 quantization	Absolute difference	Relative difference
Arguana	0.384	0.379	-0.005	-1.3%
Climate-FEVER	0.214	0.222	+0.008	+3.74%
FEVER	0.718	0.715	-0.003	-0.42%
FiQA-2018	0.328	0.324	-0.004	-1.22%
NFCorpus	0.31	0.306	-0.004	-1.29%
NQ	0.548	0.537	-0.011	-2.01%
Quora	0.882	0.881	-0.001	-0.11%
Robust04	0.418	0.415	-0.003	-0.72%
SCIDOCS	0.134	0.132	-0.003	-1.49%
Scifact	0.67	0.666	-0.004	-0.6%
TREC-COVID	0.709	0.693	-0.016	-2.26%
		Average	-0.004	-0.83%

Table 2: nDCG@10 scores of multilingual-E5-small on a selection of BEIR datasets

Again, we observe a slight relative decrease in performance equal to 0.83%. Finally, we repeated the exercise for multilingual E5-base and the performance decrease was even smaller (0.59%)

But this is not the whole story: The increased efficiency of the quantized HNSW indices and the fact that the original float vectors are still retained in the index allows us to recover a significant portion of the lost performance through rescoring. More specifically, we can retrieve a larger pool of candidates through approximate kNN search in the quantized index, which is quite fast, and then compute the similarity function on the original float vectors and re-score accordingly.

As a proof of concept, we consider the NQ dataset which exhibited a large performance decrease (2.01%) with multilingual E5-small. By setting k=15, num_candidates=100 and window_size=10 (as we are interested in nDCG@10) we get an improved score of 0.539 recovering about 20% of the performance. If we further increase the num_candidates parameter to 200 then we get a score that matches the performance of the full precision index but with faster response times. The same setup on Arguana leads to an increase from 0.379 to 0.382 and thus limiting the relative performance drop from 1.3% to only 0.52%

Results

The results of our evaluation suggest that scalar quantization can be used to reduce the memory footprint of vector embeddings in Elasticsearch without significant loss in retrieval performance. The performance decrease is more pronounced for smaller vectors (multilingual E5-small produces vectors of size equal to 384 while E5-base gives 768-dimensional embeddings), but this can be mitigated through rescoring. We are confident that scalar quantization will be beneficial for most users and we plan to make it the default in 8.14.

Frequently Asked Questions

What are the benefits of using scalar quantization in Elasticsearch?

The benefits of using scalar quantization in Elasticsearch include reducing the memory footprint of vector embeddings without significantly affecting retrieval performance.

How helpful was this content?

Not helpful

Somewhat helpful

Very helpful

Report an issue

Related Content

How BBQ shrinks Jina v5 embeddings by 29x without losing recall in Elasticsearch

Vector Database Jina AI+1

July 10, 2026

How BBQ shrinks Jina v5 embeddings by 29x without losing recall in Elasticsearch

A hands-on test comparing BBQ and float32 vector indices in Elasticsearch, measuring memory, disk and recall@10 across five languages.

By: Jeffrey Rengifo

The hash() Elasticsearch won't name and the 12 bytes that prove it's Murmur3

Basics Lucene

June 25, 2026

The hash() Elasticsearch won't name and the 12 bytes that prove it's Murmur3

Elasticsearch's routing formula uses MurmurHash3, but the docs never say so. This post names the function, walks through the full shard calculation, and shows you how to reproduce it externally.

By: Sachin Frayne

Elasticsearch DiskBBQ delivers 7x faster vector search than Qdrant on network-attached storage

Vector Database AI+1

June 24, 2026

Elasticsearch DiskBBQ delivers 7x faster vector search than Qdrant on network-attached storage

Elasticsearch DiskBBQ achieves up to 7x higher vector search throughput than Qdrant at comparable recall on network-attached storage. Explore the benchmark methodology and full results.

By: Sachin Frayne

Elasticsearch query logs: One coordinator-level line per query for ES|QL, DSL, SQL, and EQL

Inside Elastic Basics+1

May 12, 2026

Elasticsearch query logs: One coordinator-level line per query for ES|QL, DSL, SQL, and EQL

Easily understand query impact on cluster performance with Elasticsearch query logs. One coordinator-level line records ES|QL, DSL, SQL, and EQL per request and provides full query text, tracing, optional user context, and CCS hints

NH VC

By: Najwa Harif and Valentin Crettaz