Scalar quantization 101

Understand what scalar quantization is, how it works, and its benefits. This guide also covers the math behind quantization, with examples.

Introduction to scalar quantization

Most embedding models output float32 vector values. While this provides the highest fidelity, it is wasteful given the information that is actually important in the vector. Within a given data set, embeddings never require all of the roughly 4 billion distinct values a float32 can represent for each individual dimension. This is especially true for higher dimensional vectors (e.g. 386 dimensions and higher). Quantization allows vectors to be encoded in a lossy manner, reducing fidelity slightly in exchange for huge space savings.

Understanding buckets in scalar quantization

Scalar quantization takes each vector dimension and buckets it into some smaller data type. For the rest of the blog, we will assume quantizing float32 values into int8. Bucketing values accurately isn't as simple as rounding the floating point values to the nearest integer. Many models output vectors whose dimensions fall continuously in the range [-1.0, 1.0]. So, two different vector values 0.123 and 0.321 could both be rounded down to 0. Ultimately, a vector would only use 2 of its 255 available buckets in int8, losing too much information.

Figure 1: Illustration of quantization goals: bucketing continuous values in the range [-1.0, 1.0] into discrete int8 values.

The math behind the numerical transformation isn't too complicated. Since we can calculate the minimum and maximum values for the floating point range, we can use min-max normalization and then linearly shift the values.

$int8 \approx \frac{127}{max - min} \times (float32 - min)$

$float32 \approx \frac{max - min}{127} \times int8 + min$

Figure 2: Equations for transforming between int8 and float32. Note, these are lossy transformations and not exact. In the following examples, we are only using positive values within int8. This aligns with the Lucene implementation.
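To make the transformation concrete, here is a minimal Python sketch of the two equations above. It is an illustration only, not the Lucene implementation; `min_q` and `max_q` are hypothetical names for the min and max bounds, and only the positive int8 buckets 0..127 are used, as noted above.

```python
# Minimal sketch of min-max scalar quantization into the positive int8 range.
import numpy as np

def quantize(vec: np.ndarray, min_q: float, max_q: float) -> np.ndarray:
    """Map float32 values onto the 128 integer buckets 0..127 (lossy)."""
    scale = 127.0 / (max_q - min_q)
    clipped = np.clip(vec, min_q, max_q)      # clamp values outside the bounds
    return np.round((clipped - min_q) * scale).astype(np.int8)

def dequantize(vec_i8: np.ndarray, min_q: float, max_q: float) -> np.ndarray:
    """Approximately recover the float32 values."""
    alpha = (max_q - min_q) / 127.0
    return vec_i8.astype(np.float32) * alpha + min_q

v = np.array([-0.75, 0.123, 0.321, 0.86], dtype=np.float32)
q = quantize(v, -1.0, 1.0)                    # e.g. [ 16  71  84 118]
print(q, dequantize(q, -1.0, 1.0))            # round trip is close, not exact
```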

The role of statistics in scalar quantization

A quantile is a slice of a distribution that contains a certain percentage of the values. So, for example, it may be that 99% of our floating point values are between [-0.75, 0.86] instead of the true minimum and maximum values of [-1.0, 1.0]. Any values less than -0.75 or greater than 0.86 are considered outliers. If you include outliers when quantizing, you will have fewer available buckets for your most common values. And fewer buckets can mean less accuracy and thus greater loss of information.

Figure 3: Illustration of the 99% confidence interval and the individual quantile values. 99% of all values fall within the range [-0.75, 0.86].
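For illustration, here is a minimal sketch of how such quantile bounds could be derived. The `confidence` parameter and the synthetic data are assumptions made for the example; Lucene's actual quantile estimation works differently.

```python
# Sketch: derive quantization bounds from quantiles instead of the true
# min/max, so outliers in the tails don't waste buckets.
import numpy as np

def quantile_bounds(values: np.ndarray, confidence: float = 0.99):
    lower_tail = (1.0 - confidence) / 2.0
    return np.quantile(values, lower_tail), np.quantile(values, 1.0 - lower_tail)

# Synthetic dimension values standing in for a real embedding corpus.
values = np.random.default_rng(0).normal(0.0, 0.3, 100_000).astype(np.float32)
min_q, max_q = quantile_bounds(values)   # keeps the central 99% of values
```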

This is all well and good, but now that we know how to quantize values, how can we actually calculate distances between two quantized vectors? Is it as simple as a regular dot_product?

The role of algebra in scalar quantization

We are still missing one vital piece: how do we calculate the distance between two quantized vectors? While we haven't shied away from math yet in this blog, we are about to do a bunch more. Time to break out your pencils and try to remember polynomials and basic algebra.

The basic requirement for dot_product and cosine similarity is being able to multiply floating point values together and sum up their results. We already know how to transform between float32 and int8 values, so what does multiplication look like with our transformations?

$float32_i \times float32'_i \approx \left(\frac{max - min}{127} \times int8_i + min\right) \times \left(\frac{max - min}{127} \times int8'_i + min\right)$

We can then expand this multiplication; to simplify, we will substitute $\alpha$ for $\frac{max - min}{127}$.

$\alpha^2 \times int8_i \times int8'_i + \alpha \times int8_i \times min + \alpha \times int8'_i \times min + min^2$

What makes this even more interesting is that only one part of this equation requires both vectors at the same time. However, dot_product isn't just two floats being multiplied, but all the floats for each dimension of the vector. With the vector dimension count $dim$ in hand, all of the following can be pre-calculated and stored at index time, or computed once at query time.

$\alpha^2$ is just $\left(\frac{max-min}{127}\right)^2$ and can be stored as a single float value.

$\sum_{i=0}^{dim-1} min \times \alpha \times int8_i$ and $\sum_{i=0}^{dim-1} min \times \alpha \times int8'_i$ can be pre-calculated and stored as a single float value, or calculated once at query time.

$dim \times min^2$ can be pre-calculated and stored as a single float value.

Putting all of this together, the full dot_product becomes:

$\alpha^2 \times dotProduct(int8, int8') + \sum_{i=0}^{dim-1} min \times \alpha \times int8_i + \sum_{i=0}^{dim-1} min \times \alpha \times int8'_i + dim \times min^2$

The only calculation required at comparison time is $dotProduct(int8, int8')$; everything else is combined from the pre-calculated values.
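Here is a minimal sketch tying the algebra together (an illustration of the equations above, not Lucene's actual code): the per-vector correction sums are computed once when a vector is indexed or when the query arrives, and each comparison then only needs the integer dot product.

```python
# Sketch of scoring with the pre-calculated pieces from the equations above.
import numpy as np

def correction(q: np.ndarray, min_q: float, alpha: float) -> float:
    """sum_i min * alpha * int8_i -- computed once per vector and stored."""
    return float(min_q * alpha * q.astype(np.int32).sum())

def quantized_dot_product(q1: np.ndarray, q2: np.ndarray,
                          corr1: float, corr2: float,
                          min_q: float, max_q: float) -> float:
    alpha = (max_q - min_q) / 127.0
    dim = q1.shape[0]
    # The only per-comparison work: an integer dot product.
    int8_dot = float(np.dot(q1.astype(np.int32), q2.astype(np.int32)))
    return alpha ** 2 * int8_dot + corr1 + corr2 + dim * min_q ** 2
```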

Ensuring accuracy in quantization

So, how is this accurate at all? Aren't we losing information by quantizing? Yes, we are, but quantization takes advantage of the fact that we don't need all the information. For learned embedding models, the distributions of the various dimensions usually don't have fat tails. This means they are localized and fairly consistent. Additionally, the error introduced per dimension via quantization is independent, so it tends to cancel out in typical vector operations like dot_product.
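An informal way to see this cancellation is to compare quantized and float32 dot products over synthetic, embedding-like vectors. The random data and quantile choices below are assumptions made purely for illustration.

```python
# Informal check: quantized dot products track float32 dot products closely.
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(0.0, 0.1, size=(1000, 384)).astype(np.float32)
b = rng.normal(0.0, 0.1, size=(1000, 384)).astype(np.float32)

min_q, max_q = np.quantile(a, 0.005), np.quantile(a, 0.995)   # 99% quantile bounds
alpha = (max_q - min_q) / 127.0
qa = np.round((np.clip(a, min_q, max_q) - min_q) / alpha).astype(np.int8)
qb = np.round((np.clip(b, min_q, max_q) - min_q) / alpha).astype(np.int8)

true_dot = np.sum(a * b, axis=1)
est_dot = (alpha ** 2 * np.einsum('ij,ij->i', qa.astype(np.int32), qb.astype(np.int32))
           + min_q * alpha * (qa.sum(axis=1) + qb.sum(axis=1))
           + a.shape[1] * min_q ** 2)
print(np.abs(true_dot - est_dot).mean())   # small relative to the dot products themselves
```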

Conclusion

Whew, that was a ton to cover. But now you have a good grasp of the technical benefits of quantization, the math behind it, and how to calculate distances between vectors while accounting for the linear transformation. Next, look at how we implemented this in Lucene and some of the unique challenges and benefits there.

Ready to try this out on your own? Start a free trial.

Elasticsearch and Lucene offer strong vector database and search capabilities. Dive into our sample notebooks to learn more.
