Scalar quantization 101

Understand what scalar quantization is, how it works, and its benefits. This guide also covers the math behind quantization, with examples.


Introduction to scalar quantization

Most embedding models output float32 vector values. While this provides the highest fidelity, it is wasteful given the information that actually matters in the vector. Within a given data set, embeddings never require all of the roughly 4 billion distinct values a float32 can represent for each individual dimension. This is especially true for higher dimensional vectors (e.g. 386 dimensions and higher). Quantization allows vectors to be encoded in a lossy manner, reducing fidelity slightly in exchange for huge space savings.

Understanding buckets in scalar quantization

Scalar quantization takes each vector dimension and buckets it into some smaller data type. For the rest of this guide, we will assume quantizing float32 values into int8. Bucketing values accurately isn't as simple as rounding the floating point values to the nearest integer. Many models output vectors whose dimensions lie continuously in the range [-1.0, 1.0]. So, two different vector values, 0.123 and 0.321, would both be rounded down to 0. Ultimately, a vector would use only 2 of its 256 available buckets in int8, losing too much information.

Figure 1: Illustration of quantization goals, bucketing continuous values in the range [-1.0, 1.0] into discrete int8 values.
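The collapse described above is easy to demonstrate. This is an illustrative sketch (the sample values are made up for the example): naively rounding values in [-1.0, 1.0] to the nearest integer leaves only three usable buckets.

```python
import numpy as np

# Toy example: dimensions of a typical embedding lie in [-1.0, 1.0].
values = np.array([-0.82, -0.31, 0.0, 0.123, 0.321, 0.97], dtype=np.float32)

# Naive rounding to the nearest integer collapses nearly everything
# onto -1, 0, or 1, so 0.123 and 0.321 become indistinguishable.
naive = np.round(values).astype(np.int8)
print(naive.tolist())            # [-1, 0, 0, 0, 0, 1]
print(len(np.unique(naive)))     # 3 -- only 3 of the 256 int8 buckets are used
```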

The math behind the numerical transformation isn't too complicated. Since we can calculate the minimum and maximum values for the floating point range, we can use min-max normalization and then linearly shift the values.

int8 \approx \frac{127}{max - min} \times (float32 - min)
float32 \approx \frac{max - min}{127} \times int8 + min

Figure 2: Equations for transforming between int8 and float32. Note, these are lossy transformations, not exact. In the following examples, we are only using positive values within int8. This aligns with the Lucene implementation.
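The two transformations in Figure 2 can be sketched directly in code. This is a minimal illustration, not the Lucene implementation; it assumes the positive int8 range [0, 127] and known min/max bounds:

```python
import numpy as np

def quantize(v: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Min-max quantization of float32 values into the positive int8 range [0, 127]."""
    scaled = 127.0 / (hi - lo) * (v - lo)
    return np.clip(np.round(scaled), 0, 127).astype(np.int8)

def dequantize(q: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Lossy inverse transform back toward the original float32 values."""
    return (hi - lo) / 127.0 * q.astype(np.float32) + lo

v = np.array([-0.75, 0.0, 0.123, 0.321, 0.86], dtype=np.float32)
q = quantize(v, -1.0, 1.0)
print(q.tolist())                 # [16, 64, 71, 84, 118]
print(dequantize(q, -1.0, 1.0))   # close to, but not exactly, the original values
```

Note that 0.123 and 0.321 now land in distinct buckets (71 and 84), unlike naive rounding.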

The role of statistics in scalar quantization

A quantile is a slice of a distribution that contains a certain percentage of the values. So, for example, it may be that 99% of our floating point values fall between [-0.75, 0.86] instead of the true minimum and maximum values of [-1.0, 1.0]. Any values less than -0.75 or greater than 0.86 are considered outliers. If you include outliers when attempting to quantize results, you will have fewer available buckets for your most common values. And fewer buckets can mean less accuracy and thus greater loss of information.

Figure 3: Illustration of the 99% confidence interval and the individual quantile values. 99% of all values fall within the range [-0.75, 0.86].
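A quantile-based bound like the one in Figure 3 can be computed with a simple percentile calculation. This sketch uses simulated dimension values (the distribution parameters are invented for illustration) and keeps the central 99% of the mass:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated values for one embedding dimension: roughly normal, with rare extremes.
values = rng.normal(loc=0.05, scale=0.2, size=100_000).astype(np.float32)

# Use the central 99% of the distribution instead of the raw min/max,
# so outliers don't waste quantization buckets.
lo, hi = np.quantile(values, [0.005, 0.995])
clipped = np.clip(values, lo, hi)
print(lo, hi)  # much tighter than values.min() / values.max()
```

The clipped values are then quantized against [lo, hi] rather than the full observed range.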

This is all well and good, but now that we know how to quantize values, how can we actually calculate distances between two quantized vectors? Is it as simple as a regular dot_product?

The role of algebra in scalar quantization

We are still missing one vital piece: how do we calculate the distance between two quantized vectors? While we haven't shied away from math so far, we are about to do a bunch more. Time to break out your pencils and try to remember polynomials and basic algebra.

The basic requirement for dot_product and cosine similarity is being able to multiply floating point values together and sum up their results. We already know how to transform between float32 and int8 values, so what does multiplication look like with our transformations?

float32_i \times float32'_i \approx \left(\frac{max - min}{127} \times int8_i + min\right) \times \left(\frac{max - min}{127} \times int8'_i + min\right)

We can then expand this multiplication and, to simplify, substitute $\alpha$ for $\frac{max - min}{127}$.

\alpha^2 \times int8_i \times int8'_i + \alpha \times int8_i \times min + \alpha \times int8'_i \times min + min^2

What makes this even more interesting is that only one part of this equation requires both vectors at the same time. However, dot_product isn't just two floats being multiplied, but all the floats for each dimension of the vector. With the vector dimension count $dim$ in hand, all of the following can be pre-calculated at storage time and query time.

$\alpha^2$ is just $\left(\frac{max - min}{127}\right)^2$ and can be stored as a single float value.

$\sum_{i=0}^{dim-1} min \times \alpha \times int8_i$ and $\sum_{i=0}^{dim-1} min \times \alpha \times int8'_i$ can be pre-calculated and stored as a single float value, or calculated once at query time.

$dim \times min^2$ can be pre-calculated and stored as a single float value.

Putting it all together:

\alpha^2 \times dotProduct(int8, int8') + \sum_{i=0}^{dim-1} min \times \alpha \times int8_i + \sum_{i=0}^{dim-1} min \times \alpha \times int8'_i + dim \times min^2

The only calculation required at query time is $dotProduct(int8, int8')$, with the pre-calculated values combined into the result.
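The full pipeline, integer dot product plus pre-computed corrections, can be verified numerically. This is an illustrative sketch with made-up random vectors and bounds of [-1.0, 1.0] (so min = -1.0), not production code:

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 384
a = rng.uniform(-1.0, 1.0, dim).astype(np.float32)
b = rng.uniform(-1.0, 1.0, dim).astype(np.float32)

lo, hi = -1.0, 1.0            # min and max of the value range
alpha = (hi - lo) / 127.0

def quantize(v):
    # int8 ~ 127 / (max - min) * (float32 - min), clipped to [0, 127]
    return np.clip(np.round((v - lo) / alpha), 0, 127).astype(np.int8)

qa, qb = quantize(a), quantize(b)

# Correction terms, pre-computable and stored alongside each vector:
corr_a = lo * alpha * int(qa.astype(np.int32).sum())
corr_b = lo * alpha * int(qb.astype(np.int32).sum())
offset = dim * lo * lo

# Integer dot product plus corrections approximates the float dot product.
est = alpha**2 * float(np.dot(qa.astype(np.int32), qb.astype(np.int32))) \
      + corr_a + corr_b + offset
true = float(np.dot(a, b))
print(true, est)  # the two values agree closely
```

The widening cast to int32 before the dot product avoids int8 overflow; real implementations do the equivalent with wider accumulators.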

Ensuring accuracy in quantization

So, how is this accurate at all? Aren't we losing information by quantizing? Yes, we are, but quantization takes advantage of the fact that we don't need all the information. For learned embedding models, the distributions of the various dimensions usually don't have fat tails. This means they are localized and fairly consistent. Additionally, the error introduced per dimension by quantization is independent, so the errors tend to cancel out in typical vector operations like dot_product.

Conclusion

Whew, that was a ton to cover. But now you have a good grasp of the technical benefits of quantization, the math behind it, and how you can calculate the distances between vectors while accounting for the linear transformation. Look next at how we implemented this in Lucene and some of the unique challenges and benefits available there.

Frequently asked questions

What are the benefits of scalar quantization?

Quantization allows for vectors to be encoded in a lossy manner, thus reducing fidelity slightly with huge space savings.

How does scalar quantization work?

Scalar quantization takes each vector dimension and buckets them into some smaller data type.
