The promise of truly semantic search for retrieval and retrieval augmented generation appeals a lot to users, large and small. So it's no surprise that vector search has been a major theme for Apache Lucene in 2023. More specifically, many interesting features and optimizations have been added across several releases:
Indexing is about organizing data in such a way that it can be efficiently accessed at search time, which involves a lot of sorting in practice. And when it comes to sorting, radix sort is king (when applicable!). Lucene had already been using radix sort in a few performance-sensitive places for a while, such as sorting the terms dictionary of segments. But usage of radix sort further increased in 2023, and it began being used to optimize:
- `TermInSetQuery` construction

We already covered some performance improvements for vector search, but keyword search saw major speedups as well in 2023. Check out this blog, which covers major speedups that occurred across the 9.7, 9.8 and 9.9 releases. These improvements apply both to traditional keyword search and to sparse vector search, such as the queries produced by learned sparse retrieval models.
As a Java library, Lucene relies a lot on the Java virtual machine (JVM), and once in a while new features get released that are especially interesting for Lucene. Two features in particular have been integrated in such a way that if you run on a modern enough version of the JVM, then they will be used automatically:
- The `MemorySegment` API is an improved API for mmapping files into memory.

It's hard to draw a line, but I'll stop here as I struggle to find common themes for the other good changes I'm looking at that happened in 2023. :) Stay tuned for a great 2024 in Apache Lucene land!
What is especially interesting here is that these optimizations do not only benefit some very specific cases: they translate into actual speedups in Lucene's nightly benchmarks, which aim to track the performance of queries that are representative of the real world. Just hover over the annotations to see where a speedup (or sometimes a slowdown!) is coming from. By the way, special thanks to Mike McCandless for maintaining Lucene's nightly benchmarks on his own time and hardware for almost 13 years now!
Here are some speedups that nightly benchmarks observed between Lucene 9.6 (May 2023) and Lucene 9.9 (December 2023):
In case you are curious about these changes, here are resources that describe some of the optimizations that we applied:
Lucene 9.9 was just released and is expected to be integrated into Elasticsearch 8.12, which should get released soon. Stay tuned!
For example, if a query matches `the` or `fox`, and `the` has a maximum contribution of 0.2 to the score while the minimum competitive score is 0.5, then there is no point in evaluating hits that only contain `the` anymore, as they have no chance of making it to the top hits.
However, WAND and MAXSCORE come with different performance characteristics: WAND typically evaluates fewer hits than MAXSCORE, but with a higher per-hit overhead. This makes MAXSCORE generally perform better with high k values or many terms, when skipping hits is hard, while WAND generally performs better otherwise.
While Lucene first implemented a variant of WAND called block-max WAND, it later got attracted to the lower overhead of MAXSCORE and started using block-max MAXSCORE for top-level disjunctions in July 2022 (annotation EN in Lucene's nightly benchmarks). The MAXSCORE algorithm is rather simple: it sorts terms by increasing maximum impact score and partitions them into two groups, essential terms and non-essential terms. Non-essential terms are terms with low maximum impact scores whose sum of maximum scores is less than the minimum competitive score; essential terms are all other terms. Essential terms are used to find candidate matches, while non-essential terms are only used to compute the score of a candidate.
Let's take an example: you are searching for `the quick fox`, and the maximum impact scores of each term are respectively 0.2 for `the`, 0.5 for `quick`, and 1 for `fox`. As you start collecting hits, the minimum competitive score is 0, so all terms are essential and no terms are non-essential. Then at some point the minimum competitive score reaches e.g. 0.3, meaning that a hit that only contains `the` has no chance of making it to the top-k hits. `the` then moves from the set of essential terms to the set of non-essential terms, and the query effectively runs as `(the) +(quick fox)`. The `+` sign here is used to express that a query clause is required, as in Lucene's classic query parser. Said another way, from that point on, the query will only match hits that contain `quick` or `fox`, and will only use `the` to compute the final score. The table below summarizes the cases that MAXSCORE considers:
| Minimum competitive score interval | Query runs as |
|---|---|
| [0, 0.2] | `+(the quick fox)` |
| (0.2, 0.7] | `(the) +(quick fox)` |
| (0.7, 1.7] | `(the quick) +(fox)` |
| (1.7, +Infty) | No more matches |
The last case happens when the minimum competitive score is greater than the sum of all maximum impact scores across all terms. It typically never happens with regular MAXSCORE, but may happen on some blocks with block-max MAXSCORE.
Something that WAND does better than MAXSCORE is to progressively evaluate queries less as a disjunction and more as a conjunction as the minimum competitive score increases, which yields more skipping. This raised the question: can MAXSCORE be improved to also intersect terms? The answer is yes: for instance, if the minimum competitive score is 1.3, then a hit cannot be competitive if it doesn't match both `quick` and `fox`. So we modified our block-max MAXSCORE implementation to consider the following cases instead:
| Minimum competitive score interval | Query runs as |
|---|---|
| [0, 0.2] | `+(the quick fox)` |
| (0.2, 0.7] | `(the) +(quick fox)` |
| (0.7, 1.2] | `(the quick) +(fox)` |
| (1.2, 1.5] | `(the) +quick +fox` |
| (1.5, 1.7] | `+the +quick +fox` |
| (1.7, +Infty) | No more matches |
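To make the tables above concrete, here is a small illustrative Java sketch (hypothetical code, not Lucene's actual implementation) that classifies each term as non-essential, essential, or required from its maximum impact score and the current minimum competitive score:

```java
import java.util.*;

// Illustrative sketch of how block-max MAXSCORE-style term classification
// could work; not Lucene's actual code.
class MaxScoreClassifier {
  // Returns a map from term to "NON_ESSENTIAL", "ESSENTIAL" or "REQUIRED".
  // An empty map means no hit can be competitive anymore.
  static Map<String, String> classify(Map<String, Double> maxScores, double minCompetitiveScore) {
    double total = maxScores.values().stream().mapToDouble(Double::doubleValue).sum();
    Map<String, String> result = new LinkedHashMap<>();
    if (minCompetitiveScore > total) {
      return result; // exceeds the sum of all maximum impact scores
    }
    // Sort terms by increasing maximum impact score, as MAXSCORE does.
    List<Map.Entry<String, Double>> sorted = new ArrayList<>(maxScores.entrySet());
    sorted.sort(Map.Entry.comparingByValue());
    double prefixSum = 0;
    for (Map.Entry<String, Double> e : sorted) {
      prefixSum += e.getValue();
      if (total - e.getValue() < minCompetitiveScore) {
        // Without this term, even a perfect score on the others cannot compete.
        result.put(e.getKey(), "REQUIRED");
      } else if (prefixSum < minCompetitiveScore) {
        // Low-impact terms whose summed max scores stay below the bar: scoring only.
        result.put(e.getKey(), "NON_ESSENTIAL");
      } else {
        result.put(e.getKey(), "ESSENTIAL");
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Double> scores = new LinkedHashMap<>();
    scores.put("the", 0.2); scores.put("quick", 0.5); scores.put("fox", 1.0);
    // Minimum competitive score 1.3 falls in the (1.2, 1.5] row: (the) +quick +fox,
    // i.e. `the` is NON_ESSENTIAL while `quick` and `fox` are REQUIRED.
    System.out.println(classify(scores, 1.3));
  }
}
```

Running the sketch with the example scores reproduces the table rows: at 0.5 the query runs as `(the) +(quick fox)`, and at 1.3 as `(the) +quick +fox`.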
Now the interesting question is whether these new cases are likely to occur in practice. The answer depends on how good your score upper bounds are, your actual k value, whether terms actually have matches in common, etc. But the optimization seems to kick in especially often on queries that either have two terms, or that combine two high-scoring terms with zero or more low-scoring terms (e.g. stop words), such as the query we looked at in the above example. This is expected to cover a sizable number of queries in many query logs.
Implementing this optimization yielded a noticeable improvement on Lucene's nightly benchmarks (annotation FU), see OrHighHigh (11% speedup) and OrHighMed (6% speedup). It was released in Lucene 9.9 and should be included in Elasticsearch 8.12. We hope you'll enjoy the speedups!
Broder, A. Z., Carmel, D., Herscovici, M., Soffer, A., & Zien, J. (2003, November). Efficient query evaluation using a two-level retrieval process. In Proceedings of the twelfth international conference on Information and knowledge management (pp. 426-434). ↩
Turtle, H., & Flood, J. (1995). Query evaluation: strategies and optimizations. Information Processing & Management, 31(6), 831-850. ↩
Multiply-add is a common operation that computes the product of two numbers and adds that product to a third number. These types of operations are performed over and over during vector similarity computations:
a * b + c
Fused multiply-add (FMA) is a single operation that performs both the multiply and the add in one step: the multiplication and addition are said to be "fused" together. FMA is typically faster than a separate multiplication and addition because most CPUs implement it as a single instruction.

FMA also produces more accurate results. Separate multiply and add operations on floating-point numbers involve two rounding steps, one for the multiplication and one for the addition, since they are separate instructions that must each produce a representable result. That is effectively:
round(round(a * b) + c)
FMA, on the other hand, has a single rounding, which applies to the combined result of the multiplication and addition. That is effectively:
round(a * b + c)
Within the FMA instruction, `a * b` produces an infinite-precision intermediate result that is added with `c` before the final result is rounded. This eliminates one rounding step compared to separate multiply and add operations, resulting in more accuracy.
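The effect of that extra rounding step is easy to observe in Java with `Math.fma`. In this small standalone example (illustrative, not from Lucene), the exact product `a * b` needs 55 bits of precision and cannot be represented as a `double`, so the separate multiply rounds away the low bit that FMA preserves:

```java
class FmaDemo {
  public static void main(String[] args) {
    double a = 0x1p27 + 1;            // 2^27 + 1, exactly representable
    double b = 0x1p27 + 1;
    // a * b = 2^54 + 2^28 + 1 exactly; that value needs 55 significant bits,
    // so the standalone multiply rounds it to 2^54 + 2^28.
    double c = -(0x1p54 + 0x1p28);    // exactly representable

    double separate = a * b + c;      // round(round(a * b) + c) == 0.0
    double fused = Math.fma(a, b, c); // round(a * b + c) == 1.0

    System.out.println(separate);     // 0.0
    System.out.println(fused);        // 1.0
  }
}
```

The separate version returns 0.0 while FMA returns exactly 1.0: the low bit of the product survives the single fused rounding.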
So what has actually changed? In Lucene we have replaced the separate multiply and add operations with a single FMA operation. The scalar variants now use `Math::fma`, while the Panama vectorized variants use `FloatVector::fma`.
If we look at the disassembly we can see the effect that this change has had. Previously we saw this kind of code pattern for the Panama vectorized implementation of dot product.
```
vmovdqu32 zmm0,ZMMWORD PTR [rcx+r10*4+0x10]
vmulps zmm0,zmm0,ZMMWORD PTR [rdx+r10*4+0x10]
vaddps zmm4,zmm4,zmm0
```
The `vmovdqu32` instruction loads 512 bits of packed doubleword values from a memory location into the `zmm0` register. The `vmulps` instruction then multiplies the values in `zmm0` with the corresponding packed values from a memory location, and stores the result in `zmm0`. Finally, the `vaddps` instruction adds the 16 packed single-precision floating-point values in `zmm0` to the corresponding values in `zmm4`, and stores the result in `zmm4`.
With the change to use `FloatVector::fma`, we see the following pattern:
```
vmovdqu32 zmm0,ZMMWORD PTR [rdx+r11*4+0xd0]
vfmadd231ps zmm4,zmm0,ZMMWORD PTR [rcx+r11*4+0xd0]
```
Again, the first instruction is similar to the previous example: it loads 512 bits of packed doubleword values from a memory location into the `zmm0` register. The `vfmadd231ps` instruction (this is the FMA instruction) multiplies the values in `zmm0` with the corresponding packed values from a memory location, adds that intermediate result to the values in `zmm4`, performs rounding, and stores the resulting 16 packed single-precision floating-point values in `zmm4`.
The `vfmadd231ps` instruction is doing quite a lot! It’s a clear signal of intent to the CPU about the nature of the computations that the code is running. Given this, the CPU can make smarter decisions about how they are carried out, which typically results in improved performance (and accuracy, as previously described).
In general, the use of FMA typically results in improved performance. But as always, you need to benchmark! Thankfully, Lucene deals with quite a bit of complexity when determining whether to use FMA, so you don’t have to: it checks whether the CPU even supports FMA, whether FMA is enabled in the Java Virtual Machine, and it only enables FMA on architectures that have proven to be faster than separate multiply and add operations. As you can probably tell, this heuristic is not perfect, but it goes a long way toward making the out-of-the-box experience good. And while accuracy is improved with FMA, we see no negative effect on pre-existing similarity computations when FMA is not enabled.
Along with the use of FMA, the suite of vector similarity functions got some (more) love. Dot product, square distance, and cosine have all been updated, in both their scalar and Panama vectorized variants. Optimizations were applied based on inspection of the disassembly and on empirical experiments, bringing improvements that help fill the pipeline and keep the CPU busy; mostly through more consistent and targeted loop unrolling, as well as the removal of data dependencies within loops.
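To illustrate the kind of transformation involved (a sketch under simplified assumptions, with hypothetical method names, not Lucene's actual code), a scalar dot product can be unrolled with multiple independent accumulators so that one fused multiply-add does not have to wait on the result of the previous one:

```java
class DotProduct {
  // Naive version: every iteration depends on the previous value of acc,
  // forming one long serial chain of FMAs.
  static float dotNaive(float[] a, float[] b) {
    float acc = 0;
    for (int i = 0; i < a.length; i++) {
      acc = Math.fma(a[i], b[i], acc);
    }
    return acc;
  }

  // Unrolled version with four independent accumulators: the four FMA chains
  // have no data dependencies on each other, so they can overlap in the
  // CPU pipeline.
  static float dotUnrolled(float[] a, float[] b) {
    float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    int i = 0;
    for (; i + 3 < a.length; i += 4) {
      acc0 = Math.fma(a[i], b[i], acc0);
      acc1 = Math.fma(a[i + 1], b[i + 1], acc1);
      acc2 = Math.fma(a[i + 2], b[i + 2], acc2);
      acc3 = Math.fma(a[i + 3], b[i + 3], acc3);
    }
    for (; i < a.length; i++) { // tail that doesn't fill a group of four
      acc0 = Math.fma(a[i], b[i], acc0);
    }
    return acc0 + acc1 + acc2 + acc3;
  }
}
```

Both variants compute the same dot product; the unrolled one simply gives the CPU more independent work per iteration.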
It’s not straightforward to put concrete performance numbers on this change, since the effect spans multiple similarity functions and variants, but we see positive throughput improvements, from single-digit percentages in floating-point dot product to higher double-digit percentage improvements in cosine. The byte-based similarity functions show similar throughput improvements.
In Lucene 9.7.0, we added the ability to enable an alternative faster implementation of the low-level primitive operations used by Vector Search through SIMD instructions. In the upcoming Lucene 9.9.0 we built upon this to leverage faster FMA instructions, as well as to apply optimization techniques more consistently across all the similarity functions. Previous versions of Elasticsearch are already benefiting from SIMD, and the upcoming Elasticsearch 8.12.0 will have the FMA improvements.
Finally, I'd like to call out Lucene PMC member Robert Muir for continuing to make improvements in this area, and for the enjoyable and productive collaboration.
While HNSW is a powerful and flexible way to store and search vectors, it does require a significant amount of memory to run quickly. For example, querying 1MM float32 vectors of 768 dimensions requires roughly $$1{,}000{,}000 \times 4 \times (768 + 12) = 3{,}120{,}000{,}000$$ bytes, or about 3GB of RAM. Once you start searching a significant number of vectors, this gets expensive. One way to use around 75% less memory is through byte quantization. Lucene, and consequently Elasticsearch, has supported indexing `byte` vectors for some time, but building these vectors has been the user's responsibility. This is about to change, as we have introduced `int8` scalar quantization in Lucene.
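Under the same assumptions as above (1MM vectors of 768 dimensions, with the same per-vector overhead figure used in the text), a quick back-of-the-envelope comparison illustrates the savings; this is a sketch, not how Lucene accounts for memory internally:

```java
class MemoryEstimate {
  public static void main(String[] args) {
    long numVectors = 1_000_000;
    int dims = 768;
    // float32 vectors: (768 + 12) float-sized slots of 4 bytes each per
    // vector, the figure used in the text above.
    long rawBytes = numVectors * 4 * (dims + 12);
    // int8-quantized vectors: 1 byte per dimension plus a 4-byte corrective
    // multiplier per vector (graph overhead omitted for simplicity).
    long quantizedBytes = numVectors * (dims + 4);
    System.out.println(rawBytes);       // 3120000000
    System.out.println(quantizedBytes); // 772000000, roughly 75% less
  }
}
```

The quantized footprint is about a quarter of the raw one, which is where the "around 75% less memory" figure comes from.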
All quantization techniques are lossy transformations of the raw data: some information is lost for the sake of space. For an in-depth explanation of scalar quantization, see: Scalar Quantization 101. At a high level, scalar quantization is a lossy compression technique where some simple math gives significant space savings with very little impact on recall.
Those used to working with Elasticsearch may be familiar with these concepts already, but here is a quick overview of the distribution of documents for search.
Each Elasticsearch index is composed of multiple shards. While each shard can only be assigned to a single node, having multiple shards per index gives you compute parallelism across nodes.
Each shard is a single Lucene index. A Lucene index consists of multiple read-only segments. During indexing, documents are buffered and periodically flushed into a new read-only segment. When certain conditions are met, these segments can be merged in the background into a larger segment. All of this is configurable and has its own set of complexities, but when we talk about segments and merging, we are talking about read-only Lucene segments and the automatic, periodic merging of these segments. Here is a deeper dive into segment merging and design decisions.
Every segment in Lucene stores the following: the individual vectors, the HNSW graph indices, the quantized vectors, and the calculated quantiles. For brevity's sake, we will focus on how Lucene stores quantized and raw vectors. For every segment, we keep track of the raw vectors in the `vec` file, the quantized vectors along with a single corrective multiplier float per vector in the `veq` file, and the metadata around the quantization in the `vemq` file.
<cite> Figure 1: Simplified layout of the raw vector storage (`vec`) file. It takes up $$dimension \times 4 \times numVectors$$ bytes of disk space, since `float` values are 4 bytes each. Because we are quantizing, these will not get loaded during HNSW search. They are only used if specifically requested (e.g. a brute-force secondary rescore), or for re-quantization during segment merge. </cite>
<cite> Figure 2: Simplified layout of the `.veq` file. It takes up $$(dimension + 4) \times numVectors$$ bytes and will be loaded into memory during search. The +4 bytes account for the corrective multiplier float, used to adjust scoring for better accuracy and recall. </cite>
<cite> Figure 3: The simplified layout of the metadata file. Here is where we keep track of quantization and vector configuration along with the calculated quantiles for this segment. </cite>
So, for each segment, we store not only the quantized vectors, but the quantiles used in making these quantized vectors and the original raw vectors. But, why do we keep the raw vectors around at all?
Since Lucene periodically flushes to read-only segments, each segment only has a partial view of all your data. This means the calculated quantiles only directly apply to that sample of your entire data set. This isn't a big deal if the sample adequately represents your entire corpus, but Lucene allows you to sort your index in various ways, so you could be indexing data in a way that biases per-segment quantile calculations. Also, you can flush the data whenever you like! Your sample set could be tiny, even just one vector. Yet another wrench is that you control when merges occur: while Elasticsearch has configured defaults and periodic merging, you can ask for a merge whenever you like via the _force_merge API. So how do we still allow all this flexibility while providing quantization that gives good recall?
Lucene's vector quantization will automatically adjust over time. Because Lucene is designed with a read-only segment architecture, we have guarantees that the data in each segment hasn't changed and clear demarcations in the code for when things can be updated. This means during segment merge we can adjust quantiles as necessary and possibly re-quantize vectors.
<cite> Figure 4: Three example segments with different quantiles. </cite>
But isn't re-quantization expensive? It does have some overhead, but Lucene handles quantiles intelligently and only fully re-quantizes when necessary. Let's use the segments in Figure 4 as an example: give segments $$A$$ and $$B$$ 1,000 documents each and segment $$C$$ only 100 documents. Lucene takes a weighted average of the quantiles, and if the resulting merged quantiles are near enough to a segment's original quantiles, that segment does not have to be re-quantized; the newly merged quantiles are simply used.
<cite> Figure 5: Example of merged quantiles where segments $$A$$ and $$B$$ have 1,000 documents each and $$C$$ only has 100. </cite>
In the situation visualized in Figure 5, we can see that the resulting merged quantiles are very similar to the original quantiles in $$A$$ and $$B$$, so they do not justify re-quantizing those segments' vectors. Segment $$C$$ deviates too much, so the vectors in $$C$$ would get re-quantized with the newly merged quantile values.
There are indeed extreme cases where the merged quantiles differ dramatically from any of the original quantiles. In this case, we will take a sample from each segment and fully re-calculate the quantiles.
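The weighted-average logic can be sketched as follows. This is hypothetical code: Lucene's actual implementation, and in particular its "near enough" threshold, differ, so treat `epsilon` here as an illustrative parameter:

```java
class QuantileMerge {
  // Merge per-segment [lower, upper] quantiles, weighting each segment by
  // its document count.
  static double[] mergeQuantiles(double[][] quantiles, int[] docCounts) {
    long totalDocs = 0;
    for (int c : docCounts) totalDocs += c;
    double lower = 0, upper = 0;
    for (int i = 0; i < quantiles.length; i++) {
      double weight = (double) docCounts[i] / totalDocs;
      lower += weight * quantiles[i][0];
      upper += weight * quantiles[i][1];
    }
    return new double[] {lower, upper};
  }

  // Re-quantize a segment only if its own quantiles deviate too much from
  // the merged ones; epsilon is an illustrative threshold, not Lucene's.
  static boolean needsRequantization(double[] segment, double[] merged, double epsilon) {
    return Math.abs(segment[0] - merged[0]) > epsilon
        || Math.abs(segment[1] - merged[1]) > epsilon;
  }
}
```

With quantile ranges like those of Figure 5, the two large segments stay within the threshold of the merged quantiles and keep their existing quantized vectors, while the small outlying segment triggers re-quantization.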
So, is it fast, and does it still provide good recall? The following numbers were gathered by running the experiment on a `c3-standard-8` GCP instance. To ensure a fair comparison with `float32`, we used an instance large enough to hold the raw vectors in memory. We indexed 400,000 Cohere Wiki vectors using maximum inner product.
<cite> Figure 6: Recall@10 for quantized vectors vs. raw vectors. The search performance of quantized vectors is significantly faster than raw, and recall is quickly recoverable by gathering just 5 more vectors, visible as `quantized@15`. </cite>
Figure 6 tells the story. While there is a recall difference, as is to be expected, it is not significant, and it disappears by gathering just 5 more vectors. All this with 2× faster segment merges and 1/4 of the memory of `float32` vectors.
Lucene provides a unique solution to a difficult problem. There is no “training” or “optimization” step required for quantization. In Lucene, it will just work. There is no worry about having to “re-train” your vector index if your data shifts. Lucene will detect significant changes and take care of this automatically over the lifetime of your data. Look forward to when we bring this capability into Elasticsearch!
Scalar quantization takes each vector dimension and buckets it into some smaller data type. For the rest of the blog, we will assume quantizing `float32` values into `int8`. Bucketing values accurately isn't as simple as rounding the floating-point values to the nearest integer. Many models output vectors whose dimensions lie continuously in the range $[-1.0, 1.0]$. So, two different vector values, 0.123 and 0.321, could both be rounded down to 0. Ultimately, a vector would only use 2 of the 255 available buckets in `int8`, losing too much information.
<cite> Figure 1: Illustration of quantization goals, bucketing continuous values from $-1.0$ to $1.0$ into discrete `int8` values. </cite>

The math behind the numerical transformation isn't too complicated. Since we can calculate the minimum and maximum values of the floating-point range, we can use min-max normalization and then linearly shift the values:
$$int8 \approx \frac{127}{max - min} \times (float32 - min)$$

$$float32 \approx \frac{max - min}{127} \times int8 + min$$
<cite>
Figure 2: Equations for transforming between `int8` and `float32`. Note that these are lossy transformations, not exact ones. In the following examples, we only use the positive values within `int8`, which aligns with the Lucene implementation.
</cite>
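The two equations translate directly into code. Here is a small, illustrative Java version (a sketch, not Lucene's implementation) that quantizes a value into the positive `int8` range [0, 127] and back:

```java
class ScalarQuantizer {
  // Quantize a float into [0, 127], given the chosen quantile range [min, max].
  static byte quantize(float value, float min, float max) {
    float clamped = Math.max(min, Math.min(max, value)); // clip outliers
    return (byte) Math.round(127f / (max - min) * (clamped - min));
  }

  // Lossy inverse transformation back to float.
  static float dequantize(byte q, float min, float max) {
    return (max - min) / 127f * q + min;
  }
}
```

A round trip through `quantize` and `dequantize` does not reproduce the input exactly, but for values spread across $[-1.0, 1.0]$ the error stays below one bucket width (about 0.016 here), which is the "lossy" part of the transformation.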
A quantile is a slice of a distribution that contains a certain percentage of the values. For example, it may be that 99% of our floating-point values fall between $[-0.75, 0.86]$ rather than the true minimum and maximum of $[-1.0, 1.0]$. Any values less than -0.75 or greater than 0.86 are considered outliers. If you include outliers when quantizing, you have fewer available buckets for your most common values, and fewer buckets can mean less accuracy and thus greater loss of information.
<cite> Figure 3: Illustration of the 99% confidence interval and the individual quantile values. 99% of all values fall within the range $[-0.75, 0.86]$. </cite>
This is all well and good, but now that we know how to quantize values, how can we actually calculate the distance between two quantized vectors? Is it as simple as a regular dot_product?

While we haven't shied away from math yet in this blog, we are about to do a bunch more. Time to break out your pencils and try to remember polynomials and basic algebra.
The basic requirement for dot_product and cosine similarity is being able to multiply floating point values together and sum up their results. We already know how to transform between $float32$ and $int8$ values, so what does multiplication look like with our transformations?
$$float32_i \times float32'_i \approx (\frac{max - min}{127} \times int8_i + min) \times (\frac{max - min}{127} \times int8'_i + min)$$
We can then expand this multiplication and, to simplify, substitute $$\alpha$$ for $$\frac{max - min}{127}$$:
$$\alpha^2 \times int8_i \times int8'_i + \alpha \times int8_i \times min + \alpha \times int8'_i \times min + min^2$$
What makes this even more interesting is that only one part of this equation requires both vectors at the same time. However, dot_product isn't just two floats being multiplied; it sums this expression over every dimension of the vector. With the vector dimension count $$dim$$ in hand, all of the following can be pre-calculated at index and query time:
$$\alpha^2$$ is just $$(\frac{max-min}{127})^2$$ and can be stored as a single float value.
$$\sum_{i=0}^{dim-1}min\times\alpha\times int8_i$$ and $$\sum_{i=0}^{dim-1}min\times\alpha\times int8'_i$$ can be pre-calculated and stored as a single float value or calculated once at query time.
$$dim\times min^2$$ can be pre-calculated and stored as a single float value.

Summing over all dimensions, the full dot product expands to:

$$\alpha^2 \times dotProduct(int8, int8') + \sum_{i=0}^{dim-1}min\times\alpha\times int8_i + \sum_{i=0}^{dim-1}min\times\alpha\times int8'_i + dim\times min^2$$
The only calculation required at query time is $$dotProduct(int8, int8')$$; the pre-calculated values are then combined with that result.
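Putting the pieces together, here is an illustrative Java sketch (not Lucene's code; the fixed range [-1, 1] and the helper names are assumptions for the example) that computes the corrected dot product from the integer dot product plus the pre-computed terms, and compares it against the full-precision dot product:

```java
class QuantizedDot {
  static final float MIN = -1f, MAX = 1f;          // assumed quantile range
  static final float ALPHA = (MAX - MIN) / 127f;   // bucket width

  static byte[] quantize(float[] v) {
    byte[] q = new byte[v.length];
    for (int i = 0; i < v.length; i++) {
      q[i] = (byte) Math.round((v[i] - MIN) / ALPHA);
    }
    return q;
  }

  // Pre-computable per-vector correction: sum_i min * alpha * int8_i.
  static float correction(byte[] q) {
    float sum = 0;
    for (byte b : q) sum += b;
    return MIN * ALPHA * sum;
  }

  // alpha^2 * dotProduct(int8, int8') + corr + corr' + dim * min^2
  static float dot(byte[] q1, byte[] q2, float corr1, float corr2) {
    int intDot = 0;
    for (int i = 0; i < q1.length; i++) intDot += q1[i] * q2[i];
    return ALPHA * ALPHA * intDot + corr1 + corr2 + q1.length * MIN * MIN;
  }
}
```

Only the integer `intDot` depends on both vectors; the two corrections and the $$dim \times min^2$$ term can be stored ahead of time, and the result lands close to the float dot product.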
So, how is this accurate at all? Aren't we losing information by quantizing? Yes, we are, but quantization takes advantage of the fact that we don't need all the information. For learned embedding models, the distributions of the various dimensions usually don't have fat tails: they are localized and fairly consistent. Additionally, the error introduced per dimension via quantization is independent, so errors tend to cancel out in typical vector operations like dot_product.
Whew, that was a ton to cover. But now you have a good grasp of the technical benefits of quantization, the math behind it, and how you can calculate the distances between vectors while accounting for the linear transformation. Look next at how we implemented this in Lucene and some of the unique challenges and benefits available there.
When indexing data, Elasticsearch starts building new segments in memory and writes index operations into a transaction log for durability. These in-memory segments eventually get serialized to disk, either when changes need to be made visible (an operation called "refresh" in Elasticsearch) or when memory needs to be reclaimed. This blog focuses on the latter.
To manage memory of its indexing buffer, Elasticsearch keeps track of how much RAM is used across all shards on the local node. Whenever this amount of memory crosses the limit (10% of the heap size by default), it will identify the shard that uses the most memory and refresh it.
When changes get buffered in memory for a given shard, there is not a single pending segment. In order to index concurrently, Lucene maintains a pool of pending segments. When a thread wants to index a new document, it picks a pending segment from this pool, updates it, and then moves the pending segment back to the pool. If there is no free pending segment in the pool, a new one is created. The number of pending segments in the pool is usually on the order of the peak indexing concurrency.
The first change we applied was to update this logic to no longer flush all pending segments of a shard at once, and instead only flush the largest pending segment, using Lucene's `IndexWriter#flushNextBuffer()` API. This helps because the sizes of pending segments are generally not uniform and Lucene has a bias towards updating the largest pending segments, so this new approach flushes fewer, significantly larger segments. And since fewer segments get flushed, less merging is required to keep the number of segments under control.
Managing a shared indexing buffer across many shards is a hard problem. The existing logic assumed that it would be sensible to pick the shard that uses the most memory for its indexing buffer as the next shard to reclaim memory from. After all, this is the most efficient way to buy some time until we reach the maximum amount of memory for the indexing buffer again. But on the other hand, this also penalizes shards that ingest most actively, as they would flush segments much more frequently than shards that have a modest ingestion rate. There are many moving parts here, which make it hard to get a good intuition about how these various factors play together and figure out the best strategy for picking the next shard to flush.
So we ran experiments with various approaches to picking the next shard to flush, and interestingly, picking the largest shard was the worst one, significantly outperformed by picking shards at random. In fact, the only approach that slightly outperformed picking shards at random was to pick shards in a round-robin fashion. This is now how Elasticsearch picks the next shard to flush.
These two changes should help reduce the merging overhead and speed up ingestion, especially with small heaps and with field types that consume significant amounts of RAM in the indexing buffer, like `text` and `match_only_text` fields, or field types that are expensive to merge, like `dense_vector`. Enjoy the speedups!
Lucene previously required `dot_product` to be used only over normalized vectors. Normalization forces all vector magnitudes to equal one. While for many cases this is acceptable, it can cause relevancy issues for certain data sets. A prime example is the embeddings built by Cohere: their vectors use magnitudes to provide more relevant information.
So, why not allow non-normalized vectors in dot-product and thus enable maximum-inner-product? What's the big deal?
Lucene requires non-negative scores, so that matching one more clause in a disjunctive query can only make the score greater, not lower. This is actually important for dynamic pruning optimizations such as block-max WAND, whose efficiency is largely defeated if some clauses may produce negative scores. How does this requirement affect non-normalized vectors?
In the normalized case, all vectors lie on a unit sphere, which makes handling negative scores a matter of simple scaling.
<cite> Figure 1: Two opposite two-dimensional vectors on a 2d unit sphere (i.e. a unit circle). When calculating the dot product here, the worst it can be is [1, 0] \* [-1, 0] = -1. Lucene accounts for this by adding 1 to the result. </cite>

With vectors retaining their magnitude, the range of possible values is unknown.
<cite> Figure 2: When calculating the dot product for these vectors, `[2, 2] \* [-5, -5] = -20`. </cite>

To allow Lucene to utilize block-max WAND with non-normalized vectors, we must scale the scores. This is a fairly simple solution: Lucene scales the scores of non-normalized vectors with a simple piecewise function:
```java
if (dotProduct < 0) {
  return 1 / (1 + -1 * dotProduct);
}
return dotProduct + 1;
```
Now all negative scores lie between 0 and 1, and all positive scores are scaled above 1. This still ensures that higher values mean better matches while removing negative scores. Simple enough, but this is not the final hurdle.
Maximum inner product doesn't follow the same rules as simple Euclidean spaces: the triangle inequality we usually take for granted is abandoned. Unintuitively, a vector is no longer nearest to itself. This can be troubling. Lucene's underlying index structure for vectors is Hierarchical Navigable Small World (HNSW), a graph-based algorithm, so it might rely on Euclidean space assumptions. Or would exploring the graph be too slow in non-Euclidean space?
Some research has indicated that a transformation into euclidean space is required for fast search. Others have gone through the trouble of updating their vector storage enforcing transformations into euclidean space.
This caused us to pause and dig deep into some data. The key question is this: does HNSW provide good recall and latency with maximum-inner-product search? While the original HNSW paper and other published research indicate that it does, we needed to do our due diligence.
The experiments we ran were simple. All of them are over real data sets or slightly modified real data sets. This is vital for benchmarking, as modern neural networks create vectors that adhere to specific characteristics (see the discussion in section 7.8 of this paper). We measured latency (in milliseconds) vs. recall over non-normalized vectors, and compared the numbers with the same measurements after a Euclidean space transformation. In each case, the vectors were indexed into Lucene's HNSW implementation and we measured 1000 iterations of queries. Three individual cases were considered for each dataset: data inserted ordered by magnitude (lesser to greater), data inserted in a random order, and data inserted in reverse order (greater to lesser).
Here are some results from real datasets from Cohere:
<ImageSet> ![wikien-reversed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image2.png) ![wikien-ordered](/assets/images/lucene-bringing-max-inner-product-to-lucene/image9.png) ![wikien-random-transformed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image3.png) </ImageSet> <cite> Figure 3: Here are results for Cohere’s Multilingual model embedding wikipedia articles. [Available on HuggingFace](https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings). The first 100k documents were indexed and tested. </cite> <ImageSet> ![wikienja-reversed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image13.png) ![wikienja-ordered-transformed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image6.png) ![wikienja-random-transformed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image5.png) </ImageSet> <cite> Figure 4: This is a mixture of Cohere’s English and Japanese embeddings over wikipedia. [Both](https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings) [datasets](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ja-embeddings) are available on HuggingFace. </cite>

We also tested against some synthetic datasets to ensure our rigor. We created a data set with e5-small-v2 and scaled the vectors' magnitudes by different statistical distributions. For brevity, I will only show two distributions.
<ImageSet> ![pareto-reversed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image14.png) ![pareto-ordered-transformed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image4.png) ![pareto-random-transformed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image10.png) </ImageSet> <cite> Figure 5: [Pareto distribution](https://en.wikipedia.org/wiki/Pareto_distribution) of magnitudes. A Pareto distribution has a “fat tail,” meaning a portion of the distribution has much larger magnitudes than the rest. </cite> <ImageSet> ![gamma-1-1-reversed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image1.png) ![gamma-1-1-ordered-transformed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image11.png) ![gamma-1-1-random-transformed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image8.png) </ImageSet> <cite> Figure 6: [Gamma distribution](https://en.wikipedia.org/wiki/Gamma_distribution) of magnitudes. This distribution can have high variance, making it unique in our experiments. </cite>

In all our experiments, the only time the transformation seemed warranted was with the synthetic dataset created from the gamma distribution. Even then, the vectors had to be inserted in reverse order, largest magnitudes first, to justify the transformation. These are exceptional cases.
If you want to read about all the experiments, including the mistakes and improvements made along the way, the Lucene GitHub issue has all the details. Here’s to open research and development!
This has been quite a journey, requiring many investigations to make sure maximum-inner-product could be supported in Lucene. We believe the data speaks for itself: no significant transformations or changes to Lucene are required. All this work will soon unlock maximum-inner-product support in Elasticsearch and allow models like the ones provided by Cohere to be first-class citizens in the Elastic Stack.
]]>While lexical search like BM25 is designed to handle long documents, text embedding models are not. All embedding models have limits on the number of tokens they can embed, so longer text input must be chunked into passages shorter than the model’s limit. Now, instead of having one document with all of its metadata, you have multiple passages and embeddings. And if you want to preserve your metadata, it must be added to every new document.
Figure 1: Now, instead of having a single piece of metadata indicating the first chapter of Little Women, you have to index that information for every sentence.
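To make the duplication concrete, here is a hypothetical chunking sketch: it splits a long document into passages that fit a model's limit (approximating tokens by whitespace-separated words, which a real tokenizer would not) and copies the parent metadata onto every chunk:

```python
def chunk_document(doc, max_tokens=128):
    """Split doc["text"] into passages of at most max_tokens words,
    duplicating all other fields (the metadata) onto every passage."""
    words = doc["text"].split()
    chunks = []
    for i in range(0, len(words), max_tokens):
        chunks.append({
            "text": " ".join(words[i:i + max_tokens]),
            # every passage carries a full copy of the parent's metadata
            **{k: v for k, v in doc.items() if k != "text"},
        })
    return chunks

doc = {"title": "Little Women", "chapter": 1, "text": "word " * 300}
passages = chunk_document(doc)
assert len(passages) == 3
assert all(p["title"] == "Little Women" for p in passages)
```

Each of the three passages repeats the title and chapter fields, which is exactly the bloat the nested approach below avoids.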
A way to address this is with Lucene's “join” functionality. This is an integral part of Elasticsearch’s nested field type. It makes it possible to have a top-level document with multiple nested documents, allowing you to search over nested documents and join back against their parent documents. This sounds perfect for multiple passages and vectors belonging to a single top-level document! This is all awesome! But, wait, Elasticsearch® doesn’t support vectors in nested fields. Why not, and what needs to change?
The key issue is how Lucene can join back to the parent documents when searching child vector passages. Like with kNN pre-filtering versus post-filtering, when the joining occurs determines the result quality and quantity. If a user searches for the top four nearest parent documents (not passages) to a query vector, they usually expect four documents. But what if they are searching over child vector passages and all four of the nearest vectors are from the same parent document? This would end up returning just one parent document, which would be surprising. This same kind of issue occurs with post-filtering.
Figure 2: Documents 3, 5, 10 are parent docs. 1, 2 belong to 3; 4 to 5; 6, 7, 8, 9 to 10.
Let’s search with query vector A, and suppose the four nearest passage vectors are 6, 7, 8, and 9. With “post-joining,” you end up retrieving only parent document 10.
Figure 3: Vector “A” matching nearest all the children of 10.
What can we do about this problem? One answer could be, “Just increase the number of vectors returned!” However, at scale, this is untenable. What if every parent has at least 100 children and you want the top 1,000 nearest neighbors? That means you have to search for at least 100,000 children! This gets out of hand quickly. So, what’s another solution?
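The post-joining problem is easy to reproduce. Here is a small sketch using the document IDs from Figure 2 (the function is illustrative, not Lucene's code):

```python
# Parents 3, 5, 10; children 1-2 belong to 3, 4 to 5, 6-9 to 10.
parent_of = {1: 3, 2: 3, 4: 5, 6: 10, 7: 10, 8: 10, 9: 10}

def post_join(nearest_children):
    """Join to parents only AFTER the vector search has finished,
    deduplicating children that share a parent."""
    seen, parents = set(), []
    for child in nearest_children:
        p = parent_of[child]
        if p not in seen:
            seen.add(p)
            parents.append(p)
    return parents

# Query A's four nearest passages are all children of parent 10 ...
top4 = [6, 7, 8, 9]
# ... so a request for 4 parent documents collapses to just one.
assert post_join(top4) == [10]
```

Asking for four documents and getting one back is the surprise described above; over-fetching children only papers over it.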
The solution to the “post-joining” problem is “pre-joining.” Recently added changes to Lucene enable joining against the parent document while searching the HNSW graph! Like with kNN pre-filtering, this ensures that when asked to find the k nearest neighbors of a query vector, we can return not the k nearest passages as represented by dense vectors, but k nearest documents, as represented by their child passages that are most similar to the query vector. What does this actually look like in practice?
Let’s assume we are searching the same nested documents as before:
Figure 4: Documents 3, 5, 10 are parent docs. 1,2 belong to 3; 4 to 5; 6, 7, 8, 9 to 10.
As we search and score documents, instead of tracking children, we track the parent documents and update their scores. Figure 5 shows a simple flow. For each child document visited, we get its score and then track it by its parent document ID. This way, as we search and score the vectors we only gather the parent IDs. This ensures diversification of results with no added complexity to the HNSW algorithm using already existing and powerful tools within Lucene. All this with only a single additional bit of memory required per vector stored.
<Video vidyardUuid="ggtoAkVmb7uvDY4T38GUn6" loop={true} hideControls={true} muted={true} />
Figure 5: As we search the vectors, we score and collect the associated parent document, updating its score only if the new score is more competitive than the previous one.
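The collection step in Figure 5 can be sketched as follows; this is a simplification of the idea (score by child, collect by parent, keep the best score per parent), not Lucene's actual collector:

```python
import heapq

# Same layout as Figure 4: parents 3, 5, 10 with their children.
parent_of = {1: 3, 2: 3, 4: 5, 6: 10, 7: 10, 8: 10, 9: 10}

def prejoin_topk(scored_children, k):
    """Collect by parent while scoring children, keeping only the most
    competitive child score seen for each parent, then take the top k."""
    best = {}
    for child, score in scored_children:
        p = parent_of[child]
        if score > best.get(p, float("-inf")):
            best[p] = score
    return heapq.nlargest(k, best.items(), key=lambda kv: kv[1])

# Children visited during the graph search, with their similarity scores.
visits = [(6, 0.97), (7, 0.95), (8, 0.94), (9, 0.93), (4, 0.80), (1, 0.70)]
top = prejoin_topk(visits, k=4)
assert [p for p, _ in top] == [10, 5, 3]  # distinct parents, best-first
```

Even though four of the six visited children belong to parent 10, the result set contains three distinct parents, which is the diversification the pre-joining approach guarantees.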
But how is this efficient? Glad you asked! Certain restrictions provide some really nice shortcuts. As you can tell from the previous examples, every parent document ID is larger than its children’s IDs. Additionally, parent documents do not contain vectors themselves, meaning children and parents are purely disjoint sets. This affords some nice optimizations via bit sets. A bit set provides an exceptionally fast structure for answering “what is the next bit that is set?” For any child document, we can ask the bit set for the next set bit at or above its ID. Since the sets are disjoint, that next set bit must be the parent document ID.
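A toy version of the lookup, mimicking the next-set-bit operation (Java's BitSet.nextSetBit is the analogous API) on a Python integer used as a bit set:

```python
# Mark the parent document IDs from the running example in a bit set.
PARENT_IDS = [3, 5, 10]
parent_bits = 0
for p in PARENT_IDS:
    parent_bits |= 1 << p

def next_set_bit(bits, from_index):
    """Index of the first set bit at or above from_index (analogous to
    Java's BitSet.nextSetBit), or -1 if there is none."""
    shifted = bits >> from_index
    if shifted == 0:
        return -1
    # Isolate the lowest set bit and convert it back to an absolute index.
    return from_index + (shifted & -shifted).bit_length() - 1

# Because children and parents are disjoint and a parent's ID is larger
# than all of its children's, the next set bit IS the parent.
assert next_set_bit(parent_bits, 7) == 10  # child 7 -> parent 10
assert next_set_bit(parent_bits, 4) == 5   # child 4 -> parent 5
assert next_set_bit(parent_bits, 2) == 3   # child 2 -> parent 3
```

This is why the one-bit-per-document structure mentioned earlier is all the extra memory the approach needs: the parent lookup is a constant-time bit scan rather than a stored mapping.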
In this post, we explored both the challenges of supporting dense document retrieval at scale and our proposed solution using nested fields and joins in Lucene. This work in Lucene paves the way to more naturally storing and searching dense vectors of passages from long text in documents and an overall improvement in document modeling for vector search in Elasticsearch. This is a very exciting step forward for vector search in Elasticsearch!
If you want to chat about this or anything else related to vector search in Elasticsearch, come join us in our Discuss forum.
The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.
In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.
Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.