kNN search in Elasticsearch is organized as a top level section of a search request. We have designed it this way so that:

- It can always return the global k nearest neighbors regardless of the number of shards
- The global k results can be combined with results from other queries to form a hybrid search
- The global k results can be passed to aggregations to form facets

Here is a simplified diagram of how kNN search is executed internally (some phases are omitted):

<cite> Figure 1: The steps for the top level kNN search are:

- A user submits a search request
- The coordinator node sends the kNN search part of the request to the data nodes in the DFS phase
- Each data node runs kNN search and sends back the local top-k results to the coordinator
- The coordinator merges all local results to form the global top k nearest neighbors
- The coordinator sends the global k nearest neighbors back to the data nodes with any additional queries provided
- Each data node runs the additional queries and sends back the local `size` results to the coordinator
- The coordinator merges all local results and sends a response to the user </cite>

We first run kNN search in the DFS phase to obtain the global top k results. These global k results are then passed to other parts of the search request, such as other queries or aggregations. Even though the execution looks complex, from a user’s perspective this model of running kNN search is simple, as the user can always be sure that kNN search returns the global k results.

With time we realized there is also a need to represent kNN search as a query. A query is a core component of a search request in Elasticsearch, and representing kNN search as a query allows the flexibility to combine it with other queries to address more complex requests.

The kNN query, unlike the top level kNN search, doesn’t have a `k` parameter. The number of results (nearest neighbors) returned is defined by the `size` parameter, as in other queries. Similar to kNN search, the `num_candidates` parameter defines how many candidates to consider on each shard while executing the kNN search.

```
GET products/_search
{
  "size": 3,
  "query": {
    "knn": {
      "field": "embedding",
      "query_vector": [2, 2, 2, 0],
      "num_candidates": 10
    }
  }
}
```

The kNN query is executed differently from the top level kNN search. Here is a simplified diagram that describes how a kNN query is executed internally (some phases are omitted):

<cite> Figure 2: The steps for query based kNN search are:

- A user submits a search request
- The coordinator sends to the data nodes a kNN search query with additional queries provided
- Each data node runs the query and sends back the local `size` results to the coordinator node
- The coordinator node merges all local results and sends a response to the user </cite>

We run kNN search on a shard to get `num_candidates` results; these results are passed to other queries and aggregations on the shard to get `size` results from the shard. As we don’t collect the global k nearest neighbors first, in this model the number of nearest neighbors collected and visible to other queries and aggregations depends on the number of shards.

Let’s look at API examples that demonstrate differences between the top level kNN search and kNN query.

We create an index of products and index some documents:

```
PUT products
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "department": {
        "type": "keyword"
      },
      "brand": {
        "type": "keyword"
      },
      "description": {
        "type": "text"
      },
      "embedding": {
        "type": "dense_vector",
        "index": true,
        "similarity": "l2_norm"
      },
      "price": {
        "type": "float"
      }
    }
  }
}
```

```
POST products/_bulk?refresh=true
{"index":{"_id":1}}
{"department":"women","brand": "Levi's", "description":"high-rise red jeans","embedding":[1,1,1,1],"price":100}
{"index":{"_id":2}}
{"department":"women","brand": "Calvin Klein","description":"high-rise beautiful jeans","embedding":[1,1,1,1],"price":250}
{"index":{"_id":3}}
{"department":"women","brand": "Gap","description":"every day jeans","embedding":[1,1,1,1],"price":50}
{"index":{"_id":4}}
{"department":"women","brand": "Levi's","description":"jeans","embedding":[2,2,2,0],"price":75}
{"index":{"_id":5}}
{"department":"women","brand": "Levi's","description":"luxury jeans","embedding":[2,2,2,0],"price":150}
{"index":{"_id":6}}
{"department":"men","brand": "Levi's", "description":"jeans","embedding":[2,2,2,0],"price":50}
{"index":{"_id":7}}
{"department":"women","brand": "Levi's", "description":"jeans 2023","embedding":[2,2,2,0],"price":150}
```

The kNN query, similar to the top level kNN search, has a `num_candidates` parameter and an internal `filter` parameter that acts as a pre-filter.

```
GET products/_search
{
  "size": 3,
  "query": {
    "knn": {
      "field": "embedding",
      "query_vector": [2, 2, 2, 0],
      "num_candidates": 10,
      "filter": {
        "term": {
          "department": "women"
        }
      }
    }
  }
}
```

The kNN query can produce more diverse results than the top level kNN search for collapsing and aggregations. For the kNN query below, on each shard we execute kNN search to obtain the 10 nearest neighbors, which are then passed to collapse to get the top 3 results. Thus, we will get 3 diverse hits in the response.

```
GET products/_search
{
  "size": 3,
  "query": {
    "knn": {
      "field": "embedding",
      "query_vector": [2, 2, 2, 0],
      "num_candidates": 10,
      "filter": {
        "term": {
          "department": "women"
        }
      }
    }
  },
  "collapse": {
    "field": "brand"
  }
}
```

The top level kNN search first gets the global top 3 results in the DFS phase, and then passes them to collapse in the query phase. We will get only 1 hit in the response, as all of the global 3 nearest neighbors happened to be from the same brand.

```
GET products/_search?size=3
{
  "knn": {
    "field": "embedding",
    "query_vector": [2, 2, 2, 0],
    "k": 3,
    "num_candidates": 10,
    "filter": {
      "term": {
        "department": "women"
      }
    }
  },
  "collapse": {
    "field": "brand"
  }
}
```

Similarly for aggregations, the kNN query allows us to get 3 distinct buckets, while the top level kNN search allows only 1.

```
GET products/_search
{
  "size": 0,
  "query": {
    "knn": {
      "field": "embedding",
      "query_vector": [2, 2, 2, 0],
      "num_candidates": 10,
      "filter": {
        "term": {
          "department": "women"
        }
      }
    }
  },
  "aggs": {
    "brands": {
      "terms": {
        "field": "brand"
      }
    }
  }
}

GET products/_search
{
  "size": 0,
  "knn": {
    "field": "embedding",
    "query_vector": [2, 2, 2, 0],
    "k": 3,
    "num_candidates": 10,
    "filter": {
      "term": {
        "department": "women"
      }
    }
  },
  "aggs": {
    "brands": {
      "terms": {
        "field": "brand"
      }
    }
  }
}
```

Now, let’s look at other examples that show the flexibility of the kNN query, specifically how it can be combined with other queries.

kNN can be part of a boolean query (with the caveat that all external query filters are applied as post-filters with respect to the kNN search). We can use the `_name` parameter on the kNN query to enrich results with extra information that tells whether the kNN query was a match and what its score contribution was.

```
GET products/_search?include_named_queries_score
{
  "size": 3,
  "query": {
    "bool": {
      "should": [
        {
          "knn": {
            "field": "embedding",
            "query_vector": [2, 2, 2, 0],
            "num_candidates": 10,
            "_name": "knn_query"
          }
        },
        {
          "match": {
            "description": {
              "query": "luxury",
              "_name": "bm25query"
            }
          }
        }
      ]
    }
  }
}
```

kNN can also be part of complex queries, such as a pinned query. This is useful when we want to display the top nearest results but also promote a selected number of other results.

```
GET products/_search
{
  "size": 3,
  "query": {
    "pinned": {
      "ids": [ "1", "2" ],
      "organic": {
        "knn": {
          "field": "embedding",
          "query_vector": [2, 2, 2, 0],
          "num_candidates": 10,
          "_name": "knn_query"
        }
      }
    }
  }
}
```

We can even make the kNN query part of a function_score query. This is useful when we need to define custom scores for results returned by the kNN query:

```
GET products/_search
{
  "size": 3,
  "query": {
    "function_score": {
      "query": {
        "knn": {
          "field": "embedding",
          "query_vector": [2, 2, 2, 0],
          "num_candidates": 10,
          "_name": "knn_query"
        }
      },
      "functions": [
        {
          "filter": { "match": { "department": "men" } },
          "weight": 100
        },
        {
          "filter": { "match": { "department": "women" } },
          "weight": 50
        }
      ]
    }
  }
}
```

Making the kNN query part of a dis_max query is useful when we want to combine results from kNN search and other queries, so that a document’s score comes from the highest-ranked clause, with a tie-breaking increment for any additional matching clauses.

```
GET products/_search
{
  "size": 5,
  "query": {
    "dis_max": {
      "queries": [
        {
          "knn": {
            "field": "embedding",
            "query_vector": [2, 2, 2, 0],
            "num_candidates": 3,
            "_name": "knn_query"
          }
        },
        {
          "match": {
            "description": "high-rise jeans"
          }
        }
      ],
      "tie_breaker": 0.8
    }
  }
}
```

kNN search as a query was introduced in the 8.12 release. Please try it out; we would appreciate any feedback.

While HNSW is a powerful and flexible way to store and search vectors, it does require a significant amount of memory to run quickly. For example, querying 1 million $$float32$$ vectors of 768 dimensions requires roughly $$1{,}000{,}000 \times 4 \times (768 + 12) = 3{,}120{,}000{,}000$$ bytes, or about 3GB of RAM. Once you start searching a significant number of vectors, this gets expensive. One way to use around 75% less memory is through byte quantization. Lucene, and consequently Elasticsearch, has supported indexing byte vectors for some time, but building these vectors has been the user's responsibility. This is about to change, as we have introduced int8 scalar quantization in Lucene.
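The arithmetic above is easy to check directly. A quick back-of-the-envelope sketch, using the same sizing formula and assuming one byte per dimension plus a 4-byte corrective float for the quantized form:

```python
num_vectors = 1_000_000
dims = 768
bytes_per_float = 4

# Rough float32 HNSW sizing formula from the text:
# num_vectors * 4 * (dims + 12)
raw_bytes = num_vectors * bytes_per_float * (dims + 12)
print(raw_bytes)  # 3120000000 bytes, roughly 3GB

# int8 scalar quantization stores one byte per dimension plus a
# corrective multiplier float, cutting memory by roughly 75%.
quantized_bytes = num_vectors * (dims + 4)
print(quantized_bytes / raw_bytes)  # roughly 0.25
```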

All quantization techniques are lossy transformations of the raw data, meaning some information is lost for the sake of space. For an in-depth explanation of scalar quantization, see: Scalar Quantization 101. At a high level, scalar quantization is a lossy compression technique where some simple math gives significant space savings with very little impact on recall.

Those used to working with Elasticsearch may be familiar with these concepts already, but here is a quick overview of the distribution of documents for search.

Each Elasticsearch index is composed of multiple shards. While each shard can only be assigned to a single node, multiple shards per index gives you compute parallelism across nodes.

Each shard is a single Lucene index. A Lucene index consists of multiple read-only segments. During indexing, documents are buffered and periodically flushed into a read-only segment. When certain conditions are met, these segments can be merged in the background into a larger segment. All of this is configurable and has its own set of complexities. But when we talk about segments and merging, we are talking about read-only Lucene segments and the automatic periodic merging of these segments. Here is a deeper dive into segment merging and design decisions.

Every segment in Lucene stores the following: the individual vectors, the HNSW graph indices, the quantized vectors, and the calculated quantiles. For brevity's sake, we will focus on how Lucene stores quantized and raw vectors. For every segment, we keep track of the raw vectors in the `vec` file, the quantized vectors and a single corrective multiplier float per vector in the `veq` file, and the metadata around the quantization in the `vemq` file.

<cite> Figure 1: Simplified layout of the raw vector storage file. It takes up $$dimension \times 4 \times numVectors$$ bytes of disk space, since float values are 4 bytes each. Because we are quantizing, these will not get loaded during HNSW search. They are only used if specifically requested (e.g. for brute-force secondary rescoring), or for re-quantization during segment merge. </cite>

<cite> Figure 2: Simplified layout of the $$.veq$$ file. It takes up $$(dimension + 4) \times numVectors$$ bytes and will be loaded into memory during search. The extra 4 bytes per vector account for the corrective multiplier float, used to adjust scoring for better accuracy and recall. </cite>

<cite> Figure 3: The simplified layout of the metadata file. Here is where we keep track of quantization and vector configuration along with the calculated quantiles for this segment. </cite>

So, for each segment, we store not only the quantized vectors, but the quantiles used in making these quantized vectors and the original raw vectors. But, why do we keep the raw vectors around at all?

Since Lucene periodically flushes to read-only segments, each segment has only a partial view of all your data. This means the quantiles calculated apply directly only to that sample of your entire data set. This isn't a big deal if your sample adequately represents your entire corpus. But Lucene allows you to sort your index in various ways, so you could be indexing data sorted in a way that biases per-segment quantile calculations. Also, you can flush the data whenever you like! Your sample set could be tiny, even just one vector. Yet another wrench is that you control when merges occur: while Elasticsearch has configured defaults and periodic merging, you can request a merge whenever you like via the `_force_merge` API. So how do we allow all this flexibility while still producing quantization that provides good recall?

Lucene's vector quantization will automatically adjust over time. Because Lucene is designed with a read-only segment architecture, we have guarantees that the data in each segment hasn't changed and clear demarcations in the code for when things can be updated. This means during segment merge we can adjust quantiles as necessary and possibly re-quantize vectors.

<cite> Figure 4: Three example segments with different quantiles. </cite>

But isn't re-quantization expensive? It does have some overhead, but Lucene handles quantiles intelligently and only fully re-quantizes when necessary. Let's use the segments in Figure 4 as an example, giving segments $$A$$ and $$B$$ $$1,000$$ documents each and segment $$C$$ only $$100$$ documents. Lucene takes a weighted average of the quantiles, and if the resulting merged quantiles are near enough to a segment's original quantiles, that segment doesn't have to be re-quantized and will simply utilize the newly merged quantiles.

<cite> Figure 5: Example of merged quantiles where segments $$A$$ and $$B$$ have $$1000$$ documents and $$C$$ only has $$100$$. </cite>

In the situation visualized in Figure 5, we can see that the resulting merged quantiles are very similar to the original quantiles in $$A$$ and $$B$$, and thus do not justify re-quantizing those segments' vectors. Segment $$C$$ deviates too much, so the vectors in $$C$$ would get re-quantized with the newly merged quantile values.
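The merge-time decision described above can be sketched roughly as follows. The quantile values echo Figures 4 and 5, and the tolerance threshold here is purely illustrative, not Lucene's actual heuristic:

```python
def merge_quantiles(segments):
    """Weighted average of per-segment quantiles, weighted by doc count.

    segments: list of (doc_count, (lower_quantile, upper_quantile)).
    """
    total = sum(n for n, _ in segments)
    lo = sum(n * q[0] for n, q in segments) / total
    hi = sum(n * q[1] for n, q in segments) / total
    return lo, hi

def needs_requantization(segment_quantiles, merged, tolerance=0.1):
    """Re-quantize a segment only if its quantiles drifted too far from
    the merged ones (illustrative threshold, not Lucene's)."""
    lo, hi = segment_quantiles
    mlo, mhi = merged
    return abs(lo - mlo) > tolerance or abs(hi - mhi) > tolerance

# Hypothetical segments A, B, C: A and B dominate the weighted average.
segments = [(1000, (-0.75, 0.86)), (1000, (-0.71, 0.9)), (100, (-0.5, 0.6))]
merged = merge_quantiles(segments)
decisions = [needs_requantization(q, merged) for _, q in segments]
print(decisions)  # [False, False, True] -- only C gets re-quantized
```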

There are indeed extreme cases where the merged quantiles differ dramatically from any of the original quantiles. In this case, we will take a sample from each segment and fully re-calculate the quantiles.

So, is it fast and does it still provide good recall? The following numbers were gathered by running the experiment on a `c3-standard-8` GCP instance. To ensure a fair comparison with $$float32$$, we used an instance large enough to hold the raw vectors in memory. We indexed $$400,000$$ Cohere Wiki vectors using maximum-inner-product.

<cite> Figure 6: Recall@10 for quantized vectors vs raw vectors. The search performance of quantized vectors is significantly faster than raw, and recall is quickly recoverable by gathering just 5 more vectors; visible by $$quantized@15$$. </cite>

Figure 6 tells the story. While there is a recall difference, as is to be expected, it's not significant, and it disappears after gathering just 5 more vectors. All this with $$2\times$$ faster segment merges and a quarter of the memory of $$float32$$ vectors.

Lucene provides a unique solution to a difficult problem. There is no “training” or “optimization” step required for quantization. In Lucene, it will just work. There is no worry about having to “re-train” your vector index if your data shifts. Lucene will detect significant changes and take care of this automatically over the lifetime of your data. Look forward to when we bring this capability into Elasticsearch!

Scalar quantization takes each vector dimension and buckets it into some smaller data type. For the rest of the blog, we will assume quantizing $float32$ values into $int8$. Bucketing values accurately isn't as simple as rounding the floating point values to the nearest integer: many models output vectors whose dimensions lie continuously in the range $[-1.0, 1.0]$, so two different values like 0.123 and 0.321 would both be rounded to 0. Ultimately, a vector would use only a handful of its 255 available buckets in $int8$, losing too much information.

<cite> Figure 1: Illustration of quantization goals, bucketing continuous values from $-1.0$ to $1.0$ into discrete $int8$ values. </cite>The math behind the numerical transformation isn't too complicated. Since we can calculate the minimum and maximum values for the floating point range, we can use min-max normalization and then linearly shift the values.

```
int8 \approx \frac{127}{max - min} \times (float32 - min)
```

```
float32 \approx \frac{max - min}{127} \times int8 + min
```

<cite>
Figure 2: Equations for transforming between $int8$ and $float32$. Note, these are lossy transformations and not exact. In the following examples, we use only the positive values within int8. This aligns with the Lucene implementation.
</cite>
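Here is a minimal Python sketch of the two equations above, using the positive 0..127 bucket range mentioned in the caption (the clamping of out-of-range values is our own illustrative choice):

```python
def quantize(value, min_val, max_val):
    """Map a float in [min_val, max_val] onto the 0..127 bucket range
    (positive int8 values only, per the text)."""
    scaled = 127 / (max_val - min_val) * (value - min_val)
    return round(min(max(scaled, 0), 127))  # clamp, then bucket

def dequantize(bucket, min_val, max_val):
    """Lossy inverse: recover an approximate float from a bucket."""
    return (max_val - min_val) / 127 * bucket + min_val

lo, hi = -1.0, 1.0
for v in (0.123, 0.321):
    b = quantize(v, lo, hi)
    print(v, b, dequantize(b, lo, hi))
```

Unlike naive rounding, 0.123 and 0.321 now land in distinct buckets (71 and 84), and dequantizing recovers values close to the originals.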
A quantile is a slice of a distribution that contains a certain percentage of the values. For example, it may be that 99% of our floating point values lie in $[-0.75, 0.86]$ rather than the true minimum and maximum of $[-1.0, 1.0]$. Any values less than -0.75 or greater than 0.86 are considered outliers. If you include outliers when quantizing, you have fewer available buckets for your most common values, and fewer buckets can mean less accuracy and thus greater loss of information.

<cite> Figure 3: Illustration of the 99% confidence interval and the individual quantile values. 99% of all values fall within the range $[-0.75, 0.86]$. </cite>

This is all well and good, but now that we know how to quantize values, how can we actually calculate distances between two quantized vectors? Is it as simple as a regular dot_product?

We are still missing one vital piece: how do we calculate the distance between two quantized vectors? While we haven't shied away from math yet in this blog, we are about to do a bunch more. Time to break out your pencils and try to remember polynomials and basic algebra.

The basic requirement for dot_product and cosine similarity is being able to multiply floating point values together and sum up their results. We already know how to transform between $float32$ and $int8$ values, so what does multiplication look like with our transformations?

```
float32_i \times float32'_i \approx (\frac{max - min}{127} \times int8_i + min) \times (\frac{max - min}{127} \times int8'_i + min)
```

We can then expand this multiplication and, to simplify, substitute $$\alpha$$ for $$\frac{max - min}{127}$$.

```
\alpha^2 \times int8_i \times int8'_i + \alpha \times int8_i \times min + \alpha \times int8'_i \times min + min^2
```

What makes this even more interesting is that only one part of this equation requires both vectors at the same time. However, dot_product isn't just two floats being multiplied, but all the floats for each dimension of the vector. With the vector dimension count $$dim$$ in hand, all of the following can be pre-calculated at storage time or query time.

$$\alpha^2$$, which is just $$(\frac{max-min}{127})^2$$, can be stored as a single float value.

$$\sum_{i=0}^{dim-1}min\times\alpha\times int8_i$$ and $$\sum_{i=0}^{dim-1}min\times\alpha\times int8'_i$$ can be pre-calculated and stored as a single float value or calculated once at query time.

$$dim\times min^2$$ can be pre-calculated and stored as a single float value.

Putting all of this together:

```
\alpha^2 \times dotProduct(int8, int8') + \sum_{i=0}^{dim-1}min\times\alpha\times int8_i + \sum_{i=0}^{dim-1}min\times\alpha\times int8'_i + dim\times min^2
```

The only per-pair calculation required for dot_product is $$dotProduct(int8, int8')$$, with the pre-calculated values combined into the result.
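As a sanity check, here is a small Python sketch (with made-up vectors and the full $[-1, 1]$ range standing in for the quantiles) showing that the int8 dot product plus the pre-computed corrective terms closely approximates the float dot product:

```python
min_val, max_val = -1.0, 1.0
alpha = (max_val - min_val) / 127

def quantize(vec):
    """Bucket each dimension into the positive 0..127 int8 range."""
    return [round(127 / (max_val - min_val) * (v - min_val)) for v in vec]

# Illustrative vectors, not real embeddings.
a = [0.5, -0.25, 0.75]
b = [-0.1, 0.9, 0.3]
qa, qb = quantize(a), quantize(b)
dim = len(a)

# Pre-computable corrective terms: each depends on only one vector,
# so it can be stored alongside that vector or computed once per query.
corr_a = sum(min_val * alpha * x for x in qa)
corr_b = sum(min_val * alpha * x for x in qb)

# Only the int8 dot product needs both vectors at the same time.
approx = (alpha ** 2 * sum(x * y for x, y in zip(qa, qb))
          + corr_a + corr_b + dim * min_val ** 2)
exact = sum(x * y for x, y in zip(a, b))
print(exact, approx)  # the approximation lands close to the exact value
```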

So, how is this accurate at all? Aren't we losing information by quantizing? Yes, we are, but quantization takes advantage of the fact that we don't need all the information. For learned embedding models, the distributions of the various dimensions usually don't have fat tails; they are localized and fairly consistent. Additionally, the error introduced per dimension via quantization is independent, meaning the error tends to cancel out in typical vector operations like dot_product.

Whew, that was a ton to cover. But now you have a good grasp of the technical benefits of quantization, the math behind it, and how you can calculate the distances between vectors while accounting for the linear transformation. Look next at how we implemented this in Lucene and some of the unique challenges and benefits available there.

Lucene has required `dot_product` to be used only over normalized vectors. Normalization forces all vector magnitudes to equal one. While for many cases this is acceptable, it can cause relevancy issues for certain data sets. A prime example is the embeddings built by Cohere: their vectors use magnitudes to provide more relevant information.

So, why not allow non-normalized vectors with dot-product and thus enable maximum-inner-product? What's the big deal?

Lucene requires non-negative scores, so that matching one more clause in a disjunctive query can only make the score greater, not lower. This is actually important for dynamic pruning optimizations such as block-max WAND, whose efficiency is largely defeated if some clauses may produce negative scores. How does this requirement affect non-normalized vectors?

In the normalized case, all vectors lie on a unit sphere, so handling negative scores is a matter of simple scaling.

<cite> Figure 1: Two opposite two-dimensional vectors on a 2d unit sphere (i.e. a unit circle). When calculating the dot-product here, the worst it can be is `[1, 0] * [-1, 0] = -1`. Lucene accounts for this by adding 1 to the result. </cite>With vectors retaining their magnitude, the range of possible values is unknown.

<cite> Figure 2: When calculating the dot-product for these vectors, `[2, 2] * [-5, -5] = -20` </cite>To allow Lucene to utilize block-max WAND with non-normalized vectors, we must scale the scores. This is a fairly simple solution: Lucene scales non-normalized vectors with a simple piecewise function:

```
if (dotProduct < 0) {
  return 1 / (1 + -1 * dotProduct);
}
return dotProduct + 1;
```

Now all negative scores fall between 0 and 1, and all positive scores are scaled above 1. This still ensures that higher values mean better matches while removing negative scores. Simple enough, but this is not the final hurdle.
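A direct Python port of that piecewise function makes its two properties easy to verify: all outputs are positive, and ordering is preserved.

```python
def scale_max_inner_product_score(dot_product: float) -> float:
    """Piecewise scaling from the snippet above: negative dot products
    are squashed into (0, 1), non-negative ones are shifted above 1."""
    if dot_product < 0:
        return 1 / (1 + -1 * dot_product)
    return dot_product + 1

# Monotonically increasing inputs yield monotonically increasing,
# strictly positive scores -- the property block-max WAND relies on.
scores = [scale_max_inner_product_score(d) for d in (-20.0, -1.0, 0.0, 2.5)]
print(scores)  # [~0.0476, 0.5, 1.0, 3.5]
```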

Maximum-inner-product doesn't follow the same rules as simple euclidean spaces. The assumed triangle inequality is abandoned and, unintuitively, a vector is no longer nearest to itself. This can be troubling. Lucene’s underlying index structure for vectors is Hierarchical Navigable Small World (HNSW), a graph-based algorithm. Might it rely on euclidean space assumptions, or would exploring the graph be too slow in a non-euclidean space?

Some research has indicated that a transformation into euclidean space is required for fast search. Others have gone through the trouble of updating their vector storage enforcing transformations into euclidean space.

This caused us to pause and dig deep into some data. The key question is this: does HNSW provide good recall and latency with maximum-inner-product search? While the original HNSW paper and other published research indicate that it does, we needed to do our due diligence.

The experiments we ran were simple. All of the experiments are over real data sets or slightly modified real data sets. This is vital for benchmarking, as modern neural networks create vectors that adhere to specific characteristics (see the discussion in section 7.8 of this paper). We measured latency (in milliseconds) vs. recall over non-normalized vectors, and compared the numbers with the same measurements after a euclidean space transformation. In each case, the vectors were indexed into Lucene’s HNSW implementation and we measured 1,000 query iterations. Three individual cases were considered for each dataset: data inserted ordered by magnitude (lesser to greater), data inserted in a random order, and data inserted in reverse order (greater to lesser).

Here are some results from real datasets from Cohere:

<ImageSet> ![wikien-reversed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image2.png) ![wikien-ordered](/assets/images/lucene-bringing-max-inner-product-to-lucene/image9.png) ![wikien-random-transformed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image3.png) </ImageSet> <cite> Figure 3: Here are results for the Cohere’s Multilingual model embedding wikipedia articles. [Available on HuggingFace](https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings). The first 100k documents were indexed and tested. </cite> <ImageSet> ![wikienja-reversed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image13.png) ![wikienja-ordered-transformed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image6.png) ![wikienja-random-transformed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image5.png) </ImageSet> <cite> Figure 4: This is a mixture of Cohere’s English and Japanese embeddings over wikipedia. [Both](https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings) [datasets](https://huggingface.co/datasets/Cohere/wikipedia-22-12-ja-embeddings) are available on HuggingFace. </cite>We also tested against some synthetic datasets to ensure our rigor. We created a data set with e5-small-v2 and scaled the vector's magnitudes by different statistical distributions. For brevity, I will only show two distributions.

<ImageSet> ![pareto-reversed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image14.png) ![pareto-ordered-transformed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image4.png) ![pareto-random-transformed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image10.png) </ImageSet> <cite> Figure 5: [Pareto distribution](https://en.wikipedia.org/wiki/Pareto_distribution) of magnitudes. A pareto distribution has a “fat tail” meaning there is a portion of the distribution with a much larger magnitude than others. </cite> <ImageSet> ![gamma-1-1-reversed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image1.png) ![gamma-1-1-ordered-transformed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image11.png) ![gamma-1-1-random-transformed](/assets/images/lucene-bringing-max-inner-product-to-lucene/image8.png) </ImageSet> <cite> Figure 6: [Gamma distribution](https://en.wikipedia.org/wiki/Gamma_distribution) of magnitudes. This distribution can have high variance and makes it unique in our experiments. </cite>In all our experiments, the only time where the transformation seemed warranted was the synthetic dataset created with the gamma distribution. Even then, the vectors must be inserted in reverse order, largest magnitudes first, to justify the transformation. These are exceptional cases.

If you want to read about all the experiments, including the mistakes and improvements made along the way, here is the Lucene GitHub issue with all the details. Here’s one for open research and development!

This has been quite a journey, requiring many investigations to make sure maximum-inner-product could be supported in Lucene. We believe the data speaks for itself: no significant transformations or changes to Lucene were required. All this work will soon unlock maximum-inner-product support in Elasticsearch and allow models like the ones provided by Cohere to be first-class citizens in the Elastic Stack.

While lexical search like BM25 is already designed for long documents, text embedding models are not. All embedding models have limitations on the number of tokens they can embed, so longer text input must be chunked into passages shorter than the model’s limit. Now, instead of having one document with all its metadata, you have multiple passages and embeddings. And if you want to preserve your metadata, it must be added to every new document.

Figure 1: Now, instead of having a single piece of metadata indicating the first chapter of Little Women, you have to index that information for every sentence.

A way to address this is with Lucene's “join” functionality. This is an integral part of Elasticsearch’s nested field type. It makes it possible to have a top-level document with multiple nested documents, allowing you to search over nested documents and join back against their parent documents. This sounds perfect for multiple passages and vectors belonging to a single top-level document! This is all awesome! But, wait, Elasticsearch® doesn’t support vectors in nested fields. Why not, and what needs to change?

The key issue is how Lucene can join back to the parent documents when searching child vector passages. Like with kNN pre-filtering versus post-filtering, when the joining occurs determines the result quality and quantity. If a user searches for the top four nearest *parent documents (not passages) to a query vector*, they usually expect four documents. But what if they are searching over child vector passages and all four of the nearest vectors are from the same parent document? This would end up returning just *one* parent document, which would be surprising. This same kind of issue occurs with post-filtering.

Figure 2: Documents 3, 5, 10 are parent docs. 1, 2 belong to 3; 4 to 5; 6, 7, 8, 9 to 10.

Say we search with query vector A, and the four nearest passage vectors are 6, 7, 8, and 9. With “post-joining,” you end up retrieving only parent document 10.

Figure 3: Vector “A” matching nearest all the children of 10.

What can we do about this problem? One answer could be, “Just increase the number of vectors returned!” However, at scale, this is untenable. What if every parent has at least 100 children and you want the top 1,000 nearest neighbors? That means you have to search for at least 100,000 children! This gets out of hand quickly. So, what’s another solution?

The solution to the “post-joining” problem is “pre-joining.” Recently added changes to Lucene enable joining against the parent document while searching the HNSW graph! Like with kNN pre-filtering, this ensures that when asked to find the k nearest neighbors of a query vector, we can return not the k nearest passages as represented by dense vectors, but k nearest *documents*, as represented by their child passages that are most similar to the query vector. What does this actually look like in practice?

Let’s assume we are searching the same nested documents as before:

Figure 4: Documents 3, 5, 10 are parent docs. 1,2 belong to 3; 4 to 5; 6, 7, 8, 9 to 10.

As we search and score documents, instead of tracking children, we track the parent documents and update their scores. Figure 5 shows a simple flow. For each child document visited, we get its score and then track it by its parent document ID. This way, as we search and score the vectors we only gather the parent IDs. This ensures diversification of results with no added complexity to the HNSW algorithm using already existing and powerful tools within Lucene. All this with only a single additional bit of memory required per vector stored.

<Video vidyardUuid="ggtoAkVmb7uvDY4T38GUn6" loop={true} hideControls={true} muted={true} />

Figure 5: As we search the vectors, we score and collect the associated parent document, updating its score only if it is more competitive than the previous one.

But how is this efficient? Glad you asked! There are certain restrictions that provide some really nice shortcuts. As you can tell from the previous examples, all parent document IDs are larger than their child IDs. Additionally, parent documents do not contain vectors themselves, meaning children and parents are fully disjoint sets. This affords some nice optimizations via bit sets. A bit set provides an exceptionally fast structure for “tell me the next bit that is set.” For any child document, we can ask the bit set, “Hey, what's the next ID larger than mine that is in the set?” Since the sets are disjoint, the next set bit must be the parent document ID.
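A rough Python sketch of this idea, using a sorted list of parent IDs to stand in for Lucene's bit set and `bisect` for the "next set bit" lookup (the doc IDs mirror the figures above; the scores are made up):

```python
import bisect

# Parents 3, 5, 10 (as in Figure 4); children 1-2 belong to 3,
# 4 to 5, and 6-9 to 10. Parent IDs always exceed their children's.
parent_ids = [3, 5, 10]  # stands in for the bit set of parent documents

def parent_of(child_id: int) -> int:
    """Next 'set bit' after the child ID: because parents and children
    are disjoint and a parent always follows its children, this is the
    child's parent document."""
    return parent_ids[bisect.bisect_right(parent_ids, child_id)]

# During graph search, track the best score seen per parent instead of
# collecting raw child hits -- this is what diversifies the results.
best_per_parent: dict[int, float] = {}
for child_id, score in [(6, 0.9), (7, 0.95), (8, 0.8), (1, 0.7)]:
    pid = parent_of(child_id)
    best_per_parent[pid] = max(best_per_parent.get(pid, 0.0), score)

print(best_per_parent)  # {10: 0.95, 3: 0.7}
```

Children 6, 7, and 8 all collapse into parent 10 with its single best score, while child 1 surfaces parent 3, exactly the diversification described above.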

In this post, we explored both the challenges of supporting dense document retrieval at scale and our proposed solution using nested fields and joins in Lucene. This work in Lucene paves the way to more naturally storing and searching dense vectors of passages from long text in documents and an overall improvement in document modeling for vector search in Elasticsearch. This is a very exciting step forward for vector search in Elasticsearch!

If you want to chat about this or anything else related to vector search in Elasticsearch, come join us in our Discuss forum.

*The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.*

*In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.*

*Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.*