Introducing approximate nearest neighbor search in Elasticsearch 8.0

blog-thumb-rocket-launch.png

There has been a surge of interest in vector search, thanks to a new generation of machine learning models that can represent all sorts of content as vectors, including text, images, events, and more. Often called “embedding models”, these powerful representations can capture similarity between two pieces of content in a way that goes beyond their surface level characteristics.

k-nearest neighbor (kNN) search algorithms find the vectors in a dataset that are most similar to a query vector. Paired with these vector representations, kNN search opens up exciting possibilities for retrieval:

  • Finding passages likely to contain the answer to a question
  • Detecting near-duplicate images in a large dataset
  • Finding songs that sound similar to a given song

Vector search is poised to become an important component of the search toolbox, alongside traditional techniques like term-based scoring.

Elasticsearch currently supports storing vectors through the dense_vector field type and using them to calculate document scores. This allows users to perform an exact kNN search by scanning all documents. Elasticsearch 8.0 builds on this functionality to support fast, approximate nearest neighbor search (ANN). This represents a much more scalable approach, allowing vector search to run efficiently on large datasets.

ANN in Elasticsearch

What is approximate nearest neighbor search?

There are well-established data structures for kNN on low-dimensional vectors, like KD-trees. In fact, Elasticsearch incorporates KD-trees to support searches on geospatial and numeric data. But modern embedding models for text and images typically produce high-dimensional vectors of 100 - 1000 elements, or even more. These vector representations present a unique challenge, as it’s very difficult to efficiently find nearest neighbors in high dimensions.

Faced with this difficulty, nearest neighbor algorithms usually sacrifice perfect accuracy to improve their speed. These approximate nearest neighbor (ANN) algorithms may not always return the true k nearest vectors. But they run efficiently, scaling to large datasets while maintaining good performance.

Choosing an ANN algorithm

Designing ANN algorithms is an active area of academic research, and there are many promising algorithms to choose from. They often present different trade-offs in terms of their search speed, implementation complexity, and indexing cost. Thankfully there is a great open source project called ann-benchmarks which tests the leading algorithms against several datasets and publishes comparisons.

Elasticsearch 8.0 uses an ANN algorithm called Hierarchical Navigable Small World graphs (HNSW), which organizes vectors into a graph based on their similarity to each other. HNSW shows strong search performance across a variety of ann-benchmarks datasets, and also did well in our own testing. Another benefit of HNSW is that it’s widely used in industry, having been implemented in several different systems. In addition to the original academic paper, there are many helpful resources for learning about the algorithm's details. Although Elasticsearch ANN is currently based on HNSW, the feature is designed in a flexible way to let us incorporate different approaches in the future.

Show me the code!

To index vectors for ANN search, we need to set index: true and specify the similarity metric we’re using to compare them:

PUT index
{
 "mappings": {
   "properties": {
     "image-vector": {
       "type": "dense_vector",
       "dims": 128,
       "index": true,
       "similarity": "l2_norm"
     }
   }
 }
}

PUT index/_doc
{
 "image-vector": [0.12, 1.34, ...]
}
Then, after adding vectors, we can search for the k nearest neighbors to a query vector:
GET index/_knn_search
{
 "knn": {
   "field": "image-vector",
   "query_vector": [-0.5, 9.4, ...],
   "k": 10,
   "num_candidates": 100
 }
}
The new _knn_search endpoint uses HNSW graphs to efficiently retrieve similar vectors. Unlike exact kNN, which performs a full scan of the data, it scales well to large datasets. Here’s an example that compares _knn_search to the exact approach based on script_score queries on a dataset of 1 million image vectors with 128 dimensions, averaging over 10,000 different queries:
Approach         Queries Per Second    Recall (k=10)
script_score           5.257               1.000
_knn_search          849.286               0.945

In this example, ANN search is orders of magnitude faster than the exact approach. Its recall is around 95%, so on average, it finds over 9 out of the 10 true nearest neighbors.

You can check on the performance of kNN search in the Elasticsearch nightly benchmarks. These benchmarks are powered by es-rally, a tool for Elasticsearch benchmarking, specifically the new dense_vector Rally track. We plan to extend Rally to report recall in addition to latency, as it’s also important to track the accuracy of the algorithm. Currently these benchmarks test a dataset of a couple million vectors, but ANN search can certainly scale beyond this with a greater index time or the addition of hardware resources.

Because it is an approximate algorithm, there are special considerations to running ANN compared to other types of search. ANN has both search-time and index-time parameters to control the trade-off between search latency, result accuracy, and indexing cost. It’s important to measure the recall of ANN search on your dataset to make sure the configuration is working well. When jumping into kNN search, the reference guide can be a helpful place to start.

Powered by Apache Lucene

Many of Elasticsearch’s core search capabilities are powered by the Lucene library, an open source project governed by the Apache Software Foundation. Elasticsearch ANN is no exception, and is built on an exciting new Lucene feature for storing and searching numeric vectors. This feature is the result of a great collaboration involving several developers across different organizations. Starting as a bold proposal, it quickly progressed to a working (and fast) implementation. Then came the challenge of designing the API and rounding out the feature.

Since then, the Lucene community has continued to collaborate to push the feature forward. Several developers took interest and made contributions, from redesigning names, to algorithm updates, performance improvements, and more. Lucene’s vector search capabilities are quickly expanding thanks to everyone’s efforts.

In addition to the fruitful collaboration, developing ANN in Lucene brings other major benefits. Lucene's implementation is designed at a low-level to integrate correctly with existing functionality, which allows ANN search to interact seamlessly with other Elasticsearch features. Such a deep integration would not really be possible if we depended on an external ANN library. For example, Lucene ANN transparently handles deleted documents by skipping over 'tombstones' during the graph search. It also respects all of Lucene's data compatibility guarantees, so you can be sure that vector data still works after an upgrade. Finally, the implementation written in Java just like Elasticsearch, which allows us to ensure its security and simplify memory management.

What’s next?

In 8.0, the _knn_search endpoint for efficient ANN search will be released as a “technical preview”. ANN search is a relatively new topic not only for Elastic, but for the industry, and there are significant open questions around how it should behave. What is the best way to combine vector similarity scores with traditional BM25 scores? Should kNN search support pagination? Developing ANN as its own experimental endpoint will let us quickly iterate on and test its behavior. We plan to ultimately integrate ANN into the _search API once we have solid answers for these questions. (Although _knn_search is not yet GA, the dense_vector field type was made GA in 7.6 and continues to have a stable API.)

Some key capabilities we plan to support include ANN with filters, as well as “hybrid” search where ANN results are combined with those from a traditional query. We’re also working to improve indexing speed, as building HNSW graphs can be an expensive operation. We think of this release as just a beginning, and look forward to improving ANN search over the upcoming releases. Your feedback is really valuable, and helps shape the direction of the feature. We’d love to hear from you on GitHub and our Discuss forums (and in Lucene too)!

Try out ANN search on Elastic Cloud by logging into the Elastic Cloud console or signing up for a free 14-day trial.