Introducing approximate nearest neighbor search in Elasticsearch 8.0
There has been a surge of interest in vector search, thanks to a new generation of machine learning models that can represent all sorts of content as vectors, including text, images, events, and more. Often called “embedding models”, these powerful representations can capture similarity between two pieces of content in a way that goes beyond their surface level characteristics.
k-nearest neighbor (kNN) search algorithms find the vectors in a dataset that are most similar to a query vector. Paired with these vector representations, kNN search opens up exciting possibilities for retrieval:
- Finding passages likely to contain the answer to a question
- Detecting near-duplicate images in a large dataset
- Finding songs that sound similar to a given song
Vector search is poised to become an important component of the search toolbox, alongside traditional techniques like term-based scoring.
Elasticsearch currently supports storing vectors through the dense_vector field type and using them to calculate document scores. This allows users to perform an exact kNN search by scanning all documents. Elasticsearch 8.0 builds on this functionality to support fast, approximate nearest neighbor search (ANN). This represents a much more scalable approach, allowing vector search to run efficiently on large datasets.
ANN in Elasticsearch
What is approximate nearest neighbor search?
There are well-established data structures for kNN on low-dimensional vectors, like KD-trees. In fact, Elasticsearch incorporates KD-trees to support searches on geospatial and numeric data. But modern embedding models for text and images typically produce high-dimensional vectors of 100 - 1000 elements, or even more. These vector representations present a unique challenge, as it’s very difficult to efficiently find nearest neighbors in high dimensions.
Faced with this difficulty, nearest neighbor algorithms usually sacrifice perfect accuracy to improve their speed. These approximate nearest neighbor (ANN) algorithms may not always return the true k nearest vectors. But they run efficiently, scaling to large datasets while maintaining good performance.
Choosing an ANN algorithm
Designing ANN algorithms is an active area of academic research, and there are many promising algorithms to choose from. They often present different trade-offs in terms of their search speed, implementation complexity, and indexing cost. Thankfully there is a great open source project called ann-benchmarks which tests the leading algorithms against several datasets and publishes comparisons.
Elasticsearch 8.0 uses an ANN algorithm called Hierarchical Navigable Small World graphs (HNSW), which organizes vectors into a graph based on their similarity to each other. HNSW shows strong search performance across a variety of ann-benchmarks datasets, and also did well in our own testing. Another benefit of HNSW is that it’s widely used in industry, having been implemented in several different systems. In addition to the original academic paper, there are many helpful resources for learning about the algorithm's details. Although Elasticsearch ANN is currently based on HNSW, the feature is designed in a flexible way to let us incorporate different approaches in the future.
Show me the code!
To index vectors for ANN search, we need to set index: true and specify the similarity metric we’re using to compare them:
"image-vector": [0.12, 1.34, ...]
"query_vector": [-0.5, 9.4, ...],
_knn_search endpoint uses HNSW graphs to efficiently
retrieve similar vectors. Unlike exact kNN, which performs a full scan
of the data, it scales well to large datasets. Here’s an example that
_knn_search to the exact approach based on
queries on a dataset of 1 million image vectors with 128 dimensions,
averaging over 10,000 different queries:
Approach Queries Per Second Recall (k=10)
script_score 5.257 1.000
_knn_search 849.286 0.945
In this example, ANN search is orders of magnitude faster than the exact approach. Its recall is around 95%, so on average, it finds over 9 out of the 10 true nearest neighbors.
You can check on the performance of kNN search in the Elasticsearch nightly benchmarks. These benchmarks are powered by es-rally, a tool for Elasticsearch benchmarking, specifically the new dense_vector Rally track. We plan to extend Rally to report recall in addition to latency, as it’s also important to track the accuracy of the algorithm. Currently these benchmarks test a dataset of a couple million vectors, but ANN search can certainly scale beyond this with a greater index time or the addition of hardware resources.
Powered by Apache Lucene
Many of Elasticsearch’s core search capabilities are powered by the Lucene library, an open source project governed by the Apache Software Foundation. Elasticsearch ANN is no exception, and is built on an exciting new Lucene feature for storing and searching numeric vectors. This feature is the result of a great collaboration involving several developers across different organizations. Starting as a bold proposal, it quickly progressed to a working (and fast) implementation. Then came the challenge of designing the API and rounding out the feature.
Since then, the Lucene community has continued to collaborate to push the feature forward. Several developers took interest and made contributions, from redesigning names, to algorithm updates, performance improvements, and more. Lucene’s vector search capabilities are quickly expanding thanks to everyone’s efforts.
In addition to the fruitful collaboration, developing ANN in Lucene brings other major benefits. Lucene's implementation is designed at a low-level to integrate correctly with existing functionality, which allows ANN search to interact seamlessly with other Elasticsearch features. Such a deep integration would not really be possible if we depended on an external ANN library. For example, Lucene ANN transparently handles deleted documents by skipping over 'tombstones' during the graph search. It also respects all of Lucene's data compatibility guarantees, so you can be sure that vector data still works after an upgrade. Finally, the implementation written in Java just like Elasticsearch, which allows us to ensure its security and simplify memory management.
In 8.0, the
_knn_search endpoint for efficient ANN search will be
released as a “technical preview”. ANN search is a relatively new topic
not only for Elastic, but for the industry, and there are significant
open questions around how it should behave. What is the best way to
combine vector similarity scores with traditional BM25 scores?
Should kNN search support pagination? Developing ANN as its own
experimental endpoint will let us quickly iterate on and test its
behavior. We plan to ultimately integrate ANN into the
_search API once
we have solid answers for these questions. (Although
not yet GA, the
dense_vector field type was made GA in 7.6 and
continues to have a stable API.)
Some key capabilities we plan to support include ANN with filters, as well as “hybrid” search where ANN results are combined with those from a traditional query. We’re also working to improve indexing speed, as building HNSW graphs can be an expensive operation. We think of this release as just a beginning, and look forward to improving ANN search over the upcoming releases. Your feedback is really valuable, and helps shape the direction of the feature. We’d love to hear from you on GitHub and our Discuss forums (and in Lucene too)!