HNSW vs. LSH: How Elasticsearch hits 0.99 recall@10 at 15,000 QPS — and what it costs
HNSW, quantization, and recall/speed tradeoff

At 15,000 queries per second, Elasticsearch hierarchical navigable small world (HNSW) reaches recall@10 of 0.99 on float32 vectors. With DiskBBQ quantization, it delivers 0.97 recall at 55,000 QPS — 3.7x the throughput for a 0.02 recall reduction, according to Elasticsearch Labs benchmarks. This blog explains how that is possible, including:
Why exact nearest neighbor search breaks down at scale
How approximate nearest neighbor (ANN) indices transform the problem
How HNSW works layer by layer
How quantization cuts memory without destroying recall
What all of this means for the Elasticsearch mappings and queries you write today
Every section includes code, numbers, or a diagram.
Why exact nearest neighbor search does not scale
Finding the most similar vector to a query sounds straightforward: Compute the distance from the query to every vector in the dataset and return the closest one. For 1,000 vectors at 768 dimensions, that is about 1.5 million floating-point operations per query. Fast enough.
At 10 million vectors, it’s 15 billion operations; at 1 billion vectors, it’s 750 billion operations. Query latency scales linearly with dataset size. There is no way around this for exact search.
The first instinct is to reach for a spatial index. KD-trees partition space along coordinate axes, which works well in 2–10 dimensions. A KD-tree query visits only the partitions that could plausibly contain the nearest neighbor, skipping large portions of the dataset. In practice, this brings query time from O(n) down to roughly O(log n) for low-dimensional data.
The problem: Most real embedding models produce vectors with 384 to 1,536 dimensions. In high-dimensional space, every point is roughly the same distance from every other point. This phenomenon — the curse of dimensionality — means a KD-tree partition boundary almost never rules out a subtree as a candidate. The tree traversal degrades to visiting most of the tree anyway. Empirically, KD-trees stop outperforming brute force above around 20 dimensions.
Ball-trees and R-trees have the same problem. They restructure the partitioning geometry, but the curse of dimensionality affects all exact partition-based structures in high dimensions.
The fundamental issue is that exact search and high-dimensional data are in conflict. Approximate nearest neighbor (ANN) search resolves this by accepting a small, controllable amount of recall loss in exchange for query times that are orders of magnitude faster.
How ANN indices transform the search problem
Instead of finding the exact nearest neighbor, ANN indexes find a set of results that is very likely to contain the true nearest neighbors. For most retrieval applications — semantic search, retrieval augmented generation (RAG) pipelines, and recommendation systems — results that are extremely close to the true nearest neighbors are just as useful. A cosine similarity of 0.94 is not meaningfully worse than 0.95 for document retrieval.
ANN indexes build a data structure during index construction that allows query-time traversal to skip most of the dataset while still reaching the right neighborhood. Three families of index structures dominate:
flowchart TD
A[Raw vectors] --> B[Index construction]
B --> C{Index type}
C --> D[HNSW graph]
C --> E[IVF clusters]
C --> F[LSH buckets]
D --> G[Query vector]
E --> G
F --> G
G --> H[Index traversal]
H --> I[Approximate top-k results]HNSW builds a multilayer proximity graph. Inverted File (IVF) indices cluster vectors and search only the most relevant clusters at query time. Locality-sensitive hashing (LSH) maps vectors to hash buckets where similar vectors are likely to land in the same bucket. Each structure trades construction time and memory for faster queries with tunable recall.
HNSW has become the dominant production choice for dense vector search. It achieves higher recall at high QPS than IVF or LSH on cosine and Euclidean similarity metrics, and its recall/speed tradeoff is tunable without rebuilding the index.
HNSW: The algorithm behind production vector search
HNSW was introduced by Malkov and Yashunin in 2016 and has been the foundation of most production-grade dense vector search systems ever since. The algorithm builds a layered graph where each layer is a subset of the full dataset, and higher layers contain fewer nodes with longer range connections.
The structure works like a highway system:
Layer 2 has a small number of nodes connected by long-range edges like motorways that let you cross large distances quickly.
- Layer 1 adds more nodes with shorter edges like regional roads.
- Layer 0 contains every node in the dataset, densely connected to its local neighbors.
A query proceeds top-down:
Enter the graph at a fixed entry point in the highest layer.
Greedily traverse to the nearest neighbor in that layer.
Descend to the next layer at the same node.
Repeat until reaching Layer 0.
In Layer 0, perform a beam search to find the approximate k nearest neighbors.
This greedy traversal is fast because each layer dramatically narrows the search region before descending. By the time the query reaches Layer 0, it is already in the right neighborhood.
Two parameters control the index construction trade-off:
m: The number of bidirectional connections each node creates when inserted. Higher m means more connections per node, which increases recall at the cost of memory and construction time. Typical values: 16-64.
- ef_construction: The size of the candidate list during graph construction. Higher values find better connections during build, increasing recall at query time, at the cost of slower indexing. Typical values: 100-500.
A third parameter, num_candidates (called ef in some implementations), controls query-time beam width in Layer 0. Increasing num_candidates increases recall at the cost of slower individual queries. This parameter can be tuned per query without rebuilding the index, making HNSW practical for applications where different use cases have different recall requirements.
One honest trade-off: HNSW indexes are memory-intensive. A float32 HNSW index over 10 million 768-dimensional vectors occupies roughly 30 GB of memory before accounting for graph structure overhead. Quantization addresses this directly.
Quantization: Smaller indexes, faster search
Quantization compresses vector data by representing each dimension with fewer bits. The recall loss depends on how aggressively you quantize, but modern methods preserve most of the recall while cutting memory by 4x to 32x.
Scalar quantization
Scalar quantization maps each float32 dimension to an int8 value. This reduces memory from 4 bytes per dimension to 1 byte — a 4x reduction. For a 768-dimensional vector, that is 3,072 bytes versus 768 bytes. At 10 million vectors, the difference is 29 GB versus 7.3 GB.
Recall loss is small because distance comparisons between int8-quantized vectors preserve relative ordering well for cosine and dot product similarity. In practice, scalar quantization at the same HNSW parameters delivers recall@10 of around 0.97–0.98 at throughput levels that float32 cannot reach.
Product quantization
Product quantization (PQ) splits each vector into sub-vectors and quantizes each independently using a learned codebook. This achieves much higher compression ratios than scalar quantization, often reducing to 1/8 or 1/16 of the original size. The trade-off is higher recall loss and a training step required before indexing.
PQ works well when memory is severely constrained and the application can tolerate recall@10 in the 0.85–0.90 range.
DiskBBQ
DiskBBQ is Elasticsearch's disk-based binary quantization. It reduces each vector dimension to a single bit, achieving a 32x reduction in vector storage. A 768-dimensional float32 vector occupies 3,072 bytes. The same vector in DiskBBQ occupies 96 bytes.
The key architectural design: DiskBBQ stores the compressed vectors on disk while keeping the HNSW graph structure in memory. Query execution uses the in-memory graph for traversal then reads compressed vectors from disk for distance comparisons. Because the vectors are binary-quantized, those disk reads are small and fast.
According to Elasticsearch Labs benchmarks, HNSW with DiskBBQ reaches 0.97 recall@10 at 55,000 QPS, compared to float32 HNSW at 0.99 recall@10 and 15,000 QPS. That is a 3.7x QPS increase for a 0.02 recall reduction with total index memory reduced by roughly 8x because the graph fits in RAM while vectors live on disk. For large datasets where DRAM cost is the binding constraint, DiskBBQ changes what is feasible.
DiskBBQ is not free. The disk I/O adds latency per query compared to a fully in-memory approach. The throughput gain comes from being able to run more queries concurrently on hardware that would otherwise run out of memory. If your dataset fits comfortably in RAM, float32 HNSW or scalar quantization will give higher per-query recall at lower latency.
LSH and KD-trees: When to consider alternatives
Locality-sensitive hashing applies a family of hash functions designed so similar inputs hash to the same bucket with high probability. A query hashes into one or more buckets and retrieves only the candidates in those buckets, skipping the rest of the dataset.
LSH has a specific niche: It performs well for Jaccard similarity (set overlap) and Hamming distance (bit-level differences). For these metrics, LSH buckets can be constructed to preserve the right notion of similarity. MinHash LSH for Jaccard and SimHash for Hamming are well-understood and widely deployed in near-duplicate detection and document deduplication.
For dense cosine or Euclidean similarity on high-dimensional float vectors — what embedding models produce — LSH performs worse than HNSW at comparable recall levels. At 40,000 QPS, LSH delivers recall@10 of around 0.87, while HNSW with scalar quantization reaches 0.97 at the same throughput. That 0.10 recall gap matters for retrieval quality in RAG pipelines and semantic search.
KD-trees are appropriate only below about 20 dimensions. They work well for 2D/3D spatial queries, geospatial lookups, and low-dimensional feature matching. Above 20 dimensions, the query time approaches linear scan and the construction cost is not worth paying.
The practical decision is to use HNSW for dense float vectors like semantic search, embeddings, and RAG. Use LSH for Jaccard and Hamming. Use brute force for datasets under 50,000 vectors where construction overhead is not justified.
HNSW in Elasticsearch
Elasticsearch exposes HNSW through the dense_vector field type. The index mapping below creates a field with float32 HNSW, setting m=16 and ef_construction=100.
PUT /my-search-index
{
"mappings": {
"properties": {
"embedding": {
"type": "dense_vector",
"dims": 768,
"index": true,
"similarity": "cosine",
"index_options": {
"type": "hnsw",
"m": 16,
"ef_construction": 100
}
},
"content": {
"type": "text"
}
}
}
}The kNN query below returns the 10 nearest neighbors from a candidate pool of 100. num_candidates is the ef parameter for Layer 0 traversal; increasing it improves recall at the cost of query latency.
GET /my-search-index/_search
{
"knn": {
"field": "embedding",
"query_vector": [0.12, -0.34, 0.56, "..."],
"k": 10,
"num_candidates": 100
},
"fields": ["content"],
"_source": false
}To enable DiskBBQ quantization, change index_options.type to bbq_hnsw:
"index_options": {
"type": "bbq_hnsw",
"m": 16,
"ef_construction": 100
}For applications where you want zero-configuration semantic search without managing embeddings or mappings manually, the semantic_text field type handles embedding generation, chunking, and kNN search automatically. You provide the inference endpoint, and Elasticsearch manages the rest:
PUT /my-docs-index
{
"mappings": {
"properties": {
"body": {
"type": "semantic_text",
"inference_id": "my-elser-endpoint"
}
}
}
}semantic_text is the right choice for teams that want semantic search in production without tuning HNSW parameters. For teams that need control over recall targets, memory budgets, or hybrid BM25 + kNN queries, the explicit dense_vector mapping gives the knobs you need.
Choosing the right ANN configuration
The right approach depends on dataset size and the memory budget available for the index.
| Dataset size | Recommended approach |
| <100K vectors | Brute force (exact search) or HNSW float32. Index construction overhead is minimal; exact search may be fast enough for the QPS you need. |
| 100K–10M vectors | HNSW float32 with m=16, ef_construction=100. Tune num_candidates upward until you hit your recall target. Add scalar quantization if memory is a constraint. |
| 10M–1B vectors | HNSW with scalar quantization (int8) or DiskBBQ. DiskBBQ is the better choice if DRAM is the bottleneck; use scalar quantization if you need lower per-query latency and can fit the index in memory. |
| >1B vectors | DiskBBQ with careful hardware sizing or a tiered architecture with hot/warm separation. At this scale, operating cost dominates and disk-based quantization is often the only practical option. |
For hybrid search applications that combine BM25 keyword scoring with kNN vector scoring, start with m=16 and num_candidates at roughly 10x your target k. Then measure recall@k on a representative query set and increase num_candidates until you hit your target. A 2x increase in num_candidates typically costs 20–40% in query latency.
Common misconceptions about ANN
ANN is not the same as kNN in machine learning
Approximate nearest neighbor (ANN) search and k-nearest neighbors (kNN) classification share the words "nearest neighbor" but describe different things. kNN classification is a supervised learning algorithm that assigns a class label to a new data point based on the majority class among its k nearest neighbors in the training set. It is a model.
ANN search is an index structure and retrieval algorithm. It finds the k-most similar vectors to a query vector in a large dataset. It is not a classifier. When Elasticsearch documentation refers to "kNN search," it means ANN-based vector retrieval — the same thing as ANN search, using the term kNN to describe the intent (find k nearest neighbors) rather than the algorithm family.
Approximate does not mean low-quality results
"Approximate" in ANN search refers to the algorithm's guarantee structure, not the quality of results in practice. A well-configured HNSW index at recall@10 = 0.97 means that on average, 9.7 of the true 10 nearest neighbors appear in the results. For semantic search, document retrieval, or RAG, a result ranked 11th in the ground truth is not meaningfully worse than a result ranked 10th.
The situations where approximation matters include reranking pipelines that depend on exact distances or scientific applications where the ground truth ranking is the metric. For the vast majority of production retrieval workloads, recall@10 of 0.95 or above is indistinguishable from exact search in downstream quality metrics.
Frequently asked questions
What is approximate nearest neighbor search? Approximate nearest neighbor (ANN) search finds vectors that are very likely to be among the true nearest neighbors to a query vector, without exhaustively comparing every vector in the dataset. ANN indexes like HNSW build a data structure during indexing that allows queries to navigate directly to the right neighborhood, skipping most of the dataset. The trade-off is a small, tunable reduction in recall in exchange for query speeds orders of magnitude faster than exact search.
How does the HNSW algorithm work? HNSW builds a multi-layer graph where higher layers contain fewer nodes with long-range connections, and the bottom layer (Layer 0) contains all nodes densely connected to their local neighbors. A query starts at a fixed entry point in the highest layer, greedily moves toward the nearest neighbor in that layer, descends to the next layer, and repeats until reaching Layer 0. This top-down traversal navigates quickly from a coarse global view to a fine local neighborhood, examining only a small fraction of the total dataset.
What is recall@k and why does it matter? Recall@k measures what fraction of the true k nearest neighbors appear in the approximate results. Recall@10 = 0.97 means that on average, 9.7 of the true 10 nearest neighbors are returned. It matters because retrieval quality in semantic search, RAG, and recommendation systems depends directly on whether the most relevant documents reach the reranking or generation stage. A recall@10 below 0.90 means one in ten relevant results is silently missing; that degrades downstream quality in ways that are hard to debug.
How does HNSW in Elasticsearch compare to OpenSearch for vector search? Elasticsearch Labs benchmarks show Elasticsearch HNSW achieves higher recall@10 than OpenSearch at equivalent QPS, with the gap widening when DiskBBQ quantization is applied. At high throughput ranges where DiskBBQ puts Elasticsearch at 40,000-55,000 QPS with 0.96-0.97 recall, OpenSearch configurations with comparable quantization fall below this recall level at the same throughput. The difference comes from how Elasticsearch separates the HNSW graph (in memory) from compressed vectors (on disk) in DiskBBQ, which keeps graph traversal fast while shrinking the total memory footprint.
What is DiskBBQ and when should I use it? DiskBBQ is Elasticsearch's disk-based binary quantization for HNSW. It compresses each vector dimension to a single bit, storing compressed vectors on disk while keeping the HNSW graph in memory. This reduces vector storage by 32x compared to float32, making much larger datasets feasible within a given hardware budget. Use DiskBBQ when your dataset is too large to fit in memory with float32 or scalar quantization, or when you need to maximize QPS by freeing RAM for graph traversal. On benchmarks, DiskBBQ delivers recall@10 of 0.97 at 55,000 QPS, versus 0.99 at 15,000 QPS for float32.
Can I combine vector search with keyword search in Elasticsearch? Yes. Elasticsearch supports hybrid search that combines BM25 keyword scoring with kNN vector scoring in a single query. Reciprocal Rank Fusion (RRF) is the recommended fusion strategy; it combines rankings from both retrievers without requiring manual score normalization. The semantic_text field type handles this automatically when used with the semantic query. For explicit control, use the knn clause alongside a query clause in the same search request, then apply rrf in the rank parameter.
What are the trade-offs of approximate nearest neighbor search? The main trade-off is recall versus speed and memory. A higher num_candidates value gives better recall but slower queries. More aggressive quantization (DiskBBQ vs. scalar vs. float32) cuts memory and increases throughput but reduces recall. HNSW itself trades higher memory and construction time for better recall at high QPS compared to LSH or IVF. In practice, the trade-offs are tunable: you can hit recall@10 = 0.97 at 55,000 QPS with DiskBBQ, or 0.99 at 15,000 QPS with float32 — the right choice depends on your dataset size, DRAM budget, and latency target.
How does LSH compare to HNSW, and when should I use LSH instead? LSH works by hashing similar vectors into the same buckets, then searching only those buckets at query time. For dense cosine or Euclidean similarity on high-dimensional float vectors, HNSW outperforms LSH: at 40,000 QPS, HNSW with scalar quantization delivers recall@10 around 0.97 versus LSH at approximately 0.87. Use LSH for Jaccard similarity (MinHash LSH) or Hamming distance (SimHash), where the hash functions are purpose-built for those metrics. For embedding-based semantic search, RAG pipelines, or recommendation systems, HNSW is the better choice.
This blog was originally published on April 17, 2024.
The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.
In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.
Elastic, Elasticsearch, and associated marks are trademarks, logos or registered trademarks of elasticsearch B.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.