Elasticsearch DiskBBQ delivers 7x faster vector search than Qdrant on network-attached storage

Elasticsearch DiskBBQ achieves up to 7x higher vector search throughput than Qdrant at comparable recall on network-attached storage. Explore the benchmark methodology and full results.


Elasticsearch DiskBBQ delivers up to 7x higher throughput than Qdrant at comparable recall, tested on network-attached persistent storage, the topology most managed-cloud deployments actually use. The gap holds at roughly 7x across recall levels from 0.93 to 0.97, up from about 2.3x at 0.87 recall. DiskBBQ keeps latency nearly flat as search breadth grows; Qdrant's latency rises sharply as hnsw_ef increases, driven by random reads of original vectors from disk during rescoring. If you're running vector search in Kubernetes or a managed cloud environment, this is what the tradeoff looks like.

Vector search is a critical foundation for large language model (LLM) applications, retrieval-augmented generation (RAG), and other AI workloads. In this benchmark, Elasticsearch achieved up to 7x higher throughput than Qdrant at comparable recall on the same storage topology. As a vector database, Elasticsearch offers strong vector search performance even when network-attached persistent storage remains on the query path.

The difference reflects how the two systems interact with disk. Elasticsearch DiskBBQ is designed to keep vector search efficient when persistent storage remains on the query path, using a compact quantized representation and limiting costly access to full-precision vectors during search. In this setup, Qdrant relies on a graph-based search path with rescoring against original vectors stored on disk. On network-attached persistent storage, that random-access cost becomes much more significant, which is why the performance gap widens as recall increases. This benchmark therefore focuses specifically on network-attached persistent storage, a common deployment model in managed cloud and Kubernetes environments.

The key pattern in the latency curve is not only the size of the gap but also its shape. Elasticsearch latency remains comparatively flat as recall increases, suggesting that higher recall doesn’t require a dramatic increase in expensive storage activity. Qdrant’s latency rises sharply as hnsw_ef increases, which is consistent with broader candidate exploration leading to more rescoring work against original vectors on disk.

Full results table

The table below shows the full parameter sweep for both Elasticsearch and Qdrant. Because the two engines expose different tuning controls for vector search, the results are reported using each engine’s full parameter key rather than attempting a one-to-one mapping between settings.

A few notes on the metrics:

  • ParamKey: The complete parameter setting used for a given run.
  • Recall: Recall@100 against a ground-truth top-100 result set for the benchmark queries. Values range from 0 to 1, and higher is better.
  • Latency_Avg: The average end-to-end latency per query measured from the benchmarking client across the full run, in milliseconds. Lower is better.
  • Latency_P95: The 95th percentile query latency, in milliseconds, showing the upper range of typical slow queries. Lower is better.
  • Throughput: The average number of queries processed per second across the full run. Higher is better.

| Engine | ParamKey | Recall | Latency_Avg (ms) | Latency_P95 (ms) | Throughput (queries/s) |
|---|---|---|---|---|---|
| qdrant | hnsw_ef=50, oversampling=1, size=100 | 0.8694 | 315.7849 | 503.4754 | 12.629 |
| elasticsearch | k=100, oversample=1, size=100, visit_percentage=1 | 0.8789 | 135.0802 | 218.494 | 29.343 |
| elasticsearch | k=100, oversample=1, size=100, visit_percentage=1.5 | 0.9123 | 127.8286 | 195.2318 | 31.1107 |
| qdrant | hnsw_ef=100, oversampling=1, size=100 | 0.9287 | 895.9933 | 1213.0448 | 4.4493 |
| elasticsearch | k=100, oversample=1, size=100, visit_percentage=2 | 0.9317 | 124.846 | 183.6314 | 31.8225 |
| elasticsearch | k=100, oversample=1, size=100, visit_percentage=2.5 | 0.9444 | 123.517 | 180.4831 | 32.1883 |
| qdrant | hnsw_ef=150, oversampling=1, size=100 | 0.9518 | 884.7236 | 1195.2603 | 4.5066 |
| elasticsearch | k=100, oversample=1, size=100, visit_percentage=3 | 0.9532 | 123.276 | 183.8379 | 32.2364 |
| elasticsearch | k=100, oversample=1, size=100, visit_percentage=3.5 | 0.9599 | 122.5559 | 184.2858 | 32.4469 |
| qdrant | hnsw_ef=200, oversampling=1, size=100 | 0.964 | 883.2114 | 1188.6597 | 4.5143 |
| elasticsearch | k=100, oversample=1, size=100, visit_percentage=4 | 0.965 | 122.7946 | 184.9058 | 32.3635 |
| elasticsearch | k=100, oversample=1, size=100, visit_percentage=4.5 | 0.9689 | 122.7062 | 182.9559 | 32.3976 |
| elasticsearch | k=100, oversample=1, size=100, visit_percentage=5 | 0.9722 | 122.5761 | 187.3536 | 32.4221 |
| qdrant | hnsw_ef=256, oversampling=1, size=100 | 0.9722 | 881.9643 | 1185.4948 | 4.5192 |
| elasticsearch | k=100, oversample=1, size=100, visit_percentage=5.5 | 0.9747 | 122.5609 | 184.5128 | 32.4176 |

Rows are sorted by achieved recall, so each Qdrant run appears next to the Elasticsearch runs closest to it in recall.

Matched comparisons at similar recall

To make the comparison fair, speedup is calculated only between configurations that achieve similar recall. This avoids comparing settings that trade off accuracy very differently.

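| Recall band | Elasticsearch recall | Elasticsearch Latency_Avg (ms) | Elasticsearch throughput (queries/s) | Qdrant recall | Qdrant Latency_Avg (ms) | Qdrant throughput (queries/s) | Throughput speedup |
|---|---|---|---|---|---|---|---|
| ~0.87 | 0.8789 | 135.0802 | 29.343 | 0.8694 | 315.7849 | 12.629 | 2.32x |
| ~0.93 | 0.9317 | 124.846 | 31.8225 | 0.9287 | 895.9933 | 4.4493 | 7.15x |
| ~0.95 | 0.9532 | 123.276 | 32.2364 | 0.9518 | 884.7236 | 4.5066 | 7.15x |
| ~0.96 | 0.9599 | 122.5559 | 32.4469 | 0.964 | 883.2114 | 4.5143 | 7.19x |
| ~0.97 | 0.9722 | 122.5761 | 32.4221 | 0.9722 | 881.9643 | 4.5192 | 7.17x |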
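
The speedup column can be recomputed from the throughput figures in the table above. The following is a minimal sketch using the published values; the tuple layout and the function name are illustrative and not part of Jingra.

```python
# Matched-recall pairs from the table above:
# (recall band, Elasticsearch throughput, Qdrant throughput), in queries per second.
MATCHED_PAIRS = [
    ("~0.87", 29.343, 12.629),
    ("~0.93", 31.8225, 4.4493),
    ("~0.95", 32.2364, 4.5066),
    ("~0.96", 32.4469, 4.5143),
    ("~0.97", 32.4221, 4.5192),
]


def throughput_speedup(es_qps: float, qdrant_qps: float) -> float:
    """Ratio of Elasticsearch throughput to Qdrant throughput."""
    return es_qps / qdrant_qps


# Round to two decimals, matching the precision used in the table.
speedups = {
    band: round(throughput_speedup(es, qd), 2) for band, es, qd in MATCHED_PAIRS
}

for band, ratio in speedups.items():
    print(f"{band}: {ratio:.2f}x")
```

The ratios land between 7.15x and 7.19x for every band from ~0.93 upward, and at 2.32x for the ~0.87 band.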

This matched-recall view is the clearest expression of the underlying systems difference. At similar recall levels, Elasticsearch delivers both lower latency and much higher throughput. The gap widens from 2.32x at about 0.87 recall to roughly 7.2x from 0.93 recall upward. The recall-throughput pattern matters because higher recall in this benchmark requires broader search. DiskBBQ absorbs that increase with relatively little additional cost, while Qdrant’s graph plus rescoring path becomes much more constrained by random access to original vectors on persistent storage.

Benchmark methodology

Jingra, the benchmarking tool used for these tests, was originally written in Python and has since been rebuilt as a Java project. For these tests, Jingra runs in a Kubernetes pod within the same cluster as the engine being measured. This helps reduce external network variability and keeps the test environment consistent across runs. For each run, Jingra executed the query set at a fixed client concurrency, recorded end-to-end client-side latency and throughput, and computed recall against a precomputed ground-truth top-100 set.
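
To make the reported metrics concrete, here is a minimal sketch of how recall@100, average latency, p95 latency, and throughput can be computed from per-query results. It is not Jingra's implementation: the function names are hypothetical, and the nearest-rank percentile is one common definition chosen here as an assumption.

```python
import math


def recall_at_k(retrieved_ids, ground_truth_ids, k=100):
    """Fraction of the ground-truth top-k that appears in the retrieved top-k."""
    truth = set(ground_truth_ids[:k])
    found = set(retrieved_ids[:k]) & truth
    return len(found) / len(truth)


def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, wall_clock_seconds):
    """Average latency, p95 latency, and throughput for one benchmark run."""
    return {
        "latency_avg": sum(latencies_ms) / len(latencies_ms),
        "latency_p95": percentile(latencies_ms, 95),
        "throughput": len(latencies_ms) / wall_clock_seconds,
    }


# Toy example: 3 of the 4 ground-truth IDs were retrieved.
print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4], k=4))
print(summarize([10.0, 20.0, 30.0, 40.0], wall_clock_seconds=2.0))
```

Throughput here is completed queries divided by wall-clock time for the run, so it already reflects the client concurrency used.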

This benchmark was intentionally run on network-attached persistent storage rather than local NVMe. For the published results, the storage used the baseline performance allocation for a 200 GiB GCP Hyperdisk Balanced volume, with no explicit IOPS or throughput provisioning. We chose this topology on purpose because it’s a relevant cloud deployment model and because it keeps storage efficiency materially on the query path.

Qdrant often performs better on local NVMe, so deployments using local NVMe should expect different results than the ones shown here. This benchmark specifically tests network-attached persistent storage because that’s a common managed-cloud deployment model and because it makes storage-path efficiency visible in end-to-end query performance.

Because Elasticsearch and Qdrant expose different query parameters for controlling vector search behavior, there’s no clean one-to-one mapping between their tuning settings. Instead of comparing equivalent parameter values directly, we use recall as the primary point of comparison. The matched comparisons in this post therefore pair configurations that achieve similar recall, rather than configurations with superficially similar parameter values.

Recall cannot be known in advance for a given parameter setting, so we sweep across a range of search configurations for each engine and then compare results at similar recall levels. In the published results, oversampling was fixed at 1 for both engines so that recall was primarily tuned via search breadth rather than rescoring expansion.

How does Elasticsearch configure vector search?

  • query_vector: The input vector used for similarity search. Elasticsearch compares this vector against the stored vectors in the field.
  • k: The number of nearest neighbors to retrieve.
  • visit_percentage: Controls how much of the DiskBBQ index (Elasticsearch’s disk-optimized vector index) is explored during the approximate search phase. Higher values usually improve recall but increase latency.
  • oversample: Controls how many extra candidate vectors are passed into rescoring relative to k. Higher values can improve recall, but usually at additional cost.
  • size: The number of hits returned in the final response.
  • _source: false: Disables returning the document _source field, reducing response size and avoiding extra retrieval overhead during benchmarking.

Example

Params

We keep k = size = 100 so the search request is aligned with the benchmark target: returning the top 100 results. To improve recall, we tune visit_percentage rather than inflating the final result count, while keeping oversample = 1 fixed across runs.
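
A minimal sketch of how such a request body might be assembled for the sweep. The field name `embedding` is hypothetical, and the exact placement of `visit_percentage` and of `oversample` (nested here under `rescore_vector`) is an assumption about the request shape rather than the configuration used in the benchmark.

```python
import json


def build_es_knn_search(query_vector, visit_percentage, k=100, oversample=1.0,
                        field="embedding"):
    """Build a search request body for one point in the visit_percentage sweep.

    `field` is a hypothetical field name; the nesting of `oversample`
    under `rescore_vector` is an assumption.
    """
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "visit_percentage": visit_percentage,
            "rescore_vector": {"oversample": oversample},
        },
        "size": k,         # k = size = 100 in the benchmark
        "_source": False,  # skip returning the document _source
    }


# A short toy vector stands in for a real 768-dimensional query embedding.
es_body = build_es_knn_search([0.1, 0.2, 0.3], visit_percentage=2.5)
print(json.dumps(es_body, indent=2))
```

Only `visit_percentage` changes between runs; everything else in the body stays fixed.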

How does Qdrant configure vector search?

  • query_vector / vector: The input vector used for similarity search. Qdrant compares this vector against the stored vectors in the collection.
  • size / limit: The number of nearest neighbor results returned in the response.
  • with_payload: false: Disables returning payload fields, reducing response size and avoiding additional retrieval overhead during benchmarking.
  • with_vector: false: Disables returning stored vectors in the response, again reducing response size and keeping the benchmark focused on search performance.
  • hnsw_ef: Controls the number of candidates explored during HNSW search. Higher values usually improve recall but increase latency. Like visit_percentage in Elasticsearch, it affects search breadth, but the two controls are engine-specific and not directly equivalent.
  • quantization.rescore: true: Enables rescoring of the candidate set using the original vectors after quantized search.
  • oversampling: Controls how many extra candidates are considered during rescoring relative to the final result count. Higher values can improve recall, but usually at additional cost.

Example

Params

We keep size = 100 so that each request is aligned with the evaluation target, in this case top 100 retrieval. Recall is then tuned by sweeping hnsw_ef, which controls how many candidates are explored during search. Higher hnsw_ef values generally improve recall but also increase latency and reduce throughput. We keep oversampling = 1 fixed across runs so that the main tuning variable is the search breadth rather than the rescoring expansion.
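
A minimal sketch of a Qdrant query body built from the parameters listed above. The structure follows Qdrant's query request format as we understand it; treat the exact shape as an assumption rather than the request Jingra sends.

```python
import json


def build_qdrant_query(query_vector, hnsw_ef, limit=100, oversampling=1.0):
    """Build a query request body for one point in the hnsw_ef sweep."""
    return {
        "query": list(query_vector),
        "limit": limit,          # size = 100 in the benchmark
        "with_payload": False,   # do not return payload fields
        "with_vector": False,    # do not return stored vectors
        "params": {
            "hnsw_ef": hnsw_ef,
            "quantization": {
                "rescore": True,            # rescore with original vectors
                "oversampling": oversampling,
            },
        },
    }


# A short toy vector stands in for a real 768-dimensional query embedding.
qdrant_body = build_qdrant_query([0.1, 0.2, 0.3], hnsw_ef=150)
print(json.dumps(qdrant_body, indent=2))
```

Only `hnsw_ef` changes between runs, which keeps search breadth as the single tuning variable.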

Cluster setup and DiskBBQ configuration

We ran the benchmark on GCP using three n4-standard-8 nodes, with each pod allocated 7 vCPUs and 26 GB of RAM, and using 200 GiB GCP Hyperdisk Balanced volumes at baseline performance allocation. The corpus contains 21 million vectors (see the Dataset section below for details and download links), which account for about 60.1 GiB of raw float vector data. With 2-bit quantization, the vector payload drops to roughly 4.0 GB (about 3.8 GiB). However, the full index footprint is much larger once graph and other index structures are included. That means the workload remains meaningfully sensitive to network-attached storage performance, especially because exact vector values still need to be read from disk during rescoring.

We chose this node size intentionally to keep the benchmark in a regime where network-attached persistent storage remains on the query path rather than allowing the full working set to remain comfortably memory-resident. Each system was therefore configured using the best-performing setup we identified for this workload within the tuning scope described in this post. In Elasticsearch, this meant bbq_disk. In Qdrant, the original vectors were stored on disk, while the 2-bit quantized representation used for approximate search was kept in RAM with always_ram: true. Because the two systems expose different search strategies and tuning controls, we compare them at matched recall rather than trying to map parameters one to one.

Elasticsearch was configured to use DiskBBQ, its disk-optimized approach for approximate nearest neighbor vector search, with 2-bit quantization. DiskBBQ uses aggressive quantization to keep the searchable index compact and then rescores with the original vectors to preserve accuracy. This helps maintain strong recall while keeping disk-based search efficient.

bbq_disk is an Elasticsearch Enterprise feature. We used it here because the goal of this benchmark was to compare the strongest disk-oriented vector search configuration available in each engine for this workload, rather than licensing tiers or default features.

We didn’t include bbq_hnsw in this comparison because the benchmark was specifically designed to evaluate disk-oriented vector search under a disk-sensitive workload.

This storage topology matters because Qdrant’s rescore step reads the original float32 vectors from disk with random access on each query. On local NVMe, those reads are much faster, and Qdrant correspondingly performs better. On network-attached persistent storage, the results are consistent with that random-read rescore path becoming a more important bottleneck. Qdrant latency rises sharply as hnsw_ef increases, while Elasticsearch remains comparatively flat across the same recall progression.

We chose 2-bit quantization because Qdrant couldn’t reach the target recall range with 1-bit binary quantization. Since the two systems expose different disk-oriented vector search strategies, we tuned each one to the strongest configuration available within its current feature set.

Both systems were configured with three shards distributed across the three nodes and with two total copies of each shard in the cluster. In Elasticsearch, number_of_shards: 3 and number_of_replicas: 1 means one primary plus one replica, for two total copies. In Qdrant, shard_number: 3 and replication_factor: 2 also means two total copies, since Qdrant’s replication factor refers to the total number of copies rather than the number of additional replicas. So although the field names differ, the effective replication level was the same in both systems.

| Setting | Elasticsearch | Qdrant |
|---|---|---|
| Shards | number_of_shards: 3 | shard_number: 3 |
| Copies | number_of_replicas: 1 (1 primary + 1 replica = 2 total) | replication_factor: 2 (2 total) |

Elasticsearch mapping
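
A minimal sketch of an index body consistent with the settings described in this post: three shards, one replica, and a 768-dimensional dense_vector field indexed with bbq_disk. The field name `embedding` is hypothetical. The similarity metric and the option that selects 2-bit quantization are left out because the post does not name them, so this is not the exact mapping used in the benchmark.

```python
import json

# Illustrative index body; "embedding" is a hypothetical field name.
es_index_body = {
    "settings": {
        "number_of_shards": 3,    # one shard per node
        "number_of_replicas": 1,  # 1 primary + 1 replica = 2 total copies
    },
    "mappings": {
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "index_options": {"type": "bbq_disk"},
            }
        }
    },
}

print(json.dumps(es_index_body, indent=2))
```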

Qdrant mapping
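
A minimal sketch of a collection definition consistent with the setup described in this post: original vectors on disk, a 2-bit quantized representation held in RAM, three shards, and a replication factor of 2. The `Cosine` distance and the `two_bits` encoding value are assumptions, since the post does not state them, so this is not the exact configuration used in the benchmark.

```python
import json

# Illustrative collection body; distance and encoding values are assumptions.
qdrant_collection_body = {
    "vectors": {
        "size": 768,
        "distance": "Cosine",  # assumption: the post does not name the metric
        "on_disk": True,       # original float32 vectors stay on disk
    },
    "shard_number": 3,
    "replication_factor": 2,   # total copies, not additional replicas
    "quantization_config": {
        "binary": {
            "encoding": "two_bits",  # assumption: 2-bit quantization setting
            "always_ram": True,      # quantized vectors kept in RAM
        }
    },
}

print(json.dumps(qdrant_collection_body, indent=2))
```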

Dataset

For this benchmark, we used the kenhktsui/wiki_dpr_e5 dataset from Hugging Face, a large-scale Wikipedia passage retrieval dataset designed for dense vector search. The corpus contains 21 million embedded passages, each represented as a 768-dimensional float32 vector, or 3,072 bytes per vector. That corresponds to about 60.1 GiB of raw vector data, before accounting for additional fields and file format overhead in the source dataset. The downloadable data.parquet file is larger at 85.2 GB for that reason.

We chose this dataset because it reflects a common production pattern in LLM, RAG, and retrieval systems: searching a large corpus of semantically embedded text while balancing recall, latency, and throughput. At 21 million vectors and roughly 60 GiB of raw vector data, it’s large enough to make disk-based vector search a relevant operating mode to evaluate.

Both engines used 2-bit quantization, reducing each vector from 3,072 bytes to 192 bytes, a 16x reduction that brings the quantized vector corpus to around 4 GB. In Qdrant, that quantized representation was kept in RAM for search, while the original vectors remained on disk. Even so, the workload remained meaningfully sensitive to network-attached storage performance because rescoring still required access to the original vectors on disk.
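
The storage figures above follow from simple arithmetic. A short sketch that reproduces them, assuming exactly 21,000,000 vectors (the post gives the count only as "21 million"):

```python
NUM_VECTORS = 21_000_000  # assumption: "21 million" taken as an exact count
DIMS = 768
FLOAT32_BYTES = 4
BITS_PER_DIM = 2

raw_bytes_per_vector = DIMS * FLOAT32_BYTES            # 768 * 4 = 3,072 bytes
quantized_bytes_per_vector = DIMS * BITS_PER_DIM // 8  # 768 * 2 / 8 = 192 bytes
compression = raw_bytes_per_vector / quantized_bytes_per_vector

raw_gib = NUM_VECTORS * raw_bytes_per_vector / 2**30
quantized_gb = NUM_VECTORS * quantized_bytes_per_vector / 10**9
quantized_gib = NUM_VECTORS * quantized_bytes_per_vector / 2**30

print(f"per vector: {raw_bytes_per_vector} B raw, {quantized_bytes_per_vector} B quantized")
print(f"compression: {compression:.0f}x")
print(f"raw corpus: {raw_gib:.1f} GiB")
print(f"quantized corpus: {quantized_gb:.2f} GB ({quantized_gib:.2f} GiB)")
```

This gives about 60.1 GiB of raw vectors and about 4.03 GB (3.76 GiB) after 2-bit quantization, which is small enough to hold in RAM on these nodes while the original vectors are not.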

You can download the dataset and query files from the links below:

Jingra and recreating the benchmark

For this benchmark, we used Jingra v0.2.3 with the configurations described in es-9.4-vs-qd-1.18-vector-search. Jingra handled data loading, query execution, parameter sweeps, and metric collection for both Elasticsearch and Qdrant, making the benchmark repeatable and easier to compare.

To reproduce the experiment, you need the published dataset, query set, engine configurations, and comparable cluster hardware. With those in place, Jingra can rerun the benchmark and generate recall, latency, and throughput measurements similar to those shown in this post.

Conclusion

At comparable recall levels, Elasticsearch DiskBBQ consistently delivered faster vector search than Qdrant in this benchmark, with higher throughput and lower latency across the recall range we tested. These results are especially notable because the comparison was made on network-attached persistent storage, where efficient storage-aware vector search becomes critical. Elasticsearch as a vector database allows organizations to achieve high recall with lower latency and higher throughput on slower persistent storage.

Just as importantly, this benchmark highlights the value of comparing engines at matched recall rather than by nominal parameter settings. Elasticsearch and Qdrant expose different controls, so the fairest comparison isn’t parameter to parameter but outcome to outcome. Across the recall range tested here, Elasticsearch maintained a clear advantage in both latency and throughput.

If you want to reproduce the experiment yourself, we’re publishing the dataset and query set used in this benchmark so others can validate the results and build on them.

