Elasticsearch DiskBBQ: 40% faster vector search with native SIMD


Elasticsearch DiskBBQ already speeds up Better Binary Quantization (BBQ) vector operations through block-based layouts. Elasticsearch 9.4 adds native single instruction, multiple data (SIMD) kernels that raise throughput to almost 3x that of scoring vectors one at a time. Here’s how the format works.

DiskBBQ reads vector bytes directly from memory-mapped index files using dense, sequential block layouts. This keeps heap usage low and allows the engine to operate on datasets much larger than available RAM. Within these block layouts, DiskBBQ encodes doc ID blocks to reduce disk space while keeping decoding costs low. Here we walk through the posting list layout, doc ID packing modes, and why bulk scoring is fast.

How DiskBBQ stores vector posting lists on disk

At the start of each posting list, we write a centroid score correction, the vector count, and the overall doc ID encoding method. Then we write doc IDs and vectors in subsequent blocks. The blocks, and the entries within each block, are in ascending doc ID order. This respects segment doc ID order, which preserves index sort order, so filters aligned with that order are more likely to include or exclude whole blocks at a time. Within a block, the doc IDs come first, followed by the quantized vector bytes and then all the corrections for the vectors in that block.

Diagram showing the structure of a posting list with labeled sections for metadata, document IDs, quantized vectors, and correction columns. The top box lists centroid correction, vector count, and document encoding. Below it, block 0 includes docIDs, quantized vectors, and correction columns labeled lower, upper, comp_sum, and additional. A dashed box at the bottom indicates continuation through block N and tail. — The layout of a posting list. Starting with a metadata header and then each block. The blocks have `bulk_size` doc IDs, and then quantized vector bytes, followed up by the quantized correction values.
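To make the ordering concrete, here is a minimal Python sketch of the logical write order described above. It is only an illustration: the `Block` class, the `layout_order` function, and the part labels are hypothetical names, and the sketch models ordering only, not the real byte format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Column names taken from the figure; everything else here is hypothetical.
CORRECTION_COLUMNS = ("lower", "upper", "comp_sum", "additional")


@dataclass
class Block:
    doc_ids: List[int]         # ascending doc IDs
    vectors: List[bytes]       # quantized vector bytes, one entry per doc
    corrections: List[Tuple]   # one (lower, upper, comp_sum, additional) per doc


def layout_order(centroid_correction, encoding, blocks):
    """Return the parts of one posting list in the order they are written."""
    parts = [
        ("centroid_correction", centroid_correction),
        ("vector_count", sum(len(b.doc_ids) for b in blocks)),
        ("doc_id_encoding", encoding),
    ]
    for i, block in enumerate(blocks):
        parts.append((f"block{i}.doc_ids", block.doc_ids))
        # All vector bytes of the block are contiguous.
        parts.append((f"block{i}.vectors", b"".join(block.vectors)))
        # Corrections follow as columns, one column per correction value.
        for name, column in zip(CORRECTION_COLUMNS, zip(*block.corrections)):
            parts.append((f"block{i}.{name}", list(column)))
    return parts
```

Keeping vectors and corrections in separate runs is what later lets scoring and correction each work over a whole block at once.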

How DiskBBQ compresses doc IDs to reduce disk space

To ensure fast decoding of doc ID blocks, DiskBBQ encodes every block in a posting list with the same encoding format. It computes the encoding each block requires and then applies the most space-expensive of those encodings to all blocks in the posting list. At the time of writing, there are five compression options for doc ID storage. Each option starts with a full delta encoding of the doc ID values. Then one of the following encoding types is applied on top of the delta encoded doc IDs.

Encoding type	Condition	Example size (before → after)
Continuous	IDs are sequential (max−min+1 == len)	16 bytes → 4 bytes
Delta 16	All deltas fit in 16 bits	64 bytes → 33 bytes
21 bits per value	Values fit in 21 bits	12 bytes → 8 bytes
24 bits per value	Values fit in 24 bits	64 bytes → 48 bytes
Fallback (full int)	Any other case	No reduction
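The selection rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual writer: the `Encoding` enum and both function names are hypothetical, blocks are assumed to hold unique ascending IDs, and the 21-bit and 24-bit checks are applied to the values as given, matching the examples in this article.

```python
from enum import IntEnum


class Encoding(IntEnum):
    # Ordered from cheapest to most space-expensive, as in the table above.
    CONTINUOUS = 0
    DELTA_16 = 1
    BITS_21 = 2
    BITS_24 = 3
    FULL_INT = 4


def block_encoding(doc_ids):
    """Cheapest encoding that can represent one ascending block."""
    lo, hi = doc_ids[0], doc_ids[-1]
    if hi - lo + 1 == len(doc_ids):
        return Encoding.CONTINUOUS
    if hi - lo <= 0xFFFF:          # every delta from min fits in 16 bits
        return Encoding.DELTA_16
    if hi < (1 << 21):
        return Encoding.BITS_21
    if hi < (1 << 24):
        return Encoding.BITS_24
    return Encoding.FULL_INT


def posting_list_encoding(blocks):
    """One encoding for the whole posting list: the most expensive one needed."""
    return max(block_encoding(b) for b in blocks)
```

Using a single encoding per posting list trades a little disk space for a decoder that never has to switch formats between blocks.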

The most efficient is continuous. This is used when max(doc_block) - min(doc_block) + 1 == len(doc_block). In that case only the minimum value needs to be stored, and the doc IDs can be reconstituted by adding one to each subsequent value. An example would be the IDs [4858192, 4858193, 4858194, 4858195]. Instead of writing four individual int values, which is 16 bytes, we only need to write a single value: 4858192.

Diagram showing continuous document identifiers represented as blue boxes labeled doc0 through doc3, followed by an arrow pointing to green boxes labeled VInt and byte. Text above indicates conversion from continuous doc_ids to a single variable integer value for the minimum document ID. — Continuous encoding; only needs to write a single integer.
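As a rough Python sketch of the idea, the block below stores only the minimum and rebuilds the run on read. The function names are hypothetical, and `vint` assumes a Lucene-style variable-length integer (7 payload bits per byte, low-order bits first), which is an assumption about the on-disk form rather than something confirmed here.

```python
def vint(value):
    """Variable-length int: 7 payload bits per byte, high bit = 'more follows'."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_continuous(doc_ids):
    """Keep only the minimum; valid only when the IDs form an unbroken run."""
    assert doc_ids[-1] - doc_ids[0] + 1 == len(doc_ids)
    return vint(doc_ids[0])


def decode_continuous(min_id, count):
    """Rebuild the run by adding one for each subsequent doc."""
    return [min_id + i for i in range(count)]
```

Under that VInt assumption, the example minimum 4858192 takes 4 bytes, which lines up with the 16 bytes → 4 bytes row in the table.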

Next is delta 16. It applies when every delta fits in 16 bits, which can be stored in two bytes. Assume we have doc_ids = [1000, 1003, 1010, 1020, 1041, 1055, 1070, 1090, 1100, 1125, 1200, 1300, 2000, 4000, 16000, 66000]. Here min = 1000, and subtracting it from each ID gives deltas = [0, 3, 10, 20, 41, 55, 70, 90, 100, 125, 200, 300, 1000, 3000, 15000, 65000]. These 16 deltas can be packed into eight int32 values (32 bytes), plus the min value, cutting the byte usage by almost 2x.

Diagram showing a data encoding process with three color‑coded sections. On the left, green boxes labeled min, VInt, and ellipsis represent writing a minimum document value. In the center, two rows of blue boxes are marked as source delta bytes. Each row has an arrow pointing to orange boxes on the right with the same labels, illustrating packing of two 16‑bit deltas into each 32‑bit integer. — Delta 16; writes the minimum value and then packs two 16-bit deltas into each 32-bit integer.
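The packing step can be sketched in Python as follows. The function names are hypothetical, and pairing adjacent deltas (first in the high half, second in the low half) is an illustrative choice; the real writer may pair values differently.

```python
def pack_delta16(doc_ids):
    """Return (min, words) with two 16-bit deltas packed per 32-bit word."""
    lo = doc_ids[0]
    deltas = [d - lo for d in doc_ids]
    assert len(deltas) % 2 == 0, "sketch handles even-sized blocks only"
    assert all(0 <= d <= 0xFFFF for d in deltas), "a delta needs more than 16 bits"
    words = [(deltas[i] << 16) | deltas[i + 1] for i in range(0, len(deltas), 2)]
    return lo, words


def unpack_delta16(lo, words):
    """Split each word back into two deltas and add the minimum back."""
    doc_ids = []
    for w in words:
        doc_ids.append(lo + (w >> 16))
        doc_ids.append(lo + (w & 0xFFFF))
    return doc_ids
```

For the 16 example IDs this yields eight 32-bit words, or 32 bytes, plus the stored minimum.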

The next step up is 21 bits per value. This results in a fairly complex scheme where each group of three values is packed into one 64-bit value, with a tail of 3 bytes for any remaining values. A concrete example would be doc_ids = [1000, 70000, 140000], which get packed into a single 64-bit value doc0 | (doc1 << 21) | (doc2 << 42). The end result is that three raw integer values, which comprise 12 bytes, are stored in 8 bytes.

Diagram showing blue boxes labeled with letter‑number ranges representing source list ranges. Arrows point to two sets of purple boxes labeled encoded set #0 bytes and encoded set #1 bytes, each containing grouped values like A20..13 and B20..13. Notes on the right explain that the byte blocks are not byte‑aligned and that tail parts of three documents are packed into one 64‑bit word. — Twenty-one bits per value, a fairly complex scheme combining triplet compressed 64-bit values with a 3-byte tail.
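Here is a small Python sketch of the triplet packing. The function names are hypothetical, and two details are assumptions: the little-endian byte order, and the reading of the tail as 3 bytes per leftover value.

```python
MASK21 = (1 << 21) - 1


def pack21(a, b, c):
    """Pack three values of at most 21 bits each into one 64-bit word."""
    assert 0 <= min(a, b, c) and max(a, b, c) <= MASK21
    return a | (b << 21) | (c << 42)


def unpack21(word):
    """Recover the three 21-bit values from one packed word."""
    return word & MASK21, (word >> 21) & MASK21, (word >> 42) & MASK21


def encode21(values):
    """Full triplets become 8-byte words; leftovers are written as 3 bytes each."""
    out = bytearray()
    full = len(values) - len(values) % 3
    for i in range(0, full, 3):
        out += pack21(*values[i:i + 3]).to_bytes(8, "little")
    for v in values[full:]:
        out += v.to_bytes(3, "little")
    return bytes(out)
```

Three 21-bit values use 63 of the 64 bits in a word, so only one bit per triplet goes unused.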

The second-to-last compression option packs integer values that require at most 24 bits. Since an int requires 32 bits of space, 24 leaves an entire byte completely unused. We cannot leave a single byte of real estate to waste. We want to fully fill as many bytes as we can, so this scheme compresses by filling in that empty byte. For example, doc_ids = [1000, 70000, 140000, 300000, 500000, 800000, 1000000, 1300000, 1600000, 1900000, 2200000, 3000000, 4000000, 8000000, 12000000, 16000000] will pack the final four integer values into the “free byte” of the prior twelve integers, thus storing 16 integers, which usually cost 64 bytes, in 48 bytes.

Diagram showing two rows of colored boxes representing source value bytes and their reordered form in an encoded integer. The top row is labeled “source value bytes,” and the bottom row “encoded int #0,” illustrating how multiple byte groups are packed together for 24‑bit values. — This shows an example 24 bits per value where the final set of bytes is packed in the first. It’s taking advantage of that free byte of real estate.
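The free-byte trick can be sketched in Python like this. The function names are hypothetical, and the exact assignment of bytes to carrier integers is an illustrative assumption; only the idea of spreading the final quarter of the values across the unused top bytes comes from the text.

```python
def pack24(values):
    """Store 4k values of at most 24 bits each in 3k 32-bit words."""
    assert len(values) % 4 == 0, "sketch handles multiples of four only"
    assert all(0 <= v < (1 << 24) for v in values)
    k = len(values) // 4
    carriers, donors = values[:3 * k], values[3 * k:]
    words = []
    for i, carrier in enumerate(carriers):
        # Carrier i holds byte (i % 3) of donor (i // 3) in its free top byte.
        donor_byte = (donors[i // 3] >> (8 * (i % 3))) & 0xFF
        words.append(carrier | (donor_byte << 24))
    return words


def unpack24(words):
    """Inverse of pack24: strip the top bytes and reassemble the donors."""
    carriers = [w & 0xFFFFFF for w in words]
    donors = []
    for j in range(len(words) // 3):
        value = 0
        for b in range(3):
            value |= (words[3 * j + b] >> 24) << (8 * b)
        donors.append(value)
    return carriers + donors
```

With the 16 example IDs, `pack24` returns 12 words, which is the 48 bytes quoted above.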

The final option is the fallback. DiskBBQ will store each doc ID as a full fidelity integer, providing no disk reduction. Given that the doc ID values are delta encoded before compression, the fallback is exceptionally rare.

Why DiskBBQ bulk scoring is faster: SIMD and CPU cache saturation

Storing the vectors in blocks allows SIMD-optimized bulk scoring. This keeps CPU cache lines saturated with vector bytes and allows quantization corrections to be applied with SIMD-optimized kernels. If vectors were stored inline with their corrections, the corrections would have to be applied after scoring each vector. That costs valuable throughput and removes any opportunity to optimize how corrections are applied.
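The difference in control flow can be sketched in plain Python. This shows only the loop structure, not real SIMD: the function names are hypothetical, `popcount_dot` is a stand-in bit-wise score, and the correction formula is deliberately left to a caller-supplied `correct` function because the actual BBQ correction math is not covered here.

```python
def popcount_dot(query_bits, vector_bits):
    """Stand-in bit-wise score: number of bits set in both operands."""
    return bin(query_bits & vector_bits).count("1")


def score_individual(query_bits, records, correct):
    """Inline layout: score one vector, correct it, move on to the next."""
    return [correct(popcount_dot(query_bits, vec), *corr) for vec, corr in records]


def score_bulk(query_bits, vectors, lower, upper, comp_sum, additional, correct):
    """Block layout: one pass over vector bytes, then one pass over columns."""
    # Pass 1 touches only the contiguous vector bytes of the block.
    raw = [popcount_dot(query_bits, vec) for vec in vectors]
    # Pass 2 walks the correction columns in lockstep; in a SIMD kernel this
    # is where several scores would be corrected per instruction.
    return [
        correct(r, lo, up, cs, ad)
        for r, lo, up, cs, ad in zip(raw, lower, upper, comp_sum, additional)
    ]
```

Both paths produce the same scores; the bulk path simply groups like work together so each pass can run over dense, uniform data.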

Here’s a benchmark showing how throughput improves with each optimization. This is a JMH benchmark run on an M1 Max MacBook with Java 25. The vectors had 1024 dimensions, and each benchmark operation ran 10 queries over 10 blocks of 32 vectors.

Here’s a description of each benchmark:

Float32Scalar: This is the pure JVM doing floating point operations. No hand-optimized SIMD.
Float32PanamaVector: These are floating point operations using hand-written SIMD code paths built on the Panama Vector API.
BBQIndividual: These are the individual bit-wise BBQ operations. Each vector is taken onto the JVM heap individually and scored and corrected.
BBQBulkPartial: This is off-heap bulk scoring with Panama Vector operations reading directly from memory-mapped file segments. The corrections are then applied on the JVM heap.
BBQBulk: This is full off-heap bulk scoring where both scoring and corrections use SIMD-optimized Java Vector API functions reading directly from memory-mapped files.
BBQBulkNative: This is what’s in Elasticsearch 9.4. Full native bulk vector operations reading bytes directly from the index files.

These results show the evolution of throughput, starting with the bare minimum of individual floating point operations. Switching to SIMD (hand-optimized with the Vector API) for floating point increases throughput ~3x, but even then, it's slower than the auto-vectorized individual bit-wise operations in BBQ. The switch to bulk scoring then increases BBQ throughput by almost 2x. Adding the new native SIMD kernels in Elasticsearch 9.4 brings yet another significant improvement, for a total of almost 3x over individual bit-wise scoring and 66x over float32 operations.

Animated diagram showing the bulk scoring path in Elasticsearch DiskBBQ. It includes a query vector, document vectors labeled v0–v7, correction operations (lower, upper, sum, add), and resulting scores, illustrating how the query vector and block of document vectors pass through SIMD‑optimized correction steps to produce scores. — This shows the typical bulk scoring path. The query vector calculates initial score information for every vector in the block and then applies each correction with an optimized SIMD block operation.

Animated diagram showing the typical single scoring path. It displays a query vector, document vectors labeled v0–v7, correction steps labeled lower, upper, sum, and add, and resulting scores, reflecting how each vector is scored individually and its corrections applied sequentially. — Here’s the typical single scoring path. Each vector is scored and then its corrections applied. This means that vector bytes don’t get to saturate CPU cache, and correction applications cannot be applied with the same SIMD block operation.

DiskBBQ vector search performance: what native SIMD means in practice

Native SIMD kernels in Elasticsearch 9.4 deliver nearly 3x faster BBQ vector scoring and 66x the throughput of float32 operations. We aren’t done; there are always ways to improve, but this native SIMD block scoring pattern works well on all CPU architectures, allowing DiskBBQ to be fast no matter where it's deployed.
