Elasticsearch DiskBBQ: 40% faster vector scoring with native SIMD Blocks

A deep dive into how DiskBBQ's block layout, doc ID compression modes and native SIMD kernels combine to deliver 40% improved vector scoring throughput for DiskBBQ in 9.4.

Try out vector search for yourself using this self-paced hands-on learning for Search AI. You can start a free cloud trial or try Elastic on your local machine now.

Elasticsearch DiskBBQ already speeds up Better Binary Quantization (BBQ) vector operations through block-based layouts. Elasticsearch 9.4 adds native single instruction, multiple data (SIMD) kernels that improve throughput to almost 3x from single vector operations. Here’s how the format works.

DiskBBQ reads vector bytes directly from memory-mapped index files using dense, sequential block layouts. This keeps heap usage low and allows the engine to operate on datasets much larger than available RAM. Within these block layouts, DiskBBQ encodes doc ID blocks to reduce disk space while keeping decoding costs low. Here we walk through the posting list layout, doc ID packing modes, and why bulk scoring is fast.

How DiskBBQ stores vector posting lists on disk

At the start of each posting list, we write a centroid score correction, the vector count, and the overall doc ID encoding method. Then we write doc IDs and vectors in subsequent blocks. The blocks and the internals are in ascending doc ID order. This respects segment doc ID order, which preserves index sort order. Filters aligned with that order are more likely to include or exclude whole blocks at a time. After the doc IDs, we store the quantized vector bytes and then all the corrections for the vectors in that block.

How DiskBBQ compresses doc IDs to reduce disk space

To ensure fast decoding of doc ID blocks, DiskBBQ encodes every block with the same encoding format. DiskBBQ computes the required encoding for each block and then uses the most space-expensive encoding required by any block for the posting list, which is used for all the document blocks. At the time of writing, there are five compression options for doc ID storage. Each option starts with a full delta encoding of the doc ID values. Then one of the following encoding types is applied on top of the delta encoded doc IDs.

Encoding typeConditionBytes saved (example)
ContinuousIDs are sequential (max−min+1 == len)16 bytes → 4 bytes
Delta 16All deltas fit in 16 bits64 bytes → 33 bytes
21 bits per valueValues fit in 21 bits12 bytes → 8 bytes
24 bits per valueValues fit in 24 bits64 bytes → 48 bytes
Fallback (full int)Any other caseNo reduction

The most efficient is continuous. This is used when max(doc_block) - min(doc_block) + 1 == len(doc_block), meaning, the delta encoding only needs to worry about storing the minimum value, and the doc IDs can be reconstituted by adding one to each subsequent value. An example would be the IDs [4858192, 4858193, 4858194, 4858195]. Instead of writing four individual int values, which is 16 bytes, we only need to write a single value: 4858192.

Next is delta 16. It applies when every delta fits in 16 bits, which can be stored in two bytes. Assume we have doc_ids = [1000, 1003, 1010, 1020, 1041, 1055, 1070, 1090, 1100, 1125, 1200, 1300, 2000, 4000, 16000, 66000]. This then means our min = 1000 and results in deltas = [0, 3, 10, 20, 41, 55, 70, 90, 100, 125, 200, 300, 1000, 3000, 15000, 65000]. These 16 deltas can be packed into eight int32 values (32 bytes), plus the min value, cutting the byte usage by almost 2x.

The next step up is 21 bits per value. This results in a fairly complex scheme where each triplet set of values is compressed into 64 bit values and a tail of 3 bytes for any remaining values. A concrete example would be doc_ids=[1000, 70000, 140000], which get compressed into a single 64 bit value doc0 | (doc1 << 21) | (doc2 << 42), the end result being three raw integer values, which comprise 12 bytes, get compressed into 8 bytes.

The second-to-last compression option packs integer values that require at most 24 bits. Since an int requires 32 bits of space, 24 leaves an entire byte completely unused. We cannot leave a single byte of real estate to waste. We want to fully fill as many bytes as we can, so this scheme compresses by filling in that empty byte. For example, docs_ids = [1000, 70000, 140000, 300000, 500000, 800000, 1000000, 1300000, 1600000, 1900000, 2200000, 3000000, 4000000, 8000000, 12000000, 16000000] will compress the final four integer values into the “free byte” in the prior integers, thus storing 16 integers, which usually cost 64 bytes, into 48 bytes.

The final option is the fallback. DiskBBQ will store each doc ID as a full fidelity integer, providing no disk reduction. Given that the doc ID values are delta encoded before compression, the fallback is exceptionally rare.

Why DiskBBQ bulk scoring is faster: SIMD and CPU cache saturation

Storing the vectors in blocks allows SIMD-optimized bulk scoring. This keeps CPU cache lines saturated with vector bytes and allows quantization corrections to be applied with SIMD-optimized kernels. If vectors were stored inline with their corrections, after each vector score, the corrections would have to be applied. This loses valuable throughput and any opportunity for optimizing the correction application.

Here’s a benchmark showing how the throughput improves with different optimizations. This is a JMH benchmark run on an M1Max MacBook with Java 25. The vector dimensions were 1024, and each operation in the benchmark was 10 queries executed over 10 blocks of 32 vectors.

Here’s a description of each of the benchmarks run above:

  • Float32Scalar: This is the pure JVM doing floating point operations. No hand-optimized SIMD.
  • Float32PanamaVector: Here some SIMD-optimized code paths are actually written and used.
  • BBQIndividual: These are the individual bit-wise BBQ operations. Each vector is taken onto the JVM heap individually and scored and corrected.
  • BBQBulkPartial: This is off-heap bulk scoring with Panama Vector operations reading directly from MMAP’d file segments. The corrections are then applied on the JVM Heap.
  • BBQBULK: This is full off-heap bulk scoring where vectors and corrections are SIMD-optimized Java Vector API functions reading directly from MMAP files.
  • BBQBulkNative: This is what’s in Elasticsearch 9.4. Full native bulk vector operations reading bytes directly from the index files.

These results show the evolution of throughput, starting with the bare minimum of individual floating point operations. Switching to SIMD (hand-optimized with the Vector API) for floating point increases throughput ~3x, but even then, it's slower than the auto-vectorized individual bit-wise operations in BBQ. Then, the switch to bulk scoring almost increases BBQ throughput by 2x. Adding our new optimized native SIMD kernels in Elasticsearch 9.4, we get yet another significant improvement, adding up to almost 3x improvement from individual bitwise scoring and an incredible 66x improvement over float32 operations.

DiskBBQ vector search performance: what native SIMD means in practice

Native SIMD kernels in Elasticsearch 9.4 deliver nearly 3x faster BBQ vector scoring and 66x the throughput of float32 operations. We aren’t done; there are always ways to improve, but this native SIMD block scoring pattern works well on all CPU architectures, allowing DiskBBQ to be fast no matter where it's deployed.

Wie hilfreich war dieser Inhalt?

Nicht hilfreich

Einigermaßen hilfreich

Sehr hilfreich

Zugehörige Inhalte

Sind Sie bereit, hochmoderne Sucherlebnisse zu schaffen?

Eine ausreichend fortgeschrittene Suche kann nicht durch die Bemühungen einer einzelnen Person erreicht werden. Elasticsearch wird von Datenwissenschaftlern, ML-Ops-Experten, Ingenieuren und vielen anderen unterstützt, die genauso leidenschaftlich an der Suche interessiert sind wie Sie. Lasst uns in Kontakt treten und zusammenarbeiten, um das magische Sucherlebnis zu schaffen, das Ihnen die gewünschten Ergebnisse liefert.

Probieren Sie es selbst aus