How we built Elasticsearch simdvec to make vector search one of the fastest in the world



Elasticsearch simdvec is the engine behind every vector distance computation in Elasticsearch. It provides hand-tuned AVX-512 and NEON kernels for every vector type Elasticsearch supports. Its bulk scoring architecture hides memory latency through explicit prefetching on x86 and interleaved loading on ARM, outperforming libraries like FAISS and jvector by up to 4x when data exceeds CPU cache. In this post, we explain why we built it, what’s inside, and how it makes Elasticsearch vector search one of the fastest in the world.

How we built Elasticsearch simdvec

Every vector search query in Elasticsearch, whether Hierarchical Navigable Small World (HNSW) traversal, inverted file (IVF) scan, or reranking pass, reduces to the same problem: computing distances between vectors, millions of times per query. Elasticsearch supports a wide range of data types and quantization strategies, from float32 to int8, bfloat16, binary, and Better Binary Quantization (BBQ). Each comes with different trade-offs between memory, throughput, and recall. Behind all of it is a single engine: simdvec.

We built simdvec to make every distance computation as fast as the hardware allows. In this post, we explain why we built it, what’s inside, and where it delivers the most impact.

Built like a race car

As Formula 1 enthusiasts, with one of us having previously worked with the Ferrari Formula 1 Team, we see a clear parallel. A Formula 1 car is designed with a single purpose: to achieve the best lap time. Engine power, aerodynamics, and chassis design only matter insofar as they contribute to that outcome. The same is true of a vector database, where indexing throughput, query latency, and recall define success.

While the end result is what matters, reaching the highest levels of performance requires each component to be at its best. It can’t just be good enough; it has to be the best in its category. Simdvec is built with that mindset, focusing on a critical part of the system: the engine. It’s a purpose-built, single instruction multiple data (SIMD) optimized kernel library that provides hand-tuned native C++ distance functions, called from Java via the Panama foreign function interface (FFI). It supports bulk scoring, cache line prefetching, and all vector types and layouts used in Elasticsearch.

That’s the engine behind every query.

Why we built our own

We started in 2023 with the Panama Vector API in Apache Lucene. It worked well for float32 dot products, but Elasticsearch's needs quickly outgrew what it could provide. Elasticsearch supports a wide range of quantized vector types: int8, int4, bfloat16, single-bit, and asymmetric BBQ. Each has different SIMD strategies, packing layouts, and accumulator requirements. Beyond type coverage, Elasticsearch's scoring paths demand more than single-pair throughput: HNSW needs to score several graph neighbors in one pass, IVF needs bulk scoring of thousands of candidates with prefetching, and disk-based scoring needs to work directly on mmap'd memory without copying. We looked at what was available, and nothing covered the full set.

So we built simdvec: hand-tuned native C++ kernels called from Java via FFI, with bulk scoring, prefetching, and support for every vector type Elasticsearch uses. By owning the library, we control the full stack. When we add a new quantization type like BBQ, it gets a tuned SIMD kernel wired all the way through the system. We don't wait for an upstream library to support it, and we don't compromise on performance for any type. Every vector query in Elasticsearch, whether HNSW, IVF, reranking, or hybrid, runs on this engine, built around the operations and types we actually use.

Simdvec has separate native libraries for x86 and ARM, each with multiple instruction set architecture (ISA) tiers selected at startup. The call overhead from Java via FFI is very low, at single-digit nanoseconds.
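Concretely, ISA-tier selection can be pictured as a function pointer bound once at startup. The sketch below is not simdvec's actual dispatch code; it is a minimal illustration using the GCC/Clang `__builtin_cpu_supports` builtin, with a scalar kernel standing in for the hypothetical `dot_avx512` and `dot_avx2` tiers.

```cpp
#include <cstddef>

// Scalar fallback kernel; real tiers would be AVX-512/AVX2 (x86) or NEON
// (aarch64) variants compiled into the per-platform native library.
static float dot_scalar(const float* a, const float* b, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

using dot_fn = float (*)(const float*, const float*, std::size_t);

// Pick the best tier the running CPU supports, once at startup. The scalar
// kernel stands in for each tier here; this sketches only the dispatch shape,
// not simdvec's actual selection mechanism.
static dot_fn select_dot_kernel() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx512f")) return dot_scalar; // would be dot_avx512
    if (__builtin_cpu_supports("avx2"))    return dot_scalar; // would be dot_avx2
#endif
    return dot_scalar; // portable fallback (also the aarch64 path in this sketch)
}
```

Binding the pointer once keeps the per-call dispatch cost at a single indirect call, which is what keeps the total Java-to-kernel overhead in the single-digit-nanosecond range.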

The landscape

We're not the only ones building SIMD-optimized vector distance kernels. The ecosystem is rich, and we wanted to understand how simdvec performs. Not to rank projects, but to provide context and explain where Elasticsearch's engine sits. We selected three projects as reference points, each representing a different approach:

  • jvector: A Java approximate nearest neighbor (ANN) library that uses the Panama Vector API for vectorized distance computation, with optional native C acceleration on x86.
  • FAISS: A widely deployed open source vector search framework, with hand-tuned AVX2/AVX-512 kernels.
  • NumKong (formerly SimSIMD): A comprehensive suite of over 2,000 hand-tuned SIMD kernels spanning distance functions, matrix operations, and geospatial computation.

Each project serves a different purpose and makes different trade-offs. We include reference numbers from them to give context for simdvec's performance on the specific operations that Elasticsearch needs.

How we measure

The simdvec and jvector benchmarks are written in Java with JMH, the standard JVM microbenchmark harness, with FFI overhead included. For NumKong benchmarks and FAISS benchmarks, we wrote small C/C++ harnesses using Google Benchmark, which is the standard C++ microbenchmark framework. Both frameworks report nanoseconds per operation with warmup and iteration calibration. We verified via hardware performance counters that all libraries are using SIMD on both platforms. All the benchmark code is publicly available in the linked GitHub repositories (and, in the case of simdvec, in the elasticsearch repository).

Software: JDK 25.0.2, JMH 1.37, GCC 14, Google Benchmark (latest).

One vector at a time

The most fundamental operation in vector search is computing the distance between two vectors. Every HNSW neighbor evaluation, every IVF candidate score, every reranking comparison reduces to this inner loop.

We measured single-pair throughput at 1024 dimensions on both platforms, starting with float32, the baseline type and the one where the ecosystem is most competitive. We compare simdvec against FAISS and jvector; we excluded NumKong as it uses float64 accumulators for float32, making it 3.2x-5.3x slower (depending on platform), prioritizing numerical precision over throughput. To keep the comparison like-for-like, we benchmark NumKong on int8 instead, where it uses the same accumulator strategy as simdvec.
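To see why the accumulator choice matters, here is a minimal C++ illustration (not NumKong's or simdvec's code) of what a float32 accumulator loses at large magnitudes, and what float64 accumulation buys at the cost of throughput:

```cpp
// With a float32 accumulator, adding 1.0f at magnitude 2^24 is lost to
// rounding, because float32 can no longer represent odd integers there.
// A float64 accumulator preserves it. This is the precision/throughput
// trade-off behind NumKong's float64 accumulation for float32 inputs.
inline bool float_accumulator_drops_increment() {
    float f = 16777216.0f;   // 2^24: last float32 magnitude with unit spacing
    f += 1.0f;               // 16777217 is not representable; rounds back to 2^24
    double d = 16777216.0;
    d += 1.0;                // 16777217.0: preserved at float64 precision
    return f == 16777216.0f && d == 16777217.0;
}
```

For int8 inputs both libraries widen products into int32 accumulators, so the int8 comparison is like-for-like.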

On x86, FAISS AVX-512 is the fastest single-pair kernel at 23 ns. Simdvec AVX-512 follows at 28 ns, a gap that reflects the FFI call overhead. Both use 512-bit FMA with multi-accumulator unrolling. At the AVX2 level, the two are much closer, 36 ns and 39 ns respectively, both constrained by the 256-bit register and memory load widths. jvector lands at 44 ns using the Java Panama Vector API. Panama generates good SIMD code, but hand-tuned C++ intrinsics retain an edge.
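The multi-accumulator idea can be sketched in portable scalar C++. This is not the shipped kernel, only its structure: the real AVX-512 version applies the same pattern to 512-bit FMA vectors instead of scalars.

```cpp
#include <cstddef>

// Multi-accumulator unrolling: four independent sums break the dependency
// chain of a single accumulator, so successive multiply-adds can overlap
// in the pipeline instead of serializing on one register.
float dot_unrolled(const float* a, const float* b, std::size_t n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i] * b[i];         // each accumulator forms its own
        s1 += a[i + 1] * b[i + 1]; // dependency chain, so the four
        s2 += a[i + 2] * b[i + 2]; // multiply-adds can execute
        s3 += a[i + 3] * b[i + 3]; // back to back
    }
    for (; i < n; ++i) s0 += a[i] * b[i]; // scalar tail for leftover dims
    return (s0 + s1) + (s2 + s3);
}
```

With a single accumulator, each add must wait for the previous one; with four, the FMA units stay busy every cycle, which is where most of the single-pair throughput comes from.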

On ARM, simdvec leads at 70 ns, well ahead of jvector at 110 ns and FAISS at 156 ns. Simdvec has hand-tuned NEON kernels for aarch64. Jvector has no native ARM code and relies on Panama. FAISS relies on compiler auto-vectorization rather than explicit NEON intrinsics, which accounts for the wider gap. This reflects a practical advantage of owning the kernel library: when Elasticsearch expanded to Graviton, we added purpose-built NEON kernels. Neither jvector nor FAISS has prioritized ARM native code to the same degree.

But Elasticsearch doesn't only score float32. Int8 quantization reduces memory by 4x, bfloat16 by 2x, and BBQ by 32x. Each type needs its own SIMD strategy, and simdvec provides hand-tuned native kernels for all of them.

Of the libraries we compared, only NumKong has comparable kernels for int8. We measured int8 dot product, squared Euclidean, and cosine at 1024 dimensions.

Int8 single-pair scoring (1024 dimensions, ns/op; lower is better)

On both architectures, NumKong is equal or faster at small-to-medium dimensions, a difference largely due to lower call overhead (a direct C call versus Java FFI). As dimension grows, simdvec's more efficient kernel implementation (which uses cascade unrolling) amortizes the call cost, and the gap closes and eventually reverses. The crossover falls between 768 and 1536 dimensions, depending on the function and architecture.
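As a rough illustration of the int8 path (not simdvec's actual cascade-unrolled kernel), a widening dot product accumulates int8 products in int32, the same accumulator strategy both libraries use:

```cpp
#include <cstddef>
#include <cstdint>

// Widening int8 dot product: each int8*int8 product fits comfortably in
// int32, so the sums accumulate exactly. SIMD kernels (e.g. VPDPBUSD on
// x86, SDOT on NEON) perform the same widening on whole lanes at once;
// two accumulators here stand in for the deeper unrolling of the real code.
int32_t dot_i8(const int8_t* a, const int8_t* b, std::size_t n) {
    int32_t s0 = 0, s1 = 0;
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        s0 += int32_t(a[i]) * int32_t(b[i]);
        s1 += int32_t(a[i + 1]) * int32_t(b[i + 1]);
    }
    for (; i < n; ++i) s0 += int32_t(a[i]) * int32_t(b[i]); // odd tail
    return s0 + s1;
}
```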

Despite the slightly higher overhead of Java FFI, simdvec is on par with highly optimized C/C++ libraries. It is the only library with optimized kernels for both float32 and int8; it leads on ARM and is only slightly behind FAISS on x86 (for float32), and it is very close to NumKong on both architectures (for int8). For bfloat16, int4, binary, and BBQ, alternatives exist, but simdvec distinguishes itself through hand-tuned SIMD tailored to each type's data layout.

But a production search engine doesn’t score one vector at a time; it scores thousands per query. The next question is what happens at that scale.

Thousands at a time

Single-pair performance is only part of the picture. What matters in practice is how systems behave under load. A single HNSW query may score hundreds of graph neighbors. An IVF scan may score thousands of posting list entries. A reranking pass may score tens of thousands of candidates. Single-pair throughput matters, but what matters more is how fast you can score many vectors, and how gracefully performance degrades as the working set spills out of CPU caches.

Simdvec provides bulk scoring for every data type. These aren't just loops over single-pair kernels; they use multi-accumulator inner loops that load the query vector once per dimension stride and share it across multiple document vectors, with explicit cache-line prefetching for the next batch. Neither jvector nor FAISS offers an equivalent at the time of writing. Jvector has no bulk API, so callers score one pair at a time in a loop. FAISS exposes fvec_inner_products_ny, which is currently implemented as a loop over its single-pair distance function, with no query amortization or prefetching.
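The bulk-scoring structure can be sketched in portable C++ (the shipped kernels do the same with SIMD loads): each query element is loaded once and shared across four document vectors, while `__builtin_prefetch` hints the hardware toward the next batch. The four-way interleaved inner loop is also the shape of the ARM memory-level-parallelism strategy described below.

```cpp
#include <cstddef>

// Bulk dot products of one query against many documents. The query element
// q[i] is loaded once per stride and feeds four documents (amortization),
// and the next batch of document pointers is prefetched while the current
// batch is computed. Scalar sketch only; not the shipped kernel.
void bulk_dot4(const float* q, const float* const docs[], std::size_t count,
               std::size_t dims, float* out) {
    std::size_t d = 0;
    for (; d + 4 <= count; d += 4) {
        // Hint the hardware to start pulling the next batch toward L1.
        for (std::size_t k = 0; k < 4 && d + 4 + k < count; ++k)
            __builtin_prefetch(docs[d + 4 + k]);
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (std::size_t i = 0; i < dims; ++i) {
            float qi = q[i];            // one query load feeds four docs
            s0 += qi * docs[d][i];      // four independent memory streams
            s1 += qi * docs[d + 1][i];
            s2 += qi * docs[d + 2][i];
            s3 += qi * docs[d + 3][i];
        }
        out[d] = s0; out[d + 1] = s1; out[d + 2] = s2; out[d + 3] = s3;
    }
    for (; d < count; ++d) {            // remainder, one vector at a time
        float s = 0;
        for (std::size_t i = 0; i < dims; ++i) s += q[i] * docs[d][i];
        out[d] = s;
    }
}
```

A loop over a single-pair kernel reloads the query for every document and gives the memory system only one outstanding stream; this structure is what separates bulk scoring from such a loop once data no longer fits in cache.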

Float32. To measure the impact at the kernel level, we scored a single query against increasing numbers of 1024-dimensional float32 document vectors, using random access patterns that simulate HNSW-like scattered graph-neighbor lookups. The three dataset sizes (32, 625, and 32,500 vectors) are chosen so the working set exceeds the L1, L2, and L3 cache, respectively.

When the data fits in cache, simdvec is the fastest on both platforms, but the margins are modest since kernel arithmetic dominates. The real separation appears as the working set grows beyond L3. On x86, simdvec scores at 95 ns per vector, while FAISS needs 165 ns and jvector 412 ns. On ARM, the pattern is the same: simdvec holds at 162 ns, while FAISS climbs to 347 ns and jvector to 476 ns. The prefetching and query amortization in simdvec keep memory latency hidden in a way that a simple loop over single-pair kernels cannot match, and the advantage widens precisely where real search workloads operate: deep in main memory.

Int8. The same pattern holds for quantized types. We measured int8 dot product bulk scoring at 1024 dimensions with dataset sizes chosen to exceed the same L1, L2, and L3 cache boundaries, comparing simdvec's bulk scoring against NumKong single-pair scoring in a loop.

On x86, simdvec is between 1.2x and 1.9x faster, driven by the combination of explicit prefetching and batch processing. On ARM, simdvec wins again (1.7x to 1.9x faster) across all dataset sizes. The advantage comes from batch processing four vectors at a time, providing memory-level parallelism via an interleaved access pattern. In both cases, the most striking result is what happens at the largest dataset size, where it matters the most.

Results for squared distance and cosine show a similar pattern, with speedups of 1.4x to 1.8x for ARM, and of 1.3x to 3.0x for x86 (details here).

When memory matters

Production vector indices typically don't fit in CPU cache. A 10M-vector int8 index at 1024 dimensions is 10GB. Scoring candidates means streaming data from DRAM, and that's where bulk scoring architecture makes the difference.

We used hardware performance counters to measure what happens inside the CPU during bulk scoring and found that hiding memory latency requires two fundamentally different strategies, one per architecture.

On x86, explicit prefetching eliminates cache misses. The bulk kernel processes vectors sequentially, one fully computed before the next, while issuing prefetch instructions for the next batch. Future data is pulled into L1 before the CPU needs it.

On ARM, the same sequential approach performed poorly, even with prefetching. Instead, the bulk kernel interleaves loads from four vectors at every stride position, giving the out-of-order engine four independent memory streams. The CPU is not fetching data any faster, but rather waiting less by always having something else to compute while memory requests are in flight. Detailed analysis can be found in this GitHub issue.

The numbers tell two different stories:

  1. On x86, prefetching turns 139K cache misses into 19K, and instructions per cycle (IPC) more than doubles. The bulk advantage grows with dataset size, from 1.2x in L2 to 2.8x beyond L3, because prefetching hides progressively more expensive DRAM round trips.
  2. On ARM, cache misses barely change. What changes is utilization: Backend stalls drop 40% because the interleaved access pattern keeps the pipeline fed. This advantage stays a consistent 1.8x regardless of dataset size, because memory-level parallelism applies whether data comes from cache or DRAM.

Two architectures, two strategies, one result: At production scale, simdvec keeps the CPU pipeline busy even when vectors are scattered across main memory.

What this means for Elasticsearch users

These kernel-level capabilities compound. A single vector search query may compute millions of distance operations: HNSW graph traversal, candidate scoring, reranking. Across thousands of concurrent queries, nanoseconds per operation translate directly to query latency and cluster throughput. Whether you use float32, int8, bfloat16, or BBQ, and whether your index is in memory or on disk, every one of those operations runs through the same engine, tuned down to the last nanosecond.

The key takeaway is that at production scale, vector search performance isn’t primarily determined by raw SIMD throughput. It’s dominated by how efficiently the system hides memory latency while sustaining compute across millions of small operations.

The simdvec kernels improve with almost every Elasticsearch release. When new quantization types and hardware platforms emerge, they get tuned kernels from day one. And existing types continue to get faster as we refine the implementations that are already shipping.
