Making Lucene Faster with Vectorization and FFI/madvise

Over in Lucene-land we've been eagerly adopting features of new Java versions. These features bring Lucene closer to both the JVM and the underlying hardware, which improves performance and stability. This keeps Lucene modern and competitive.

The next major release of Lucene, Lucene 10, will require a minimum of Java 21. Let's take a look at why we decided to do this and how it will benefit Lucene.

Foreign Memory

For efficiency reasons, indices and their various supporting structures are stored outside of the Java heap - they are stored on-disk and mapped into the process's virtual address space. Until recently, the way to do this in Java was with direct byte buffers, which is exactly what Lucene has been doing.

Direct byte buffers have some inherent limitations. For example, they can address a maximum of 2GB, requiring additional structures and code to span larger sizes. Most significant, however, is the lack of deterministic closure, which we work around by calling Unsafe::invokeCleaner to effectively close the buffer and release the memory. This is, as the name suggests, inherently an unsafe operation. Lucene adds safeguards around this, but, by definition, there is still a minuscule risk of failure if memory were to be accessed after it was released.
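To make this concrete, here is a rough sketch of the kind of forced cleanup a direct-byte-buffer mapping requires. It is illustrative only, not Lucene's actual code; the class name and file path are made up.

```java
import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class UnmapSketch {
  public static void main(String[] args) throws Exception {
    Path indexFile = Path.of("example.bin"); // hypothetical file
    try (FileChannel ch = FileChannel.open(indexFile, StandardOpenOption.READ)) {
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
      // ... read from buf ...

      // There is no supported way to unmap a direct buffer deterministically,
      // so the mapping is torn down by reaching into sun.misc.Unsafe.
      Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      sun.misc.Unsafe unsafe = (sun.misc.Unsafe) f.get(null);
      unsafe.invokeCleaner(buf); // any later access to buf may crash the JVM
    }
  }
}
```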

More recently Java has added MemorySegment, which overcomes the limitations that we encounter with direct byte buffers. We now have safe deterministic closure and can address memory far beyond the previous limits. While Lucene 9.x already has optional support for a mapped directory implementation backed by memory segments, the upcoming Lucene 10 drops support for byte buffers. All this means that Lucene 10 only operates with memory segments, so it finally works with mapped memory in a safe model.
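As an illustration of the new model, the sketch below maps a file into a MemorySegment whose lifetime is tied to an Arena. It assumes the java.lang.foreign API (a preview feature in Java 21, finalized in later releases) and is not Lucene's code; the file path is made up.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MapWithSegments {
  public static void main(String[] args) throws Exception {
    Path indexFile = Path.of("example.bin"); // hypothetical file
    try (FileChannel ch = FileChannel.open(indexFile, StandardOpenOption.READ);
         Arena arena = Arena.ofShared()) {
      // Segments are long-indexed, so files larger than 2GB map in one go.
      MemorySegment segment =
          ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
      // ... read from segment ...
    } // closing the arena unmaps the segment deterministically and safely
  }
}
```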

Foreign Function

Different workloads, such as search or indexing, and different types of data, say doc values or vector embeddings, have different access patterns. As we've seen, because of the way Lucene maps its index data, interaction with the operating system page cache is crucial to performance.

Over the years a lot of effort and consideration has been given to optimizations around memory usage and the page cache. First through native JNI code that calls madvise directly, and later with a directory implementation that uses direct I/O. However, while good at the time, both these solutions are a little less than ideal. The former requires platform-specific builds and artifacts, and the latter leverages an optional JDK-specific API. For these reasons, neither solution is part of Lucene core, but instead lives further afield in the misc module. Mike McCandless has a good blog about this, from 2010!

On modern Java we can now use the Panama Foreign Function Interface (FFI) to call native library functions on the system. We use this, directly in Lucene core, to call posix_madvise from the Standard C library - all from Java, and without the need for any JNI code or non-standard features. With this we can now advise the system about the type of memory access patterns we intend to use.
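The sketch below shows roughly what such a downcall looks like with the FFI API. It is illustrative rather than Lucene's actual implementation: the advice constant's value and the size_t-as-long mapping are 64-bit Linux/glibc assumptions, the file path is made up, and in Java 21 the API still requires the preview flag.

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MadviseSketch {
  // 2 is POSIX_MADV_SEQUENTIAL on Linux/glibc; the value should really be
  // resolved per platform rather than hard-coded.
  private static final int POSIX_MADV_SEQUENTIAL = 2;

  public static void main(String[] args) throws Throwable {
    Linker linker = Linker.nativeLinker();
    // posix_madvise(void *addr, size_t len, int advice) -> int
    MethodHandle posixMadvise = linker.downcallHandle(
        linker.defaultLookup().find("posix_madvise").orElseThrow(),
        FunctionDescriptor.of(ValueLayout.JAVA_INT,
            ValueLayout.ADDRESS, ValueLayout.JAVA_LONG, ValueLayout.JAVA_INT));

    Path indexFile = Path.of("example.bin"); // hypothetical file
    try (FileChannel ch = FileChannel.open(indexFile, StandardOpenOption.READ);
         Arena arena = Arena.ofShared()) {
      MemorySegment segment =
          ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
      // Tell the OS we intend to read this mapping sequentially.
      int rc = (int) posixMadvise.invokeExact(
          segment, segment.byteSize(), POSIX_MADV_SEQUENTIAL);
      if (rc != 0) {
        throw new IOException("posix_madvise failed with error code " + rc);
      }
      // ... read from segment ...
    }
  }
}
```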

Vectorization

Parallelism and concurrency, while distinct, often translate to "splitting a task so that it can be performed more quickly", or "doing more tasks at once". Lucene is continually looking at new algorithms and striving to implement existing ones in more performant and efficient ways. One area that is now more straightforward to use in Java is data-level parallelism - the use of SIMD (Single Instruction Multiple Data) vector instructions to boost performance.

Lucene is using the latest JDK Vector API to implement vector distance computations that result in efficient hardware specific SIMD instructions. These instructions, when run on supporting hardware, can perform floating point dot product computations 8 times faster than the equivalent scalar code. This blog contains more specific information on this particular optimization.
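For a flavor of what this looks like, here is a minimal dot product sketch using the incubating jdk.incubator.vector API. It is illustrative, not Lucene's implementation (which covers more distance functions and data types), and it needs the jdk.incubator.vector module added at compile and run time.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class DotProduct {
  private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  // Computes the dot product of two equal-length float arrays using SIMD lanes.
  static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    int i = 0;
    int upperBound = SPECIES.loopBound(a.length);
    FloatVector acc = FloatVector.zero(SPECIES);
    for (; i < upperBound; i += SPECIES.length()) {
      FloatVector va = FloatVector.fromArray(SPECIES, a, i);
      FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
      acc = va.fma(vb, acc); // acc += va * vb, lane-wise
    }
    sum += acc.reduceLanes(VectorOperators.ADD);
    // Scalar tail for the elements that don't fill a full vector.
    for (; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    float[] a = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f, 9f};
    float[] b = {9f, 8f, 7f, 6f, 5f, 4f, 3f, 2f, 1f};
    System.out.println(dotProduct(a, b)); // 165.0
  }
}
```

When run on hardware with suitable vector units, the JIT compiles the loop down to the corresponding SIMD instructions; on other hardware it falls back to scalar code.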

With the move to Java 21 minimum, it is a lot more straightforward to see how we can use the JDK Vector API in more places. We're even experimenting with the possibility of calling customized SIMD implementations with FFI, since the overhead of the native call is now quite minimal.

Conclusion

While the latest Lucene 9.x releases are able to benefit from many of the recent Java features, the requirement to run on versions of Java as early as Java 11 means that we're reaching a level of complexity with 9.x that, while maybe still ok today, is not where we want to be in the future.

The upcoming Lucene 10 will be closer to the JVM and the hardware than ever before. By requiring a minimum of Java 21, we are able to drop the older direct byte buffer directory implementation, reliably advise the system about memory access patterns through posix_madvise, and continue our efforts around leveraging hardware-accelerated instructions.
