What happened in Lucene land in 2023?

2023 just came to an end, and it's been another active year for Apache Lucene development. Let's take some time to review highlights from last year.

Community

In 2023, there were:

5 minor releases (9.5, 9.6, 9.7, 9.8 and 9.9),
1 patch release (9.9.1),
1 new committer,
4 new PMC members,
620 commits from 97 unique contributors.

Vector search

The promise of truly semantic search for retrieval and retrieval-augmented generation appeals to users large and small, so it's no surprise that vector search was a major theme for Apache Lucene in 2023. More specifically, many interesting features and optimizations were added across several releases:

Support for int8 vectors. (Lucene 9.5)
Faster merging of HNSW graphs. (Lucene 9.6)
Faster indexing, merging and querying through support for vectorization (Lucene 9.7) and fused multiply-add, or FMA (Lucene 9.9).
Support for combining vector search with block joins. (Lucene 9.8)
Support for automatic int8 scalar quantization of vectors at index time. (Lucene 9.9)
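
To make the idea of int8 scalar quantization a bit more concrete, here is a minimal sketch in plain Java. It is not Lucene's implementation: the class and method names are hypothetical, and it assumes the simplest possible scheme, a linear mapping of each float from a fixed [min, max] range onto the integers 0..127. Lucene's own quantizer is more elaborate, notably in how it picks its bounds.

```java
// Hypothetical illustration only; not Lucene's quantizer.
final class ScalarQuantizationSketch {

    /** Maps each float linearly from [min, max] onto an integer in [0, 127]. Assumes max > min. */
    static byte[] quantize(float[] vector, float min, float max) {
        float scale = 127f / (max - min);
        byte[] out = new byte[vector.length];
        for (int i = 0; i < vector.length; i++) {
            // Clamp outliers to the chosen range before scaling.
            float clamped = Math.max(min, Math.min(max, vector[i]));
            out[i] = (byte) Math.round((clamped - min) * scale);
        }
        return out;
    }

    /** Approximate inverse of quantize; the error per dimension is at most half a step. */
    static float[] dequantize(byte[] quantized, float min, float max) {
        float step = (max - min) / 127f;
        float[] out = new float[quantized.length];
        for (int i = 0; i < quantized.length; i++) {
            out[i] = min + quantized[i] * step;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] v = {-1.0f, -0.25f, 0.3f, 1.0f};
        byte[] q = quantize(v, -1f, 1f);
        float[] back = dequantize(q, -1f, 1f);
        for (int i = 0; i < v.length; i++) {
            System.out.println(v[i] + " -> " + q[i] + " -> " + back[i]);
        }
    }
}
```

Each dimension now takes one byte instead of four, at the cost of a small rounding error per dimension. That is the trade-off quantization makes: less memory and cheaper comparisons in exchange for slightly approximate scores.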

Radix sort everywhere

Indexing is about organizing data in such a way that it can be efficiently accessed at search time, which involves a lot of sorting in practice. And when it comes to sorting, radix sort is king (when applicable!). Lucene had already been using radix sort in a few performance-sensitive places for a while, such as sorting the terms dictionary of segments. Its usage increased further in 2023, when it was adopted in additional places.
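
As a reminder of why radix sort is so attractive when it applies, here is a minimal sketch of a least-significant-byte-first radix sort in Java. It is only an illustration, not the code Lucene uses: the class name is hypothetical, and it assumes the keys are non-negative ints (doc IDs would be one example).

```java
import java.util.Arrays;

// Hypothetical illustration only; not Lucene's sorter.
final class RadixSortSketch {

    /** Sorts non-negative ints in place, one byte per pass, least significant byte first. */
    static void sort(int[] a) {
        int[] src = a;
        int[] dst = new int[a.length];
        for (int shift = 0; shift < 32; shift += 8) {
            // Histogram of the current byte, stored one slot to the right.
            int[] counts = new int[257];
            for (int v : src) {
                counts[((v >>> shift) & 0xFF) + 1]++;
            }
            // Prefix sums turn counts[b] into the start offset of bucket b.
            for (int i = 0; i < 256; i++) {
                counts[i + 1] += counts[i];
            }
            // Stable scatter into the destination buffer.
            for (int v : src) {
                dst[counts[(v >>> shift) & 0xFF]++] = v;
            }
            int[] tmp = src;
            src = dst;
            dst = tmp;
        }
        // Four passes means an even number of swaps, so the sorted data ends up back in a.
    }

    public static void main(String[] args) {
        int[] ids = {170, 45, 75, 90, 802, 24, 2, 66};
        sort(ids);
        System.out.println(Arrays.toString(ids)); // [2, 24, 45, 66, 75, 90, 170, 802]
    }
}
```

The sort never compares two keys with each other. It makes a fixed number of linear passes over the data, which is why it can beat comparison-based sorts when keys are fixed-width integers or byte strings.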

Faster query evaluation

We already covered some performance improvements for vector search, but keyword search saw major speedups as well in 2023. Check out this blog, which covers major speedups that occurred across the 9.7, 9.8 and 9.9 releases. These improvements apply both to traditional keyword search and to sparse vector search, such as that produced by learned sparse retrieval models.

Closer integration with the Java virtual machine

As a Java library, Lucene relies a lot on the Java virtual machine (JVM), and once in a while new features get released that are especially interesting for Lucene. Two features in particular have been integrated in such a way that if you run on a modern enough version of the JVM, then they will be used automatically:

The Panama vector API is used to speed up vector comparisons, such as computing the cosine similarity or the squared distance between two vectors.
The Panama MemorySegment API is an improved API to mmap files into memory.
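
For readers who want to see the arithmetic behind those vector comparisons, here is a minimal scalar sketch in Java. It deliberately does not use the Panama vector API, which is still an incubating module and needs extra JVM flags. Instead it relies on the standard `Math.fma` method to show what a fused multiply-add looks like. The class and method names are hypothetical, and Lucene's real code processes several vector lanes at once rather than one element per iteration.

```java
// Hypothetical illustration only; scalar code, not Lucene's Panama-based implementation.
final class VectorSimilaritySketch {

    /** Sum of squared differences between a and b, accumulated with fused multiply-add. */
    static float squareDistance(float[] a, float[] b) {
        checkSameLength(a, b);
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum = Math.fma(diff, diff, sum); // diff * diff + sum with a single rounding
        }
        return sum;
    }

    /** Cosine of the angle between a and b. Assumes neither vector is all zeros. */
    static float cosine(float[] a, float[] b) {
        checkSameLength(a, b);
        float dot = 0f;
        float normA = 0f;
        float normB = 0f;
        for (int i = 0; i < a.length; i++) {
            dot = Math.fma(a[i], b[i], dot);
            normA = Math.fma(a[i], a[i], normA);
            normB = Math.fma(b[i], b[i], normB);
        }
        return (float) (dot / Math.sqrt((double) normA * normB));
    }

    private static void checkSameLength(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector dimensions differ: " + a.length + " != " + b.length);
        }
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        float[] b = {4f, 6f, 3f};
        System.out.println(squareDistance(a, b)); // 25.0
        System.out.println(cosine(a, b));
    }
}
```

On hardware with FMA instructions, the JVM can compile `Math.fma` down to a single instruction. On hardware without them it can be much slower than a plain multiply and add, which is one reason a library would want to enable this kind of optimization only when the platform supports it.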

It's hard to draw a line, but I'll stop here as I struggle to find common themes for other good changes I'm looking at that happened in 2023. :) Stay tuned for a great year 2024 in Apache Lucene land!
