2023 just came to an end, and it's been another active year for Apache Lucene development. Let's take some time to review highlights from last year.
Community
In 2023, there were:
- 5 minor releases (9.5, 9.6, 9.7, 9.8 and 9.9),
- 1 patch release (9.9.1),
- 1 new committer,
- 4 new PMC members,
- 620 commits from 97 unique contributors.
Vector search
The promise of truly semantic search for retrieval and retrieval augmented generation appeals a lot to users, large and small. So it's no surprise that vector search has been a major theme for Apache Lucene in 2023. More specifically, many interesting features and optimizations have been added across several releases:
- Support for int8 vectors. (Lucene 9.5)
- Faster merging of HNSW graphs. (Lucene 9.6)
- Faster indexing, merging and querying through support for SIMD vectorization (Lucene 9.7) and fused multiply-add, or FMA (Lucene 9.9).
- Support for combining vector search with block joins. (Lucene 9.8)
- Support for automatic int8 scalar quantization of vectors at index time. (Lucene 9.9)
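To give a feel for what scalar quantization means, here is a minimal sketch in plain Java. It is not Lucene's implementation: the class name, the method names and the choice of a fixed [0, 127] target range with caller-supplied min and max bounds are all assumptions made for illustration. The idea is simply that each float is clamped to a range and mapped linearly onto a small integer, which cuts storage to one byte per dimension at the cost of a bounded rounding error.

```java
// Illustrative sketch only; names and the [0, 127] range are assumptions, not Lucene's API.
final class ScalarQuantizationSketch {

    /** Maps each float linearly from [min, max] onto an integer in [0, 127]. */
    static byte[] quantize(float[] vector, float min, float max) {
        float scale = 127f / (max - min);
        byte[] out = new byte[vector.length];
        for (int i = 0; i < vector.length; i++) {
            // Values outside [min, max] are clamped before being quantized.
            float clamped = Math.max(min, Math.min(max, vector[i]));
            out[i] = (byte) Math.round((clamped - min) * scale);
        }
        return out;
    }

    /** Approximately reverses quantize(); the error is at most half a quantization step. */
    static float[] dequantize(byte[] quantized, float min, float max) {
        float step = (max - min) / 127f;
        float[] out = new float[quantized.length];
        for (int i = 0; i < quantized.length; i++) {
            out[i] = min + quantized[i] * step;
        }
        return out;
    }
}
```

A real implementation also has to decide how to pick min and max from the data and how to correct similarity scores for the rounding error, both of which this sketch leaves out.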
Radix sort everywhere
Indexing is about organizing data in such a way that it can be efficiently accessed at search time, which involves a lot of sorting in practice. And when it comes to sorting, radix sort is king (when applicable!). Lucene had already been using radix sort in a few performance-sensitive places for a while, such as sorting the terms dictionary of segments. But usage of radix sort further increased in 2023, and it began being used to optimize:
- applying deletes,
- sorting postings when index sorting is enabled,
- TermInSetQuery construction,
- index reordering.
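For readers who have not met radix sort before, the following is a minimal sketch of a least-significant-digit radix sort over non-negative ints, processing one byte per pass. It is a textbook variant written for illustration, not the sorter Lucene uses, and the class and method names are hypothetical. It shows why radix sort is attractive when applicable: it never compares two values, it only counts and moves them.

```java
// Illustrative sketch only; not Lucene's sorter. Assumes all values are non-negative.
final class RadixSortSketch {

    /** Sorts non-negative ints in place using four counting passes, one per byte. */
    static void sort(int[] values) {
        int[] src = values;
        int[] dst = new int[values.length];
        for (int shift = 0; shift < 32; shift += 8) {
            // Count how many values fall into each of the 256 buckets for this byte.
            int[] counts = new int[257];
            for (int v : src) {
                counts[((v >>> shift) & 0xFF) + 1]++;
            }
            // Prefix sums turn the counts into the start offset of each bucket.
            for (int i = 0; i < 256; i++) {
                counts[i + 1] += counts[i];
            }
            // Stable scatter: values keep their relative order within a bucket.
            for (int v : src) {
                dst[counts[(v >>> shift) & 0xFF]++] = v;
            }
            int[] tmp = src;
            src = dst;
            dst = tmp;
        }
        // Four passes means an even number of swaps, so the sorted data is back in values.
    }
}
```

The "when applicable" caveat shows up directly in the code: it relies on keys being fixed-width and on byte order matching sort order, which is why negative numbers would need extra handling.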
Faster query evaluation
We already covered some performance improvements for vector search, but keyword search saw major speedups as well in 2023. Check out the blog post "Apache Lucene 9.9, the fastest Lucene release ever", which covers major speedups that occurred across the 9.7, 9.8 and 9.9 releases. These improvements apply both to traditional keyword search and to sparse vector search, such as over the sparse vectors produced by learned sparse retrieval models.
Closer integration with the Java virtual machine
As a Java library, Lucene relies a lot on the Java virtual machine (JVM), and once in a while new features get released that are especially interesting for Lucene. Two features in particular have been integrated in such a way that if you run on a modern enough version of the JVM, then they will be used automatically:
- The Panama vector API is used to speed up vector comparisons, such as computing the cosine similarity or the square distance between two vectors.
- The Panama MemorySegment API is an improved API to mmap files into memory.
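As a point of reference for the vector comparisons mentioned above, here is a scalar sketch of square distance and cosine similarity in plain Java. It deliberately does not use the Panama vector API, which is what Lucene relies on for the SIMD speedups; it only shows what is being computed. The class and method names are hypothetical. It uses Math.fma, the standard fused multiply-add method available since Java 9, to accumulate each product.

```java
// Illustrative scalar baseline; not Lucene's code and not the Panama vector API.
final class VectorSimilaritySketch {

    /** Sum of squared differences between two vectors of equal length. */
    static float squareDistance(float[] a, float[] b) {
        checkSameLength(a, b);
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            // fma computes diff * diff + sum with a single rounding step.
            sum = Math.fma(diff, diff, sum);
        }
        return sum;
    }

    /** Cosine of the angle between two non-zero vectors of equal length. */
    static float cosine(float[] a, float[] b) {
        checkSameLength(a, b);
        float dot = 0f;
        float normA = 0f;
        float normB = 0f;
        for (int i = 0; i < a.length; i++) {
            dot = Math.fma(a[i], b[i], dot);
            normA = Math.fma(a[i], a[i], normA);
            normB = Math.fma(b[i], b[i], normB);
        }
        return (float) (dot / Math.sqrt((double) normA * normB));
    }

    private static void checkSameLength(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector dimensions differ");
        }
    }
}
```

One caveat worth knowing: on CPUs without hardware FMA support, Math.fma can be much slower than a plain multiply and add, so a library has to check before relying on it.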
It's hard to draw a line, but I'll stop here as I struggle to find common themes for other good changes I'm looking at that happened in 2023. :) Stay tuned for a great year 2024 in Apache Lucene land!