Apache Lucene 8 was released a few weeks ago with lots of exciting new features and improvements. Here's a look at some of the highlights:
When executing a search in Lucene 7, the scoring code will visit every document that matches the query, yielding both the top k highest scoring hits and an accurate count of the number of documents that matched. In many circumstances, an accurate count is unnecessary, and for queries that match a large number of documents, significant time is spent counting and scoring documents that will not end up in the top hits. Lucene 8 introduces a new API that allows you to opt out of this counting, returning instead a lower bound of the number of documents that match. This allows the introduction of a number of shortcuts, speeding up query execution.
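The new behaviour can be sketched in a few lines of Python (a toy model, not Lucene's actual collector API): the collector counts hits exactly up to a configurable threshold, after which the reported total becomes a lower bound and non-competitive documents become fair game for skipping.

```python
import heapq

def top_k_with_threshold(scored_docs, k, total_hits_threshold):
    """Collect the top k (score, doc) pairs from a stream of
    (doc, score) tuples. Once total_hits_threshold matches have been
    seen, the collector no longer promises an exact count, so the
    returned relation flips from an exact count to a lower bound."""
    heap = []  # min-heap holding the current top k (score, doc) pairs
    counted = 0
    relation = "EQUAL_TO"
    for doc, score in scored_docs:
        counted += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
        elif counted >= total_hits_threshold:
            # A real implementation would now skip whole blocks of
            # non-competitive documents instead of scoring them,
            # so the count is only a lower bound from here on.
            relation = "GREATER_THAN_OR_EQUAL_TO"
    return sorted(heap, reverse=True), counted, relation
```

The relation string mirrors the idea behind Lucene's reported total-hits relation: callers learn whether the count is exact or merely a floor.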
The idea that kicked off all these query speedups was first proposed back in 2012, and involves adding new information to the index, making it possible to calculate maximum scores for blocks of documents.
In general, the values that contribute to a document's score for any given query can be split into global factors (things like the total term frequency or average document length) and per-document, per-term factors, known as impacts. These take the form of a pair of numbers: the length of the document (compressed down into a single byte, known as a 'norm') and the frequency of the term in that document.
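To make this concrete, here is a toy sketch. The scoring function below is a simple saturation formula standing in for BM25 (not Lucene's actual implementation): it grows with term frequency and shrinks with document length, so evaluating it over a block's impact pairs bounds the score of every document in that block.

```python
def simple_score(freq, length):
    # Toy stand-in for a real similarity like BM25: the score grows
    # with term frequency and shrinks with document length.
    return freq / (freq + length)

def block_max_score(impacts):
    """Upper bound on the score of any document in a block, computed
    from the block's (frequency, norm) impact pairs."""
    return max(simple_score(freq, norm) for freq, norm in impacts)
```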
Lucene already divides indexing information for any given term into blocks, and builds a parallel structure called a skip list to allow queries to efficiently jump over documents that we know won't match a query. By adding a summary of the highest impacts in a block to that skip list, it's possible to calculate the largest score that could be produced by that block, and to skip over it entirely if the score is not competitive. Skip lists are much smaller and more efficient to decode than the postings lists they refer to, so the ability to avoid reading blocks altogether can yield enormous speedups for queries that touch a lot of documents. More details can be found in this blog post about faster retrieval of top hits.
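The skipping logic itself can be sketched like this (the data layout is hypothetical; a real postings format stores blocks compressed and decodes them lazily): each skip-list entry carries the block's precomputed maximum score, and blocks that cannot beat the current minimum competitive score are never decoded at all.

```python
def score_with_block_skipping(blocks, min_competitive_score, score_fn):
    """blocks: list of (max_score, postings) pairs, where max_score is
    the block's precomputed upper bound and postings is a list of
    (doc, freq, norm) tuples. Blocks whose upper bound is not
    competitive are skipped without decoding their postings."""
    hits = []
    for max_score, postings in blocks:
        if max_score < min_competitive_score:
            continue  # skip the whole block; its postings stay unread
        for doc, freq, norm in postings:
            score = score_fn(freq, norm)
            if score >= min_competitive_score:
                hits.append((doc, score))
    return hits
```

In a real search, min_competitive_score rises as better hits are collected, so more and more blocks become skippable as the query runs.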
Faster custom scores
The standard indexing chain stores term frequencies in the impacts list, but an impact is just a pair of numbers, and we can put any information we like in there. Lucene 8 provides a new field type called FeatureField that uses term frequencies to encode numerical data, and exposes special queries that use this information for scoring. These queries can then implement the same skipping shortcuts described above, resulting in very efficient custom-scoring queries. Elasticsearch makes these available via the rank_features fields, as described in this relevance tuning blog.
As well as simple boosts, you can also score by recency or proximity using distance feature queries. These skip non-competitive hits in a slightly different way: we can convert a minimum competitive score into a bounding box that excludes documents which are too far from the origin to make it into the top k hits. Elasticsearch will provide these via the new distance feature query, due to be included in version 7.1.
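Assuming a pivot-style decay of the form score = boost * pivot / (pivot + distance) (a common shape for this kind of query; the exact formula is an assumption here, not taken from the source), the minimum competitive score inverts directly into a maximum competitive distance, i.e. a bounding radius:

```python
def max_competitive_distance(boost, pivot, min_competitive_score):
    """Largest distance from the origin at which a document can still
    score at least min_competitive_score, assuming a decay of
    score = boost * pivot / (pivot + distance)."""
    if min_competitive_score <= 0:
        return float("inf")  # everything is competitive
    return pivot * (boost / min_competitive_score - 1.0)
```

Any document further from the origin than this radius can be skipped without being scored, and the radius shrinks as better hits are collected.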
Jump tables for doc values
Lucene provides a data structure called doc values that allows efficient per-document lookup, used for things like sorting or faceting. When this was first added to Lucene back in version 4.0, it was implemented as a straight look-up table with a fixed size for every entry, which allows very fast access but has a large footprint on disk. It also makes it difficult to distinguish between documents that have an empty value and documents that have no value at all. To deal with this, Lucene 7 changed the doc values API to use iterators, which allows fields where few documents have values to be stored in much less space. The trade-off, however, is that access to a particular document's value can be slower, as it is no longer a simple matter of reading the value at a fixed address.
To help improve this access speed, Lucene 8 adds jump tables to the default codec, similar to the skip lists described above. This makes looking up a value for a particular document much faster, as rather than decoding blocks of values in sequence until the correct document is reached, we can now jump directly to the correct block.
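A toy sketch of the idea (the names and block size are illustrative, not Lucene's actual codec): values for a sparse field are stored densely in doc order, and a jump table records where each fixed-size block of doc IDs starts, so a lookup scans at most one block instead of iterating from the beginning.

```python
import bisect

class SparseDocValues:
    """Toy sparse doc-values store: (doc, value) pairs kept in doc
    order, plus a jump table giving the index of the first entry in
    each fixed-size block of doc IDs."""

    def __init__(self, pairs, block_size=8):
        self.pairs = sorted(pairs)
        self.block_size = block_size
        docs = [doc for doc, _ in self.pairs]
        n_blocks = (docs[-1] // block_size + 1) if docs else 0
        # jump[b] = index of the first entry with doc >= b * block_size
        self.jump = [bisect.bisect_left(docs, b * block_size)
                     for b in range(n_blocks + 1)]

    def get(self, doc):
        b = doc // self.block_size
        # Jump straight to the block instead of scanning from index 0.
        i = self.jump[b] if b < len(self.jump) else len(self.pairs)
        while i < len(self.pairs) and self.pairs[i][0] <= doc:
            if self.pairs[i][0] == doc:
                return self.pairs[i][1]
            i += 1
        return None
```

Returning None for absent documents also shows how the iterator-based format distinguishes "no value" from "empty value": missing documents simply have no entry.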
Moving the terms dictionary off-heap
Reading large index files can use a lot of memory. Lucene tries to delegate as much of this memory management as it can to the operating system by loading index data using memory mapping. Frequently queried data will naturally stay in the page cache, while Java heap memory is freed up for your actual application.
In addition to postings lists and doc values tables, Lucene 8 now allows the index over the terms dictionary to be loaded off-heap as well. The trade-off is that in exchange for a smaller heap, terms dictionary lookups become slightly slower. In most cases that won't matter, because looking up a term is typically a very small part of a query, but it can have a big effect on primary key lookups, which are used heavily when indexing updates. The default codec mitigates this by keeping these indexes on-heap if each term only has a single document associated with it, as is the case for primary keys. It also only enables these changes when index data is read via a memory-mapped directory.
Work on Lucene doesn't stop just because a new version has been released. Plenty of improvements have already been committed for a future version 8.1, including more speed-ups for constant score and geo-queries and improvements to memory management. For more information about updates, or to report bugs, you can subscribe to the lucene-dev mailing list.