Recently we've made some big changes to Apache Lucene around how doc values are indexed and searched, including new nightly benchmarks to measure our progress, based on the New York City taxi ride data corpus. These changes fix doc values so you only pay for what you use, just like all other parts of the Lucene index.
These changes will be in Lucene's next major release (7.0) and will likely not be back-ported to any 6.x release, so it will be some time until Elasticsearch exposes this.
What are doc values?
Doc values are Lucene's column-stride field value storage, letting you store numerics (single- or multi-valued), sorted keywords (single or multi-valued) and binary data blobs per document. These values are quite fast to access at search time, since they are stored column-stride such that only the value for that one field needs to be decoded per hit. This is in contrast to Lucene's stored document fields, which store all field values for one document together in a row-stride fashion, and are therefore relatively slow to access.
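The column-stride versus row-stride distinction can be sketched in a few lines of plain Java (hypothetical toy classes, not Lucene's actual storage): with a row-stride layout you decode a whole document to read one field, while a column-stride layout keeps all values of one field together, so reading that field per hit touches only one array.

```java
import java.util.List;
import java.util.Map;

public class StrideDemo {
    // Column-stride access: position i holds doc i's value for this one field,
    // so summing "fare" over many hits only ever touches this array.
    public static double sumFares(double[] fareColumn) {
        double sum = 0;
        for (double v : fareColumn) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        // Row-stride: like Lucene's stored fields, all field values for one
        // document are kept together, so fetching one field still means
        // loading the whole row for that document.
        List<Map<String, Object>> rows = List.of(
            Map.of("fare", 12.5, "tip", 2.0, "color", "yellow"),
            Map.of("fare", 7.0,  "tip", 1.0, "color", "green"),
            Map.of("fare", 30.0, "tip", 5.5, "color", "yellow"));

        // Column-stride: like doc values, one array per field.
        double[] fareColumn = {12.5, 7.0, 30.0};

        double rowSum = 0;
        for (Map<String, Object> doc : rows) rowSum += (double) doc.get("fare");
        System.out.println(rowSum == sumFares(fareColumn)); // same data, different layout
    }
}
```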
Doc values can be used to hold scoring signals, e.g. using expressions to combine multiple signals into a score, for sorting, for block joins in the future, etc. Indeed, Lucene's default norms, a single byte per text field per document holding index-time scoring signals (the field's length and boost, quantized), are also just a doc values field.
The changes began with a "simple" low-level plumbing change that switched how doc values are accessed at search time, from the previous random-access API to an iterator API. Because an iterator API is a much more restrictive access pattern than an arbitrary random-access API, this change gives codecs more freedom to use aggressive compression and other optimizations, like our postings implementations do.
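A toy sketch of that iterator-style access pattern (plain Java, not Lucene's real classes): values are reachable only by advancing forward through the documents that have one, mirroring the `DocIdSetIterator` contract with its `NO_MORE_DOCS` sentinel. Because the consumer can never jump backwards or probe arbitrary documents, the encoding underneath is free to be delta-compressed or sparse.

```java
public class SparseNumericIterator {
    public static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;    // ascending doc ids that have a value
    private final long[] values; // values[i] belongs to docs[i]
    private int upto = -1;

    public SparseNumericIterator(int[] docs, long[] values) {
        this.docs = docs;
        this.values = values;
    }

    public int docID() {
        if (upto < 0) return -1;                    // not yet positioned
        if (upto >= docs.length) return NO_MORE_DOCS;
        return docs[upto];
    }

    // Forward-only: move to the next document that has a value.
    public int nextDoc() {
        upto++;
        return docID();
    }

    public long longValue() { return values[upto]; }

    public static void main(String[] args) {
        SparseNumericIterator it =
            new SparseNumericIterator(new int[]{1, 4, 9}, new long[]{10, 40, 90});
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            System.out.println(doc + " -> " + it.longValue());
        }
    }
}
```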
That change was already massive enough that we decided to break out all such codec improvements into future issues. Since this was a plumbing change only, we lost some search-time performance at first, as the existing Lucene codec had to use temporary silly wrapper classes to translate its random-access API into an iterator API.
Sparse cases, where not all documents have a value for each doc values field, should especially benefit from this change. In the past, Lucene's non-sparse encoding of such fields has been particularly unnerving: you could see the size of your index suddenly grow 10-fold! But even dense cases, where most documents have a value for the field, should see performance gains as well, since the more restrictive API gives codecs more compression freedom.
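A back-of-the-envelope sketch shows why sparse fields were so costly (the per-entry sizes here are illustrative assumptions, not Lucene's actual encodings): a dense encoding reserves a slot for every document whether or not it has a value, while a sparse encoding only stores a (doc id, value) pair per document that does.

```java
public class EncodingSize {
    // Dense: one 8-byte slot per document, value or not.
    public static long denseBytes(long totalDocs) {
        return totalDocs * 8;
    }

    // Sparse: a 4-byte doc id plus an 8-byte value, only for docs with a value.
    public static long sparseBytes(long docsWithValue) {
        return docsWithValue * (4 + 8);
    }

    public static void main(String[] args) {
        long total = 100_000_000L;
        long withValue = total / 100; // only 1% of documents have this field
        System.out.println(denseBytes(total));      // 800000000 bytes (~800 MB)
        System.out.println(sparseBytes(withValue)); // 12000000 bytes (~12 MB)
    }
}
```

Under these assumed sizes, a field present in only 1% of documents shrinks by roughly 66x when it stops paying for the other 99%.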
Fortunately, after the initial plumbing change, Adrien Grand worked hard to improve our default codec to take advantage of the more restrictive iterator APIs:
- Create a new 7.0 codec so we are free to make major changes to the index format
- Remove layers of abstraction for the common single-valued numerics and binary cases, for norms, and when GCD compression is used (e.g. date/time fields)
- Implement sparse cases directly in the 7.0 codec, including norms
- Switch to a sparse encoding in IndexWriter to buffer doc values in memory before writing to disk
- Add an advanceExact API to get faster skipping (without an implicit nextDoc) to a target document
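The advanceExact idea in the last bullet can be sketched like this (a simplified toy, not Lucene's implementation): the caller asks about exactly one target document and gets back whether it has a value, without the iterator implicitly moving on to the next document that does, the way a plain advance would.

```java
public class AdvanceExactDemo {
    private final int[] docs;    // ascending doc ids that have a value
    private final long[] values; // values[i] belongs to docs[i]
    private int upto = 0;

    public AdvanceExactDemo(int[] docs, long[] values) {
        this.docs = docs;
        this.values = values;
    }

    // Position on `target` and report whether it has a value; never
    // skips past the target the way an advance-to-next-match would.
    public boolean advanceExact(int target) {
        while (upto < docs.length && docs[upto] < target) upto++;
        return upto < docs.length && docs[upto] == target;
    }

    public long longValue() { return values[upto]; }

    public static void main(String[] args) {
        AdvanceExactDemo dv = new AdvanceExactDemo(new int[]{2, 5}, new long[]{20, 50});
        System.out.println(dv.advanceExact(2) + " " + dv.longValue()); // true 20
        System.out.println(dv.advanceExact(3));                        // false: doc 3 has no value
    }
}
```

Callers that visit documents in order (e.g. a collector walking hits) can ask about each hit exactly once, paying nothing for the documents in between.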
These changes brought back much of our search performance on the dense use cases tested by Lucene's existing nightly benchmarks, and in some cases improved performance beyond where we were previously.
With all these improvements, and I'm sure many more to come, we finally get Lucene's doc values to a point where you only pay for what you use, just like the rest of Lucene's index parts.
Along with these Lucene improvements we've added a new set of nightly Lucene benchmarks on sparse and dense documents to track our progress and any accidental performance regressions. As usual, the sources to run this benchmark are available, and pull requests are welcome!
The benchmarks index a 20M-document subset from the New York City taxi ride data corpus. Each document represents a single (anonymous!) taxi ride in New York City, either via a green or yellow cab. Green taxi rides are about 11.5% of the documents and yellow taxi rides are around 88.5%, making a good test for mostly sparse and mostly dense fields, vs. 100% dense fields.
We index the same set of documents in three different ways:
- sparse, where green and yellow taxi rides have their own fields (e.g., green_fare_amount and yellow_fare_amount)
- dense, where all rides share a single set of fields and all documents have exactly the same fields (100% of fields are set)
- sparse-sorted, with index-time sorting by cab color
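The sparse versus dense schemas above can be sketched as document builders (plain Java maps as stand-ins for Lucene documents; the field names beyond those given in the text are illustrative): the same ride is indexed either under a per-color field, leaving the other color's fields empty, or under one shared field plus a cab_color field.

```java
import java.util.HashMap;
import java.util.Map;

public class TaxiSchemas {
    // Sparse schema: each cab color gets its own field, so a green ride
    // sets green_fare_amount and leaves yellow_fare_amount unset.
    public static Map<String, Object> sparseDoc(String color, double fare) {
        Map<String, Object> doc = new HashMap<>();
        doc.put(color + "_fare_amount", fare);
        return doc;
    }

    // Dense schema: all rides share one set of fields; 100% of fields are set.
    public static Map<String, Object> denseDoc(String color, double fare) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("cab_color", color);
        doc.put("fare_amount", fare);
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(sparseDoc("green", 7.0));  // only green_* fields set
        System.out.println(denseDoc("green", 7.0));   // shared fields, always set
    }
}
```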
We also run a few basic searches over those three indices:
- a TermQuery on cab color, sorted by a numeric field or by (degenerate) score
- a two-clause BooleanQuery (cab color green or yellow), which matches all documents
- a numeric points range filter
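As a rough stand-in for those three searches (plain Java over toy documents, not Lucene query code; the data is made up), here is what each one matches:

```java
import java.util.List;
import java.util.Map;

public class BenchQueries {
    static final List<Map<String, Object>> DOCS = List.of(
        Map.of("cab_color", "green",  "fare_amount", 7.0),
        Map.of("cab_color", "yellow", "fare_amount", 12.5),
        Map.of("cab_color", "yellow", "fare_amount", 30.0));

    // TermQuery: exact match on a single field value.
    public static long termQuery(String color) {
        return DOCS.stream().filter(d -> d.get("cab_color").equals(color)).count();
    }

    // Two-clause OR on cab color; since every ride is green or yellow,
    // this matches all documents.
    public static long greenOrYellow() {
        return DOCS.stream()
            .filter(d -> d.get("cab_color").equals("green")
                      || d.get("cab_color").equals("yellow"))
            .count();
    }

    // Numeric range filter, inclusive on both ends.
    public static long fareRange(double lo, double hi) {
        return DOCS.stream()
            .filter(d -> {
                double f = (double) d.get("fare_amount");
                return f >= lo && f <= hi;
            })
            .count();
    }

    public static void main(String[] args) {
        System.out.println(termQuery("yellow")); // 2
        System.out.println(greenOrYellow());     // 3 (all documents)
        System.out.println(fareRange(5, 15));    // 2
    }
}
```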
Back testing, where we checkout the master sources at a specific point in time in the past and then run the benchmark, was particularly tricky since it required linearizing the git commit history to match which git commit hash the master branch pointed to at different times in the past, something git (amazingly) seems not to preserve.
As typically happens when one creates a new benchmark, there are many curious WTF ("wow, that's funny") jumps and drops that need explaining, but here are some clear initial observations:
- The indexing metrics improve nicely over time: index size drops; indexing rate increases; docs per MB increases; flush and merge times drop
- Index-time sorting makes indexing ~50% slower, but gives speedups for queries like the BooleanQuery, a worthwhile tradeoff for many users; even CheckIndex time is substantially faster with index sorting
- Index-time sorting causes points to use more search-time heap because it prevents this optimization; maybe we can fix the optimization to apply to sorted indices as well (pull requests welcome!)
- A change to improve/simplify how MMapDirectory tries to prevent access to a closed file made BooleanQuery faster, as it reduced the cost of cloning IndexInput, which is done many times for one query
- Annotation I on the TermQuery, sorted by longitude chart makes it clear how much faster this case became
Those last two annotations were important! Here we see an accidental performance regression, caused by a bug fix that otherwise passed Lucene's numerous unit tests yet inadvertently caused Lucene to pick a different Directory implementation than MMapDirectory. This is a stark reminder of why automated benchmarks are so important! We fixed it just in time for Lucene's 6.3.0 release.