Recently we've made some big changes to Apache Lucene around how doc values are indexed and searched, including new nightly benchmarks to measure our progress, based on the New York City taxi ride data corpus. These changes fix doc values so you only pay for what you use, just like all other parts of the Lucene index.
These changes will be in Lucene's next major release (7.0) and will likely not be back-ported to any 6.x release, so it will be some time until Elasticsearch exposes this.
What are doc values?
Doc values are Lucene's column-stride field value storage, letting you store numerics (single- or multi-valued), sorted keywords (single or multi-valued) and binary data blobs per document. These values are quite fast to access at search time, since they are stored column-stride such that only the value for that one field needs to be decoded per hit. This is in contrast to Lucene's stored document fields, which store all field values for one document together in a row-stride fashion, and are therefore relatively slow to access.
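The column-stride vs. row-stride distinction can be made concrete with a small sketch. This is purely illustrative Python (the field names and data structures are invented here, not Lucene's actual implementation): reading one field per hit from a column touches only that field's values, while the row-stride path must load the whole document first.

```python
# Row-stride ("stored fields"): all fields for each document are stored
# together, so fetching one field means decoding the whole row.
stored_fields = [
    {"fare": 12.5, "distance": 3.1, "vendor": "green"},
    {"fare": 7.0, "distance": 1.2, "vendor": "yellow"},
]

# Column-stride ("doc values"): one column per field, one value per document.
doc_values = {
    "fare": [12.5, 7.0],
    "distance": [3.1, 1.2],
}

def fare_column_stride(doc_id):
    # Only the "fare" column entry for this document is touched.
    return doc_values["fare"][doc_id]

def fare_row_stride(doc_id):
    # The entire document row is loaded before the field is read.
    return stored_fields[doc_id]["fare"]

assert fare_column_stride(1) == fare_row_stride(1) == 7.0
```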
Doc values can be used to hold scoring signals (e.g. using expressions to combine multiple signals into a score), or for sorting, grouping, aggregations and, possibly in the future, block joins. Indeed, Lucene's default scoring signal, norms, is itself just a doc values field: a single byte per text field per document holding quantized index-time scoring signals (the field's length and boost), used to implement our BM25 scoring.
Plumbing first
The changes began with this issue, which was "simply" a low-level raw plumbing change switching how doc values are accessed at search time from the previous random-access API to an iterator API instead. You can see it as annotation H in the new nightly benchmarks. Because an iterator API is a much more restrictive access pattern than an arbitrary random-access API, this change gives codecs more freedom to use aggressive compression and other optimizations, like our postings implementations do.
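To see why the iterator restriction helps, consider this sketch (plain Python, loosely modeled on the shape of Lucene's iterator-style doc values, not a port of the Java API): when consumers can only walk forward in docID order, the codec need only store (docID, value) pairs for documents that actually have a value, which is exactly what makes sparse encodings practical.

```python
NO_MORE_DOCS = 2**31 - 1  # sentinel, as in Lucene's DocIdSetIterator

class SparseNumericDocValues:
    """Iterates (docID, value) pairs in increasing docID order."""
    def __init__(self, pairs):
        self.pairs = pairs  # sorted list of (docID, value)
        self.idx = -1

    def next_doc(self):
        self.idx += 1
        if self.idx >= len(self.pairs):
            return NO_MORE_DOCS
        return self.pairs[self.idx][0]

    def long_value(self):
        return self.pairs[self.idx][1]

# Only 3 of ~1,000,000 documents have a value; only 3 pairs are stored.
dv = SparseNumericDocValues([(7, 42), (1003, 17), (999999, 5)])
seen = []
doc = dv.next_doc()
while doc != NO_MORE_DOCS:
    seen.append((doc, dv.long_value()))
    doc = dv.next_doc()

assert seen == [(7, 42), (1003, 17), (999999, 5)]
```

A random-access API, by contrast, must be able to answer "what is the value for document N?" for any N at any time, which pushes codecs toward dense, per-document storage.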
That change was already massive enough that we decided to break out all such codec improvements to future issues. Since this was a plumbing change only, we lost some search-time performance at first (see annotation BU), as the existing Lucene codec had to use temporary silly wrapper classes to translate its random-access API into an iterator API.
Sparse cases, where not all documents have a value for each doc values
field, should especially benefit from this change. In the past,
Lucene's non-sparse encoding of such fields has been particularly
unnerving, especially when you call
forceMerge
and
see the size of your index suddenly grow 10-fold! But even dense
cases, where most documents have a value for the field, should see
performance gains as well, since the more restrictive API gives codecs
more compression freedom.
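A back-of-the-envelope sketch shows why the non-sparse encoding hurts so much. The numbers below are illustrative only (not Lucene's actual on-disk sizes): a dense per-document array pays for every document, while a sparse (docID, value) encoding pays only for documents that carry a value.

```python
# Illustrative sizes: a field present in only 10% of documents.
num_docs = 1_000_000
docs_with_value = 100_000
bytes_per_value = 8      # e.g. one long per value
bytes_per_doc_id = 4     # e.g. one int per docID

# Dense encoding stores a slot for every document, value or not.
dense_bytes = num_docs * bytes_per_value

# Sparse encoding stores (docID, value) only where a value exists.
sparse_bytes = docs_with_value * (bytes_per_doc_id + bytes_per_value)

assert dense_bytes == 8_000_000
assert sparse_bytes == 1_200_000
# The dense encoding is ~6.7x larger for this 10%-sparse field.
assert dense_bytes / sparse_bytes > 6
```

The sparser the field, the worse the dense encoding fares, which is how a forceMerge of a very sparse index could balloon its size.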
Codec Improvements
Fortunately, after the initial plumbing change, Adrien Grand worked hard to improve our default codec to take advantage of the more restrictive iterator APIs:
- Create a new 7.0 codec so we are free to make major changes to the index format
- Remove layers of abstraction for the common single-valued numerics and binary cases, norms, and when GCD compression is used (e.g. date/time fields)
- Implement sparse cases directly in the 7.0 codec, including norms
- Switch to a sparse encoding in IndexWriter to buffer doc values in memory before writing to disk
- Add a new advanceExact API to get faster skipping (without an implicit nextDoc) to a target document
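The advanceExact idea can be sketched as follows. This is hedged, illustrative Python modeled loosely on the semantics of Lucene's API, not a faithful port: instead of repeatedly calling nextDoc() until reaching or passing a target, advanceExact(target) jumps straight to the target document and reports whether it has a value.

```python
import bisect

class SparseValues:
    def __init__(self, pairs):  # sorted (docID, value) pairs
        self.doc_ids = [d for d, _ in pairs]
        self.values = [v for _, v in pairs]
        self.idx = -1

    def advance_exact(self, target):
        """Position on `target`; return True iff it has a value."""
        # Binary search instead of stepping one document at a time.
        self.idx = bisect.bisect_left(self.doc_ids, target)
        return (self.idx < len(self.doc_ids)
                and self.doc_ids[self.idx] == target)

    def long_value(self):
        return self.values[self.idx]

dv = SparseValues([(3, 30), (10, 100), (250, 2500)])
assert dv.advance_exact(10) and dv.long_value() == 100
assert not dv.advance_exact(11)  # document 11 has no value
```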
These changes brought back much of our search performance on the dense use cases tested by Lucene's existing nightly benchmarks, and in some cases improved performance beyond where we were previously.
With all these improvements, and I'm sure many more to come, we finally get Lucene's doc values to a point where you only pay for what you use, just like the rest of Lucene's index parts.
Sparse benchmarks
Along with these Lucene improvements we've added a new set of nightly Lucene benchmarks on sparse and dense documents to track our progress and any accidental performance regressions. As usual, the sources to run this benchmark are available here, and pull requests are welcome!
The benchmarks index a 20M document subset from the New York City taxi ride data corpus. Each document represents a single (anonymous!) taxi ride in New York City, either via a green or yellow cab. Green taxi rides make up about 11.5% of the corpus and yellow taxi rides around 88.5%, making a good test for mostly sparse and mostly dense fields, vs. 100% dense fields.
We index the same set of documents in three different ways:
- sparse, where green and yellow taxi rides have their own fields (e.g., green_fare_amount and yellow_fare_amount)
- dense, where all rides share a single set of fields and all documents have exactly the same fields (100% of fields are set)
- sparse-sorted, the same documents as sparse, but with index-time sorting by cab color
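The difference between the sparse and dense layouts can be sketched like this. The green_fare_amount and yellow_fare_amount field names come from the post; the input row format and helper functions are invented for this illustration and are not the benchmark's actual code.

```python
def to_sparse_doc(ride):
    # Each cab color gets its own fields (green_fare_amount,
    # yellow_fare_amount, ...), so a yellow ride leaves every
    # green_* field entirely absent, and vice versa.
    return {f"{ride['color']}_fare_amount": ride["fare_amount"]}

def to_dense_doc(ride):
    # All rides share one set of fields; every field is always set.
    return {"cab_color": ride["color"], "fare_amount": ride["fare_amount"]}

ride = {"color": "green", "fare_amount": 9.5}
assert to_sparse_doc(ride) == {"green_fare_amount": 9.5}
assert to_dense_doc(ride) == {"cab_color": "green", "fare_amount": 9.5}
```

The sparse-sorted index uses the same per-color fields as sparse, differing only in that the index is sorted by cab color at index time, so all green rides end up adjacent.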
We also run a few basic searches over those three indices, including a TermQuery on cab color, sorting both by a numeric field and by (degenerate) score, a two-clause BooleanQuery (cab color green or yellow) which matches all documents, and a numeric points range filter.
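As a toy illustration of what those queries compute, here are in-memory Python equivalents over a tiny invented dataset (the real benchmark of course runs Lucene's TermQuery, BooleanQuery, and points range queries over the 20M-document indices):

```python
docs = [
    {"cab_color": "green",  "fare_amount": 9.5},
    {"cab_color": "yellow", "fare_amount": 22.0},
    {"cab_color": "yellow", "fare_amount": 6.0},
]

# TermQuery on cab color, sorted by a numeric field:
term_hits = sorted((d for d in docs if d["cab_color"] == "yellow"),
                   key=lambda d: d["fare_amount"])
assert [d["fare_amount"] for d in term_hits] == [6.0, 22.0]

# Two-clause BooleanQuery (green OR yellow) matches every document:
bool_hits = [d for d in docs if d["cab_color"] in ("green", "yellow")]
assert len(bool_hits) == len(docs)

# Numeric range filter, like a points range query on fare_amount:
range_hits = [d for d in docs if 5.0 <= d["fare_amount"] <= 10.0]
assert [d["fare_amount"] for d in range_hits] == [9.5, 6.0]
```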
Back testing, where we check out the master sources as of a specific point in the past and then run the benchmark, was particularly tricky: it required linearizing the git commit history to determine which commit hash the master branch pointed to at different times in the past, something git (amazingly) seems not to preserve.
Benchmark results
As typically happens when one creates a new benchmark, there are many curious "wow, that's funny" (WTF) jumps and drops that need explaining, but here are some clear initial observations:
- The indexing metrics improve nicely over time: index size drops; indexing rate increases; docs per MB increases; flush and merge times drop
- Index-time sorting makes indexing ~50% slower, but gives impressive search speedups for queries like BooleanQuery, a worthwhile tradeoff for many users; even CheckIndex time is substantially faster with index sorting
- Index-time sorting causes points to use more search-time heap because it prevents this optimization; maybe we can fix the optimization to apply to sorted indices as well (pull requests welcome!)
- This change, to improve/simplify how MMapDirectory tries to prevent access to a closed file, made TermQuery and BooleanQuery faster as it reduced the cost of cloning IndexInput, which is done many times for one query
- Annotations H and I on the TermQuery, sorted by longitude chart make it clear how much faster MMapDirectory is than NIOFSDirectory
Those last two annotations were important! Here we see an accidental performance regression, caused by a trivial bug fix that otherwise passed Lucene's numerous unit tests, but inadvertently caused Lucene to pick NIOFSDirectory instead of MMapDirectory. This is a stark reminder of why automated benchmarks are so important! We fixed it in annotation I, just in time for Lucene's 6.3.0 release.