04 août 2016 Technique

Searching numb3rs in 5.0

By Adrien Grand

Lucene 6.0 introduced new exciting feature called multi-dimensional points. While the name of the feature puts a lot of emphasis on the fact that it supports multiple dimensions, this feature behaves more than decently in the case of a single dimension and actually better than the trie encoding that was used in previous versions of Lucene. This caused us to refactor number-based fields (byte, short, integer, long, float, double, ip and date) to use this new data-structure for indexing numbers.

How does it work?

The current data-structure that underlies dimensional points is called a block KD tree, which in the case of a single dimension is a simple binary search tree that stores blocks of values on the leaves rather than individual values. Lucene currently defaults to having between 512 and 1024 values on the leaves.

This is quite similar to the b-trees that are used in most regular databases in order to index data. In the case of Lucene, the fact that files are write-once makes the writing easy since we can build a perfectly balanced tree and then not worry about rebalancing since the tree will never be modified. Merging is easy too since you can get a sorted iterator over the values that are stored in a segment, then merge these iterators into a sorted iterator on the fly and build the tree of the merged segment from this sorted iterator.

Comparison with the old number implementation

While Mike already reported that 1D points prove faster to index, faster to search, more disk-efficient and more memory-efficient than the old implementation, things got even better with recent optimizations in Lucene 6.1 and the upcoming 6.2:

  • The collector for matching doc ids got optimizations so that collecting the set of documents that match a query is now faster.
  • The disk format now takes better advantage of the fact that values in a leaf block are close of each other for compression.
  • Each value in the tree is stored next to the doc ID of the document that this value belongs to. These doc IDs are now compressed based on the number of documents in the segment rather than using 4 bytes all the time.
  • While merging 1D points is cheap since trees can be merged with a simple merge sort, flush is more costly as the tree needs to be built from an unordered set of values. But this recently got a speed up by sorting directly in the IndexWriter buffer and switching to radix sort as a sort algorithm.

So I re-ran the same benchmark as in the original post about points to get a more up-to-date comparison in the case of a single dimension.

Query time (msec)010203040PointsNumeric Field
ApproachQuery time (msec)
Points25
Numeric Field39.3
Index time (msec)060120180240PointsNumeric FieldIndex time (msec)
ApproachIndex time (sec)
Points62.3
Numeric Field217
Index size (MB)0200400600800PointsNumeric FieldIndex size (MB)
ApproachIndex size (MB)
Points251
Numeric Field744
Search time heap (MB)0481216PointsNumeric FieldSearch time heap (MB)
ApproachSearch time heap (MB)
Points2.12
Numeric Field14.4
For this particular data and set of queries, points proved 36% faster at query time, 71% faster at index time and used 66% less disk and 85% less memory. Numbers may differ based on the data and queries, but we are confident that points will perform much better on average than the current implementation, especially in the case of very high-cardinality fields.

New number types

Beats is pushing the limits of Elasticsearch when it comes to numbers. For time series use-cases in general, users want to be able to index as fast as possible using as little disk space as possible. While integer types are easy to compress based on the number of bits that they actually use (for instance if all integers are between 2 and 200, they can be encoded on one byte), the same is not true for floating-point data. In particular, the fact that they cannot represent decimals accurately makes them use all available bits in order to be as close as possible to the value that needs to be represented. This makes compression hard since values in a given segment rarely have a pattern that can be leveraged for compression.

But do you actually need the accuracy of a float or a double for your data, or would you happily trade some precision for reduced disk usage and faster queries? For those to whom this sounds like a logical trade-off, we introduced two new field types called half_float and scaled_float.

Half floats work the same way as floats and doubles, except that they use fewer bits for the mantissa and the exponent, allowing them to be stored on 16 bits in total, rather than 32 in the case of floats and 64 in the case of doubles. However beware that this storage reduction comes at a cost: they only have 3.3 significant decimal digits and the maximum value is 65504.

We suspect the new scaled_float field will be even more useful in practice: it takes a scaling factor and for every value, it internally stores the long that is closest to the product of the value and the scaling factor. For instance, with a scaling factor of 100, 0.123 would internally be indexed as 12 and all queries and aggregations would behave as if the value was actually 0.12. This helps because even though values are decimal, they are internally encoded as integers, which Lucene can compress more efficiently.

Conclusion

Number support improved greatly in the upcoming 5.0 release. Feel free to give a try to the alpha releases if you want to check out these improvements on your data.