Multi-dimensional points, coming in Apache Lucene 6.0

Coming in Lucene's next major release (6.0) is a new feature called dimensional points, using the k-d tree geo-spatial data structure to offer fast single- and multi-dimensional numeric range and geo-spatial point-in-shape filtering. As of this writing, Elasticsearch has not yet exposed points, but I expect that will change soon.

This feature replaces the now deprecated numeric fields and numeric range query since it has better overall performance and is more general, allowing up to 8 dimensions (versus 1) and up to 16 bytes (versus the 8-byte limit today) per dimension.

The k-d tree variant we implemented is the block k-d tree, which is specifically designed for efficient IO, such that most of the data structure resides in on-disk blocks, with a small in-heap binary tree index structure to locate the blocks at search time.

This means you will finally be able to use Lucene to efficiently index and range-filter anything that can be encoded as a fixed-length, ordered byte[], such as IPv6 InetAddress, BigInteger, BigDecimal, etc., along with 2D and 3D (and higher!) geo-spatial indices, and time-series values.
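To make "fixed-length, ordered byte[]" concrete, here is a tiny sketch (not Lucene's actual helper code) of how a signed int can be encoded that way: flip the sign bit and write the bytes big-endian, so that comparing the arrays byte-by-byte as unsigned values gives the same order as comparing the original numbers.

```java
class SortableBytesSketch {
  /** Encode a signed int as 4 bytes whose unsigned, lexicographic order matches numeric order. */
  static byte[] encodeInt(int value) {
    int sortable = value ^ 0x80000000;   // flip the sign bit so negative values sort before positive
    return new byte[] {
      (byte) (sortable >>> 24),          // big-endian: most significant byte first
      (byte) (sortable >>> 16),
      (byte) (sortable >>> 8),
      (byte) sortable
    };
  }
}
```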

k-d trees

Block k-d trees are a simple yet powerful data structure. At index time, they are built by recursively partitioning the full space of N-dimensional points to be indexed into smaller and smaller rectangular cells, splitting equally along the widest-ranging dimension at each step of the recursion. However, unlike an ordinary k-d tree, a block k-d tree stops recursing once there are fewer than a pre-specified number of points in the cell (1024 by default, in our case).

At that point, all points within that cell are written into one leaf block on disk, and the starting file pointer for that block is saved into an in-heap binary tree structure. In the 1D case, this is simply a full sort of all values, divided into adjacent leaf blocks. There are k-d tree variants that support removing values and rebalancing, but Lucene does not need these operations because of its write-once, per-segment design.
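Here is a toy sketch of that build recursion, operating on in-memory double[] points rather than the byte[] values and on-disk blocks Lucene actually uses; the 1024-point leaf size is the only number taken from the real default.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Toy sketch of the block k-d tree build; not Lucene's actual implementation. */
class BkdBuildSketch {
  static final int MAX_POINTS_PER_LEAF = 1024;   // Lucene's default leaf size

  /** points[i] is one N-dimensional value, e.g. {lat, lon}. Builds the cells for [from, to). */
  static void build(double[][] points, int from, int to) {
    if (to - from <= MAX_POINTS_PER_LEAF) {
      // Write all points in [from, to) as one on-disk leaf block and record its
      // starting file pointer in the small in-heap binary tree (omitted here).
      return;
    }
    int numDims = points[from].length;
    // Pick the dimension with the widest range of values in this cell:
    int splitDim = 0;
    double widest = -1;
    for (int d = 0; d < numDims; d++) {
      double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
      for (int i = from; i < to; i++) {
        min = Math.min(min, points[i][d]);
        max = Math.max(max, points[i][d]);
      }
      if (max - min > widest) {
        widest = max - min;
        splitDim = d;
      }
    }
    // Split the points into two equal halves around the median of that dimension:
    final int dim = splitDim;
    Arrays.sort(points, from, to, Comparator.comparingDouble(p -> p[dim]));
    int mid = (from + to) >>> 1;
    build(points, from, mid);   // left cell
    build(points, mid, to);     // right cell
  }
}
```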

At search time, the same recursion takes place, testing at each level whether the requested query shape intersects the left or right sub-tree of each dimensional split, and recursing if so. In the 1D case, the query shape is simply a numeric range, whereas in the 2D and 3D cases, it is a geo-spatial shape (circle, ring, rectangle, polygon, cube, etc.).

Here is a video showing how the leaf blocks are visited to find all 2D (latitude/longitude) points inside the London, UK polygon, based on the PlanetOSM data.

Once the recursion ends at a leaf block, if the cell overlaps the shape's boundary (the blue cells) then each full-precision point in that block is tested against the shape. This check ("does the query shape contain this point?") can be somewhat costly, since it is computed per hit against a possibly complex shape; however, it is only done for the leaf cells that overlap the boundary of the shape. If instead the leaf block is fully contained inside the query shape (the pink cells), the documents with values in that cell are collected efficiently without testing each point.
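That traversal can be sketched like this for a simple 2D rectangle query. The inside/crosses/outside distinction mirrors what was just described, but the types here are invented for illustration and are not Lucene's API.

```java
/** Toy sketch of query-time traversal for a 2D rectangle query; not Lucene's code. */
class BkdSearchSketch {

  enum Relation { CELL_OUTSIDE_QUERY, CELL_INSIDE_QUERY, CELL_CROSSES_QUERY }

  /** A node of the in-heap tree: either an inner node with two children, or a leaf block. */
  static class Cell {
    double minLat, maxLat, minLon, maxLon;   // bounding box of all points under this cell
    Cell left, right;                        // null for leaf cells
    double[][] leafPoints;                   // the leaf block; read from disk in real life
  }

  static Relation relate(Cell c, double qMinLat, double qMaxLat, double qMinLon, double qMaxLon) {
    if (c.maxLat < qMinLat || c.minLat > qMaxLat || c.maxLon < qMinLon || c.minLon > qMaxLon) {
      return Relation.CELL_OUTSIDE_QUERY;
    }
    if (qMinLat <= c.minLat && c.maxLat <= qMaxLat && qMinLon <= c.minLon && c.maxLon <= qMaxLon) {
      return Relation.CELL_INSIDE_QUERY;
    }
    return Relation.CELL_CROSSES_QUERY;
  }

  static void visit(Cell cell, double qMinLat, double qMaxLat, double qMinLon, double qMaxLon) {
    switch (relate(cell, qMinLat, qMaxLat, qMinLon, qMaxLon)) {
      case CELL_OUTSIDE_QUERY:
        return;                                // prune: nothing under this cell can match
      case CELL_INSIDE_QUERY:
        collectAll(cell);                      // every point matches; no per-point checks needed
        return;
      case CELL_CROSSES_QUERY:
        if (cell.leafPoints != null) {         // boundary leaf: test each full-precision point
          for (double[] p : cell.leafPoints) {
            if (p[0] >= qMinLat && p[0] <= qMaxLat && p[1] >= qMinLon && p[1] <= qMaxLon) {
              collect(p);
            }
          }
        } else {                               // inner node: recurse into both children
          visit(cell.left, qMinLat, qMaxLat, qMinLon, qMaxLon);
          visit(cell.right, qMinLat, qMaxLat, qMinLon, qMaxLon);
        }
    }
  }

  static void collectAll(Cell cell) { /* add all docIDs under this cell to the result */ }
  static void collect(double[] point) { /* add this point's docID to the result */ }
}
```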

K-d trees are fast because they naturally adapt to each data set's particular distribution, using small leaf blocks where the indexed values are dense: notice how central London, where there is a higher density of points, is assigned smaller leaf cells. K-d trees also naturally find the right tradeoff between recursing more deeply to split up the dimensional space and simply scanning the full-precision values in one cell. This is in contrast to legacy numeric fields, which always index the same precision levels for every value (4 terms for a long value, by default) regardless of how the points are distributed.


Lucene's implementation

Each indexed value, for example a single IntPoint added to your document, is translated into a fixed-length byte[] for a specific number of dimensions. There are classes for each native type (LongPoint, FloatPoint, etc.) that handle converting values of that type to the matching byte[]. For a given field name, all documents in the index must have the same number of dimensions and the same byte[] length across dimensions. Multi-valued fields are allowed, by adding the same field name multiple times in one document.
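For example, indexing a 1D and a 2D point might look like this. This is written against the pre-release API, so names could still change; `writer` is assumed to be an already-opened IndexWriter.

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.DoublePoint;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.IndexWriter;

class PointIndexingSketch {
  /** Adds one document with a 1D long point and a 2D double point. */
  static void addExample(IndexWriter writer) throws IOException {
    Document doc = new Document();
    // 1D point: a single 8-byte long dimension (e.g. a timestamp):
    doc.add(new LongPoint("timestamp", 1456700000000L));
    // 2D point: two double dimensions (latitude, longitude):
    doc.add(new DoublePoint("location", 51.5074, -0.1278));
    // Multi-valued is fine: just add the same field name again:
    doc.add(new LongPoint("timestamp", 1456703600000L));
    writer.addDocument(doc);
  }
}
```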

Lucene's IndexWriter buffers the points you have indexed, and then uses PointFormat, a newly added codec component, to write the values to a new segment. The default codec supports points, but the simple text codec does as well (it was the first implementation!), so you can see the leaf blocks in a simple, human-readable plain-text file if you are curious (do not use it in production!).
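If you do want to peek at those plain-text leaf blocks, something like the following should work, assuming the lucene-codecs module is on the classpath (again: pre-release APIs, and never use this codec in production):

```java
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

class SimpleTextPointsDemo {
  static IndexWriter openWriter() throws IOException {
    Directory dir = FSDirectory.open(Paths.get("/tmp/points-demo"));
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    // Human-readable index files, including the point leaf blocks; for curiosity only:
    iwc.setCodec(new SimpleTextCodec());
    return new IndexWriter(dir, iwc);
  }
}
```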

Developing the feature also required some exciting low-level infrastructure improvements, including switching our offline sorter to use Lucene's Directory abstraction instead of direct filesystem IO, adding a new Directory.createTempOutput method to create new temporary files for writing, and moving delete-retry (on Windows) responsibility under Directory.

There are already several Lucene geo-spatial queries based on dimensional values, including the sandbox geo-spatial queries (PointInRectQuery and PointInPolygonQuery, with PointDistanceQuery coming soonish), and the spatial3d module indexes and searches 3D (x, y, z) dimensional values. Lucene's core also includes PointRangeQuery, to filter by an N-dimensional box, and ExactPointQuery, to match exactly a single point.
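Assuming the per-type factory methods on the point classes (which build the underlying point queries for you), filtering might look like this; exact names could still shift before 6.0 ships. The field names match the indexing sketch above.

```java
import java.io.IOException;
import org.apache.lucene.document.DoublePoint;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

class PointQuerySketch {
  static TopDocs search(IndexSearcher searcher) throws IOException {
    // 1D range filter over the "timestamp" long point:
    Query timeRange = LongPoint.newRangeQuery("timestamp", 1456700000000L, 1456790000000L);
    searcher.search(timeRange, 10);

    // 2D box filter over the "location" (lat, lon) double point:
    Query box = DoublePoint.newRangeQuery("location",
        new double[] {51.3, -0.5},     // lower corner (minLat, minLon)
        new double[] {51.7,  0.3});    // upper corner (maxLat, maxLon)
    return searcher.search(box, 10);
  }
}
```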

2D Performance

To assess the 2D performance impact, I created a simple benchmark (all sources are here) to index and filter a 60.8 million point latitude/longitude subset of the PlanetOSM data set, generously provided by the OpenStreetMaps.org foundation, querying with a set of regularly spaced, varying-sized rectangles around London, UK. I recorded these four metrics:

  • Time to build the index, using a single thread and SerialMergeScheduler with LogDocMergePolicy and a fixed document count written to each segment, to keep comparisons fair since the resulting indices have exactly the same segment structure (a rough sketch of this setup follows the list)
  • Size on disk of the resulting index
  • Heap size of the in-memory search-time index structures
  • Query time (average msec for 225 queries, best across 50 iterations)
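The actual benchmark sources are linked above; as a rough illustration only, the kind of IndexWriter configuration described in the first bullet might look like this (the flush count here is an arbitrary example, not the value the benchmark used):

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogDocMergePolicy;
import org.apache.lucene.index.SerialMergeScheduler;
import org.apache.lucene.store.Directory;

class BenchmarkWriterSketch {
  static IndexWriter open(Directory dir) throws IOException {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setMergeScheduler(new SerialMergeScheduler());            // merge on the indexing thread
    iwc.setMergePolicy(new LogDocMergePolicy());                  // deterministic, doc-count based merges
    iwc.setMaxBufferedDocs(250_000);                              // flush each segment at a fixed doc count
    iwc.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH); // so only the doc count triggers flushes
    return new IndexWriter(dir, iwc);
  }
}
```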

I compare the legacy spatial module, the recently added GeoPoint queries (new in Elasticsearch 2.2.0 with further improvements coming in Elasticsearch 2.3.0 and included in these tests) and the new dimensional points, coming in Lucene 6.0.0:

Approach    Query time (msec)    Index time (sec)    Index size (MB)    Search-time heap (MB)
Points      13.5                 443.3               630                2.3
Spatial     17.6                 1480.8              7987.2             238.5
GeoPoint    26.9                 87.5                772                4.9

Overall, both GeoPoint and the new dimensional points show substantial improvements across the board over the legacy spatial module, with especially large reductions in index size and search-time heap. GeoPoint indexes much faster than dimensional points, while dimensional points query faster, use less heap, and produce a slightly smaller index.

Remember that no benchmark is perfect, and this one is no exception!

First, it runs only the geo filter in isolation, but in practice a geo filter would normally be executed along with other MUST clauses, where optimizations like two-phased support should have a big impact. Second, it tests single-valued documents, while multi-valued cases will likely behave differently. Third, it is a standalone benchmark, so the HotSpot compiler gets to unnaturally focus only on the specific indexing and searching code. Finally, the query shapes are a set of regularly spaced rectangles around London, UK, not actual user queries.

1D Performance

To test the 1D case, I indexed only the latitude from the same data set, quantized to an integer (a 1D, 4-byte point), and compared dimensional points with the legacy IntField, using regularly spaced, varying-sized range filters (just the 1D projection of the 2D test queries):
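As a hedged illustration of that setup (the exact scale factor is an assumption for this sketch; the benchmark's quantization may differ), indexing latitude as a 1D, 4-byte point and range-filtering it could look like:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.search.Query;

class LatitudeQuantizationSketch {
  /** Illustrative scale factor: keeps ~7 decimal digits of precision and still fits in an int. */
  static final double SCALE = 1e7;

  static void addLatitude(Document doc, double latitudeDegrees) {
    doc.add(new IntPoint("latitude", (int) (latitudeDegrees * SCALE)));  // 1D, 4-byte point
  }

  static Query latitudeRange(double minDegrees, double maxDegrees) {
    return IntPoint.newRangeQuery("latitude",
        (int) (minDegrees * SCALE), (int) (maxDegrees * SCALE));
  }
}
```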

Approach        Query time (msec)    Index time (sec)    Index size (MB)    Search-time heap (MB)
Points          24.7                 86.8                364                2.16
Numeric Field   32.2                 173.9               744                14.4

Dimensional points show faster querying and indexing times, along with a substantial reduction in search-time heap and index size.

The indexing time in particular benefited from a specialized bulk-merging implementation. I suspect a similar optimization may be possible in the 2D case, but it has not yet been explored (patches welcome!).

Moving forwards

Please keep in mind that as of this writing, dimensional points are not yet released, so if you play with this new feature using a snapshot build from Lucene's master branch, be aware that index file formats and APIs are free to change suddenly and drastically up until 6.0 is released (hopefully coming soon). We would love to hear any feedback you may have!

Note that the dimensions need not all be spatial! One exciting potential use case is mixing geo-spatial dimensions with other dimensional values, such as median household income, into a single 3D dimensional index. This would then allow fast range filters against both median household income and arbitrary geo-spatial shapes. Another example is indexing time as one dimension along with other numeric dimensions to support business-intelligence use cases.
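As a sketch of that first idea (the field name, the quantization, and the use of a rectangular box rather than an arbitrary shape are all illustrative assumptions), combining quantized latitude, longitude, and income into one 3D point lets a single query prune on all three dimensions at once:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.search.Query;

class MixedDimensionsSketch {
  /** Index quantized lat, lon, and median household income together as one 3D int point. */
  static void addHousehold(Document doc, double lat, double lon, int medianIncome) {
    doc.add(new IntPoint("geo_income",
        (int) (lat * 1e7), (int) (lon * 1e7), medianIncome));
  }

  /** A lat/lon box plus an income range, evaluated as one 3D range filter. */
  static Query boxAndIncome(double minLat, double maxLat, double minLon, double maxLon,
                            int minIncome, int maxIncome) {
    return IntPoint.newRangeQuery("geo_income",
        new int[] {(int) (minLat * 1e7), (int) (minLon * 1e7), minIncome},
        new int[] {(int) (maxLat * 1e7), (int) (maxLon * 1e7), maxIncome});
  }
}
```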

Currently, dimensional values must be single points, but some geo-spatial use cases require indexing shapes as well, which can be done with a generalization of the k-d tree called an R-tree, for example, and I expect that will be a future improvement to Lucene (patches welcome again!).

[The image at the top was generated by the JavaGeom project]