15 2월 2016 엔지니어링

Multi-dimensional points, coming in Apache Lucene 6.0

By Michael McCandless

Coming in Lucene's next major release (6.0) is a new feature called dimensional points, using the k-d tree geo-spatial data structure to offer fast single- and multi-dimensional numeric range and geo-spatial point-in-shape filtering. As of this writing, Elasticsearch has not yet exposed points, but I expect that will change soon.

This feature replaces the now deprecated numeric fields and numeric range query since it has better overall performance and is more general, allowing up to 8 dimensions (versus 1) and up to 16 bytes (versus the 8 byte limit today) per dimension.

The k-d tree variant we implemented is the block k-d tree which is specifically designed for efficient IO, such that most of the data structure resides in on-disk blocks, with a small in-heap binary tree index structure to locate the blocks at search time.

This means you will finally be able to use Lucene to efficiently index and range-filter anything that can be encoded as fixed-length, ordered  byte[], such as IPv6 InetAddressBigIntegerBigDecimal, etc., along with 2D and 3D (and higher!) geo-spatial indices, and times-series values.

k-d trees

Block k-d trees are a simple yet powerful data structure. At index time, they are built by recursively partitioning the full space of N-dimensional points to be indexed into smaller and smaller rectangular cells, splitting equally along the widest ranging dimension at each step of the recursion. However, unlike an ordinary k-d tree, a block k-d tree stops recursing once there are fewer than a pre-specified (1024 in our case, by default) number of points in the cell.

At that point, all points within that cell are written into one leaf block on disk and the starting file-pointer for that block is saved into an in-heap binary tree structure. In the 1D case, this is simply a full sort of all values, divided into adjacent leaf blocks. There are k-d tree variants that can support removing values, and rebalancing, but Lucene does not need these operations because of its write-once per-segment design.

At search time, the same recursion takes place, testing at each level whether the requested query shape intersects the left or right sub-tree of each dimensional split, and recursing if so. In the 1D case, the query shape is simply a numeric range whereas in the 2D and 3D cases, it is a geo-spatial shape (circle, ring, rectangle, polygon, cube, etc.).

Here is a video showing how the leaf blocks are visited to find all 2D (latitude/longitude) points inside the London, UK polygon based on the PlanetOSM data: 

Once the recursion ends at a leaf block, if the cell overlaps the shape's boundary (blue cells) then each full-precision point in that block is tested against the shape. This check ("does the query shape contain this point?") may be somewhat costly since it is computed per-hit against a possibly complex shape, however it is only done for those leaf cells overlapping the boundary of the shape. If instead the leaf block is fully contained inside the query shape (the pink cells), the documents with values in that cell are efficiently collected without having to test each point.

K-d trees are fast because they naturally adapt to each data set's particular distribution, using small leaf blocks where the indexed values are dense: notice how central London, where there is a higher density of points, is assigned smaller leaf cells. K-d trees also naturally find the right tradeoff of how deeply to recurse, by splitting up the dimensional space, versus at what point simply scanning the full-precision values in one cell is appropriate.  This is in contrast to legacy numeric fields which always index the same precision levels for every value (4 terms for a long value, by default) regardless of how the points are distributed. 


Lucene's implementation

Each indexed value, for example a single IntPoint added to your document, is translated into a fixed-length byte[] for a specific number of dimensions. There are classes for each native type (LongPoint, FloatPoint, etc.) that handle converting values of that type to the matching byte[]. For a given field name, all documents in the index must have the same number of dimensions and same byte[] length across dimensions. Multi-valued fields are allowed, by adding the same field name multiple times in one document.

Lucene's IndexWriter buffers the points you have indexed, and then uses PointFormat, a newly added codec component, to write the values to a new segment.  The default codec supports points, but the simple text codec does as well (it was the first implementation!), so you can see the leaf blocks in a simple, human-readable plain text file if you are curious (do not use it in production!).

In developing the feature there were also some exciting low-level infrastructure improvements required, including switching our offline sorter to use Lucene's Directory abstraction instead of direct filesystem IO, adding a new Directory.createTempOutput method to create new temporary files for writing and moving delete retry (on Windows) responsibility under Directory.

There are already several Lucene geo-spatial queries that are based on dimensional values, including the sandbox geo spatial queries (PointInRectQueryPointInPolygonQuery,PointDistanceQuery coming soonish), and the spatial3d module indexes and searches 3D (x, y, z) dimensional values. In Lucene's core there is also PointRangeQuery, to filter by an N-dimensional box, and ExactPointQuery to query for exactly a single point.

2D Performance

To assess the 2D performance impact, I created a simple benchmark (all sources are here) to index and filter a 60.8 Million latitude/longitude points subset of the PlanetOSM data set generously provided by the OpenStreetMaps.org foundation, querying with a set of regularly spaced varying sized rectangles around London, UK. I recorded these four metrics:

  • Time to build the index, using a single thread and SerialMergeSchedule  with LogDocMergePolicy and fixed document count written to each segment, to keep comparisons fair, since the resulting indices have exactly the same segment structure
  • Size on disk of the resulting index
  • Heap size of the in-memory search-time index structures
  • Query time (average msec for 225 queries, best across 50 iterations)

I compare the legacy spatial module, the recently added GeoPoint queries (new in Elasticsearch 2.2.0 with further improvements coming in Elasticsearch 2.3.0 and included in these tests) and the new dimensional points, coming in Lucene 6.0.0:

Overall, both GeoPoint and the new dimensional points show substantial improvements across the board over the legacy spatial module, with an especially large reduction in index size and search time heap used, while GeoPoint has much faster indexing time than dimensional points, and dimensional points show faster querying time, smaller heap and slightly smaller index size.

Remember that no benchmark is perfect, and this one is no exception! 

First, it runs only the geo filter in isolation, but in practice a geo filter would normally be executed along with other MUST clauses where optimizations like two-phased support should have a big impact.  Second, it tests single-valued documents, while multi-valued cases will likely have different behavior.  Third, it's a standalone benchmark, so the hotspot compiler gets to unnaturally focus only on the specific indexing and searching code. Finally, the query shapes are a set of regularly spaced rectangles around London, UK, not actual user queries.

1D Performance

To test the 1D case, I indexed only latitude from the same data set, quantized to an integer (1D 4 byte point), and compared dimensional points with the legacy IntField, using regularly spaced varying sized range filters (just the 1D projection of the 2D test queries):

Dimensional points shows faster querying and indexing time, and a substantial reduction in search time heap used and index size.

The indexing time in particular benefited from a specialized bulk-merging implementation. I suspect a similar optimization may be possible in the 2D case but it has not yet been explored (patches welcome!).

Moving forwards

Please keep in mind that as of this writing, dimensional points are not yet released, so if you play with this new feature, using a snapshot build from Lucene's master branch, index file formats and APIs are free to suddenly and drastically change up until when 6.0 is released (hopefully coming soon). We would love to hear any feedback you may have!

Note that the dimensions need not all be spatial! One exciting potential use case is mixing geo-spatial dimensions with other dimensional values, such as median household income, into a single 3D dimensional index. This would then allow fast range filters against both median household income and arbitrary geo-spatial shapes. Another example is indexing time as one dimension along with other numeric dimensions to support business-intelligence use cases.

Currently, dimensional values must be single points, but some geo-spatial use cases require indexing shapes as well, which can be done with a generalization of the k-d tree called an R-Tree, for example, and I expect that will be a future improvement to Lucene (patches welcome again!).

[The image on the top was generated by the JavaGeom project]