Supercharging geo_point fields in Elasticsearch 2.2

“I want to location enable my application but where do I start? Elasticsearch has had geo support for some time but how does it work? What are all of those parameters for? What kind of performance can I expect?” The explosive growth of geospatial and spatiotemporal data for so many diverse use-cases emphasizes the need for efficient geospatial data structures and analysis tools. With many Geospatial specialty applications and libraries already available, few offer the scalability and flexibility of combined full text search and aggregations with location based information. Enter Elasticsearch. We hope this post answers some of those nagging questions regarding how geospatial field types work and how you can begin location enabling your Elasticsearch-driven applications.

So what's new?

Elasticsearch has supported geospatial fields for some time; and a short while ago we provided a demo and overview of the new features and performance improvements available in 2.0. While the existing geospatial field types benefit from some of the low level improvements in 2.0 (specifically doc_values), the 2.2 release supercharges geo_point fields by improving the underlying data structure and query approach.

Indexed by default

Prior to 2.2, geo_point fields used no default indexing structure (aside from a not_analyzed “lat,lon” string field). This is because, until Lucene 5.3, a core Geo data structure based on the internal inverted index (that did not require a third-party library associated with a restrictive software license) did not exist. In order to circumvent this limitation, so the geo Query DSL could be used, the geo_point field mapping required specifying at least one of: lat_lon : true, geohash : true. At the implementation level the lat_lon option stored latitude and longitude data as two separate numeric fields, while the geohash option encoded all geo_point fields as an alphanumeric string based geohash. Since both structures are based on the number, and string core types, respectively, supporting 2D geo data indexing required no additional indexing code.

With the release of Lucene 5.3, a new specialized GeoPointField type is now available. This field type is built on the internal inverted index structure and all of the amazing work accomplished by the Lucene Community to make this structure as efficient and performant as possible. While an inverted index is not a typical structure used for geospatial data, it can be remarkably fast when applied correctly; and since the structure reuses all of the same codec logic implemented by Lucene, all of the dangerous corruption issues related to introducing new file types for new data structures are less of a concern. In order to work with the inverted index, the GeoPointField type uses a quad-tree raster graphics based approach by encoding latitude and longitude values as a single 64 bit integer and using variable length prefix codes for the terms in the term dictionary. The following graphic illustrates this technique for geo_point data.

Graphic above is based on a graphic from Spatial Keys - Memory Efficient Geohashes by Karussell.

Further encoding improvements were then applied to minimize the size of the terms dictionary, thus minimizing the overall size of the index. With the performance improvements to doc values there is no need to store every prefix term up to the full resolution. Each point can be approximated using the prefix as a precision step of the top 32 most significant bits. Points along the search boundary can then be further scrutinized using full precision obtained from doc_values. This two-phase approach strikes a balance between index size and processing time.

Rasterized queries

The new GeoPointField approach brings improved efficiency for all geo query capabilities available in Elasticsearch. Prior to 2.2, geo_point field queries were accomplished using a two-phase approach (two-phase iterator in the 2.0 release). The first phase used either a prefix stemmer for geohash indexed points, or lat/lon numeric range queries for lat_lon indexed points to query based on the bounding box of the search area. The full precision results were then checked against the search criteria (e.g., polygon, distance, distance range). While the improvements to the two-phase iterator improved query efficiency (especially for bounding boxes), complex distance and polygon queries still required a check of every point within the bounding box.

The 2.2 indexing improvements minimize these checks by applying the benefits provided by the inverted index. As illustrated by the following distance query graphic (taken from a live demo in this YouTube video), each query is “rasterized” into a set of “within” and “boundary” terms.

To minimize the number of terms visited, the rasterization step attempts to maximize the coverage area of each term (cell). Terms that intersect the query shape use doc values to further scrutinize candidate points, while terms that are fully contained by the query area simply return all documents from the postings list. This raster decomposition technique reduces the number of brute force checks required by throwing out all terms that are represented by the bounding box but not the shape.

Streamlined mapping parameters

Prior to 2.2, the geo_point field included 8 optional parameters whose default values were partially based on best practices. For example, whether to index lat_lon or geohash (or both) was often a question that could not be answered until much later in the maturity of the application use-case. By that time geo queries may have already become the culprit for poor quality of service. Compliance with geospatial data standards, created by the Open Geospatial Consortium and International Standards Organization, have been a work in progress requiring leniency options such as coerce and ignore_malformed. Other “expert” parameters such as precision_step and doc_values are becoming more “look don’t touch” than valuable performance enhancing tools.

Since the new GeoPointField structure is designed to obtain the best geospatial index and search performance possible within the construct of the inverted index, the number of parameters is being reduced in the coming releases. The image below provides the parameters for the 2.2 release.

The coerce and doc_values options are completely removed as they are already handled by the new GeoPointField type. The ignore_malformed option will remain to provide users with the flexibility of throwing an exception if Elasticsearch receives non-compliant data (e.g., points outside the standard coordinate system). For backward compatibility the remaining parameters will remain. Users can still store geo_points in lat_lon and/or geohash sub-fields (setting the numeric precision_step, geohash_precision, and geohash_prefix parameters will still affect the size of the terms dictionary) but the geo Query DSL will no longer search using these sub-fields. They are purely for retrieving the geo_point data using the .lat, .lon, or .geohash path.

Performance results

The performance results below are a 1.5 week average benchmark using the following parameters: index 60.8M documents (attributed to Planet OSM) using 8 client threads and 5000 docs per _bulk request against a single node running on a Intel Core i7-4760HQ (4 real cores, 8 with hyperthreading), 16 GB RAM and 500 GB Samsung 840 EVO mSATA SSD.

While the benchmarks demonstrate a significant supercharge in performance using the new indexing and query approach, we feel we can do far better using a data structure designed specifically for multi-dimensional geospatial data. For this reason we are working hard to bring a new dynamic tree structure specifically designed for spatial data to a future version of Lucene and Elasticsearch. Stay tuned...