January 7, 2016

Supercharging geo_point fields in Elasticsearch 2.2

“I want to location-enable my application, but where do I start? Elasticsearch has had geo support for some time, but how does it work? What are all of those parameters for? What kind of performance can I expect?” The explosive growth of geospatial and spatiotemporal data across so many diverse use cases emphasizes the need for efficient geospatial data structures and analysis tools. Although many specialized geospatial applications and libraries are already available, few offer the scalability and flexibility of combining full-text search and aggregations with location-based information. Enter Elasticsearch. We hope this post answers some of those nagging questions about how geospatial field types work and how you can begin location-enabling your Elasticsearch-driven applications.

So what's new?

Elasticsearch has supported geospatial fields for some time, and a short while ago we provided a demo and overview of the new features and performance improvements available in 2.0. While the existing geospatial field types benefit from some of the low-level improvements in 2.0 (specifically doc_values), the 2.2 release supercharges geo_point fields by improving the underlying data structure and query approach.

Indexed by default

Prior to 2.2, geo_point fields used no default indexing structure (aside from a not_analyzed “lat,lon” string field). This is because, until Lucene 5.3, there was no core geo data structure based on the internal inverted index (that is, one that did not require a third-party library with a restrictive software license). To work around this limitation so that the geo Query DSL could be used, the geo_point field mapping required at least one of lat_lon : true or geohash : true. At the implementation level, the lat_lon option stored latitude and longitude as two separate numeric fields, while the geohash option encoded each geo_point as an alphanumeric geohash string. Since these structures are based on the number and string core types, respectively, supporting 2D geo data required no additional indexing code.
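As a rough illustration of the pre-2.2 requirement described above, the mapping body might look like the following sketch. The type name `place` and the field name `location` are hypothetical; only the `geo_point` type and the `lat_lon` and `geohash` options come from the text.

```python
import json

# Hypothetical type ("place") and field ("location") names, for illustration only.
pre_22_mapping = {
    "mappings": {
        "place": {
            "properties": {
                "location": {
                    "type": "geo_point",
                    "lat_lon": True,  # index latitude and longitude as two numeric fields
                    "geohash": True,  # also index the point as a geohash string
                }
            }
        }
    }
}

# Print the JSON body that would be sent when creating the index.
print(json.dumps(pre_22_mapping, indent=2))
```

Setting only one of the two options would satisfy the requirement; both are shown here simply to make each visible.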

With the release of Lucene 5.3, a new specialized GeoPointField type is now available. This field type is built on the internal inverted index structure and on all of the work the Lucene community has done to make that structure as efficient and performant as possible. While an inverted index is not a typical structure for geospatial data, it can be remarkably fast when applied correctly. And since the structure reuses the same codec logic already implemented by Lucene, the corruption risks that come with introducing new file types for new data structures are less of a concern. To work with the inverted index, the GeoPointField type takes a quad-tree approach borrowed from raster graphics: it encodes the latitude and longitude values as a single 64-bit integer and uses variable-length prefix codes for the terms in the term dictionary. The following graphic illustrates this technique for geo_point data.

Graphic above is based on a graphic from Spatial Keys - Memory Efficient Geohashes by Karussell.

Further encoding improvements were then applied to minimize the size of the terms dictionary, and with it the overall size of the index. Thanks to the performance improvements to doc values, there is no need to store every prefix term up to the full resolution. Instead, each point is approximated in the index by prefix terms that stop at the 32 most significant bits. Points along the search boundary can then be checked at full precision using the values obtained from doc_values. This two-phase approach strikes a balance between index size and processing time.
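To get a feel for what a 32-bit prefix resolves, here is a small back-of-the-envelope calculation. It assumes the 64 bits alternate evenly between longitude and latitude, so a 32-bit prefix carries 16 bits per axis; the metres-per-degree constant is a rough equatorial approximation, not a value from the original post.

```python
PREFIX_BITS = 32
bits_per_axis = PREFIX_BITS // 2  # assumption: bits alternate between lon and lat

# Width and height of one cell at this resolution, in degrees.
lon_cell_deg = 360.0 / (1 << bits_per_axis)
lat_cell_deg = 180.0 / (1 << bits_per_axis)

METERS_PER_DEGREE = 111_320.0  # rough approximation at the equator

print(f"cell width:  {lon_cell_deg:.6f} deg (~{lon_cell_deg * METERS_PER_DEGREE:.0f} m)")
print(f"cell height: {lat_cell_deg:.6f} deg (~{lat_cell_deg * METERS_PER_DEGREE:.0f} m)")
```

Under these assumptions the indexed terms locate a point to within a cell a few hundred metres across, and the doc values supply the remaining precision only for points that need it.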

Rasterized queries

The new GeoPointField approach brings improved efficiency to all of the geo query capabilities available in Elasticsearch. Prior to 2.2, geo_point field queries were accomplished using a two-phase approach (a two-phase iterator in the 2.0 release). The first phase queried on the bounding box of the search area, using either a prefix stemmer for geohash indexed points or lat/lon numeric range queries for lat_lon indexed points. The full-precision results were then checked against the search criteria (e.g., polygon, distance, distance range). While the two-phase iterator improved query efficiency (especially for bounding boxes), complex distance and polygon queries still required a check of every point within the bounding box.

The 2.2 indexing improvements minimize these checks by applying the benefits provided by the inverted index. As illustrated by the following distance query graphic (taken from a live demo in this YouTube video), each query is “rasterized” into a set of “within” and “boundary” terms.

To minimize the number of terms visited, the rasterization step attempts to maximize the coverage area of each term (cell). Terms that intersect the boundary of the query shape use doc values to further scrutinize candidate points, while terms that are fully contained by the query area simply return all documents from the postings list. This raster decomposition reduces the number of brute-force checks required by discarding all terms that fall inside the bounding box but outside the shape.
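The decomposition described above can be sketched as a recursive quad-tree split. This is a simplified illustration under stated assumptions, not the Lucene algorithm: it uses flat (planar) coordinates and Euclidean distance, and the function names and the `max_depth` cutoff are hypothetical.

```python
import math


def classify(cell, center, radius):
    """Return 'within', 'boundary' or 'outside' for a rectangular cell against a circle."""
    min_x, min_y, max_x, max_y = cell
    cx, cy = center
    # Distance from the circle centre to the nearest point of the cell.
    nx = min(max(cx, min_x), max_x)
    ny = min(max(cy, min_y), max_y)
    if math.hypot(nx - cx, ny - cy) > radius:
        return "outside"
    # Distance from the circle centre to the farthest corner of the cell.
    fx = max(abs(min_x - cx), abs(max_x - cx))
    fy = max(abs(min_y - cy), abs(max_y - cy))
    if math.hypot(fx, fy) <= radius:
        return "within"
    return "boundary"


def rasterize(cell, center, radius, max_depth):
    """Split `cell` into 'within' cells and finest-level 'boundary' cells."""
    within, boundary = [], []

    def visit(c, depth):
        kind = classify(c, center, radius)
        if kind == "outside":
            return  # discarded: no point here can match
        if kind == "within":
            within.append(c)  # every point matches, no per-point check needed
            return
        if depth == max_depth:
            boundary.append(c)  # points here need a full-precision check
            return
        min_x, min_y, max_x, max_y = c
        mid_x, mid_y = (min_x + max_x) / 2, (min_y + max_y) / 2
        for child in (
            (min_x, min_y, mid_x, mid_y),
            (mid_x, min_y, max_x, mid_y),
            (min_x, mid_y, mid_x, max_y),
            (mid_x, mid_y, max_x, max_y),
        ):
            visit(child, depth + 1)

    visit(cell, 0)
    return within, boundary


within, boundary = rasterize((-8.0, -8.0, 8.0, 8.0), (0.0, 0.0), 5.0, max_depth=4)
print(len(within), "within cells,", len(boundary), "boundary cells")
```

Large cells are accepted as soon as they fit entirely inside the circle, so the interior is covered by a handful of coarse cells while only the thin ring of boundary cells requires per-point distance checks.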

Streamlined mapping parameters

Prior to 2.2, the geo_point field included 8 optional parameters whose default values were partially based on best practices. For example, whether to index lat_lon or geohash (or both) was often a question that could not be answered until much later in the maturity of the application use case. By that time, geo queries may have already become the culprit for poor quality of service. Compliance with geospatial data standards, created by the Open Geospatial Consortium and the International Organization for Standardization, has been a work in progress requiring leniency options such as coerce and ignore_malformed. Other “expert” parameters such as precision_step and doc_values are becoming more “look, don’t touch” than valuable performance-enhancing tools.

Since the new GeoPointField structure is designed to obtain the best geospatial index and search performance possible within the construct of the inverted index, the number of parameters is being reduced in the coming releases. The image below provides the parameters for the 2.2 release.

The coerce and doc_values options are completely removed, as they are already handled by the new GeoPointField type. The ignore_malformed option will remain, giving users the flexibility of throwing an exception if Elasticsearch receives non-compliant data (e.g., points outside the standard coordinate system). The other parameters are kept for backward compatibility. Users can still store geo_points in lat_lon and/or geohash sub-fields (setting the numeric precision_step, geohash_precision, and geohash_prefix parameters will still affect the size of the terms dictionary), but the geo Query DSL will no longer search using these sub-fields. They are purely for retrieving the geo_point data using the .lat, .lon, or .geohash path.
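A minimal sketch of what this means in practice: in 2.2 a bare geo_point mapping should be enough for the geo Query DSL to work. The type name `place`, the field name `location`, and the coordinates are hypothetical, and `ignore_malformed` is spelled out only to make the option visible.

```python
import json

# Hypothetical type ("place") and field ("location") names, for illustration only.
mapping_22 = {
    "mappings": {
        "place": {
            "properties": {
                "location": {
                    "type": "geo_point",
                    # Reject non-compliant points instead of silently skipping them.
                    "ignore_malformed": False,
                }
            }
        }
    }
}

# A distance query against the same field; no lat_lon or geohash sub-field is needed.
distance_query = {
    "query": {
        "bool": {
            "filter": {
                "geo_distance": {
                    "distance": "10km",
                    "location": {"lat": 40.7128, "lon": -74.0060},
                }
            }
        }
    }
}

print(json.dumps(mapping_22, indent=2))
print(json.dumps(distance_query, indent=2))
```

Compared with the pre-2.2 requirement, the mapping no longer has to commit to lat_lon or geohash up front for queries to be usable.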

Performance results

The performance results below are a 1.5-week average benchmark using the following parameters: indexing 60.8M documents (attributed to Planet OSM) using 8 client threads and 5000 docs per _bulk request against a single node running on an Intel Core i7-4760HQ (4 real cores, 8 with hyperthreading), 16 GB RAM, and a 500 GB Samsung 840 EVO mSATA SSD.

While the benchmarks demonstrate a significant supercharge in performance using the new indexing and query approach, we feel we can do far better using a data structure designed specifically for multi-dimensional geospatial data. For this reason we are working hard to bring a new dynamic tree structure specifically designed for spatial data to a future version of Lucene and Elasticsearch. Stay tuned...
