This Week in Elasticsearch and Apache Lucene - 2016-04-18
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
#elasticsearch 5.0 will use the new Lucene 6 points API to index numeric, date and ip fields
— elastic (@elastic) April 15, 2016
Elasticsearch Core
Changes in 2.x:
- Fixed a CORS bug in pre-flight requests.
- Selecting concrete indices to restore could result in incorrect selection.
- Fixed a bug which prevented reallocating shards unless the cluster was green.
- Added a circuit breaker to limit the total size of in-flight requests.
- The extended stats aggregation could produce incorrect results in the presence of missing buckets.
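To give a feel for the in-flight request circuit breaker mentioned above, here is a minimal Java sketch of the general idea: track the bytes currently being handled and reject a new request when the total would cross a limit. The class name, method names and exception type are hypothetical and chosen for illustration; this is not Elasticsearch's implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: rejects new requests once the total bytes in flight would exceed a limit. */
final class InFlightRequestBreaker {
    private final long limitBytes;
    private final AtomicLong inFlightBytes = new AtomicLong();

    InFlightRequestBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Reserve space for a request, or throw if the limit would be exceeded. */
    void acquire(long requestBytes) {
        long newTotal = inFlightBytes.addAndGet(requestBytes);
        if (newTotal > limitBytes) {
            // Roll back the reservation so a rejected request does not count against the limit.
            inFlightBytes.addAndGet(-requestBytes);
            throw new IllegalStateException(
                "in-flight requests would use " + newTotal + " bytes, limit is " + limitBytes);
        }
    }

    /** Release the reservation once the request has been fully handled. */
    void release(long requestBytes) {
        inFlightBytes.addAndGet(-requestBytes);
    }

    long inFlightBytes() {
        return inFlightBytes.get();
    }
}
```

A caller would pair `acquire` with `release` in a try/finally block so that failed requests also give their bytes back.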
Changes in master:
- Upgrade to Lucene 6.0.0 and switch numeric, date and IP fields across to the new point format, including exposing point field stats in the segments API. Elasticsearch now supports IPv6!
- JVM options now have their own config file.
- The query cache can be disabled with a per-index setting.
- Added ignore_unmapped option to parent-child, nested, and geo queries to allow multi-index queries to work on indices which don't have the appropriate mappings. This allowed deprecating the indices query.
- The reindex API supports disabled _source gracefully.
- Field stats now treats floats and doubles as the same field type.
- Bootstrap checks are now triggered when the node is bound to any host other than localhost, and output all failures at once. Also added bootstrap checks for correctly configured heap size and the max map count for virtual memory.
- Several refactorings of mappings to make the code cleaner and easier to follow.
- Percolator queries now support position_increment_gap.
- EC2 discovery is now tested.
- Improved analysis of wildcards and stacked tokens in query strings.
- Shard-level bulk action tasks now track their parent tasks correctly.
- Tidy-ups to the IndicesService class in preparation for adding deleted index tombstones.
Ongoing:
- Many PRs to remove PROTOTYPE and use aggregation registry.
- Adding tombstones to the cluster state for deleted indices.
- The allocation explain API will display shard store info when appropriate.
- It should be possible to disable strict JSON quoting for bwc during 5.x.
- The task manager should be able to persist the results of long running tasks like reindex.
- EmptyQueryBuilder is being removed.
- The Java HTTP client now supports sniffing; work continues on connection pooling.
- The .percolator type will be replaced by a percolator field.
- Indexed scripts will change to stored scripts and live in the cluster state, instead of in an index. Should there be a soft-limit on how many scripts can be stored?
- The DSL for inner hits is being cleaned up, and the top-level inner hits DSL is being removed.
- Aggregation names can overwrite other keys like doc_count. Is this a problem?
Apache Lucene
- It is nearly impossible to come to agreement on how best to name Lucene's numerous new geospatial search implementations, thus demonstrating yet again the Law of Triviality
- Here's a nice 3D visualization of how Lucene's new dimensional points slice up the surface of the earth for fast searching
- Lucene's geo benchmark, which tests 61 M points exported from OpenStreetMap, gets new features, including testing distance sort, filters and nearest neighbor performance, an option to pre-build queries, and reporting overall M-hits/sec in addition to QPS
- Geo3D continues to move at a fast pace:
  - Performance improvements when searching for large polygons
  - "Wacko" random polygons cause tricky test failures
  - Testing for shape intersections with a polygon is no longer N^2 in the number of polygon vertices
  - Tests caught a garden variety attempt to create an illegal shape
  - Finding an interior point was a big bottleneck for constructing geo3d polygon queries
  - Geo3d tests have the most sophisticated BKD forensics of all our tests, so you can see precisely why a given doc did or did not match
  - Convex polygons were being mis-classified as concave resulting in way too many hits in our geo benchmark
  - Geo3D needs support for sort-by-distance and nearest neighbor as well
  - Tiling the incoming polygon is a costly part of geo3d's polygon query
- Geo2d does as well:
  - A nice simple point-in-polygon algorithm from the 70s gives a speedup to LatLonPoint's polygon query
  - LatLonPoint gets a fast nearest neighbor search, thanks to the efficient BKD tree, to find the N nearest indexed points to a provided query point
  - Speed up LatLonPoint.newDistanceQuery by working with haversinSortKey instead of the full haversin distance
  - A spooky test failure turned out to be an innocent test bug
  - LatLonPoint's distance query has become so fast that we decided to remove two-phased support, since its overhead is not worth its savings
  - The grid we use to speed up LatLonPoint's polygon queries struggled with wee tiny polygons
  - Our new, more evil random lat/lon generation uncovered a tricky test failure by creating a "rectangle" that was in fact a line!
  - The EarthDebugger gets some improvements such as stating which location you want the earth to rotate to on load, control over the rectangle colors, and some performance improvements
  - Better random latitude/longitude generation continues to uncover issues
  - We have moved common geo encoding APIs to core so they can be shared across implementations
  - A new encoding for GeoPointField will be consistent with LatLonPoint, and use all 64 available bits to minimize quantization error
- The XML query parser is conflicted about whether lower and upper bounds on ranges are optional
- XMLQueryParser's tests now let subclasses pick the analyzer
- The legacy spatial module gets faster by switching from FixedBitSet to DocIdSetBuilder, matching how the three new geo implementations work
- The DataSplitter in Lucene's classification module should pay attention to classes when splitting
- Remove a wrong copy-paste comment in our replicator module
- The details included in exception messages are important
- NRTCachingDirectory had a sneaky concurrency bug
- Our release scripts still struggle with the switch from Subversion to git
- Randomized tests uncovered an extremely rare failure when two randomly generated doubles were exactly 2 ulps apart
- Highlighting fails to find terms inside the child query of a BlockJoinQuery
- Yes, Lucene 7.0 will in fact support all 6.x indices
- Math.toRadians is changing its results slightly between Java 1.8 and 1.9, so we are trying to avoid relying on it
- Moving ValueSource and FunctionValues to Lucene's core can break inter-module dependencies
- ToParentBlockJoinQuery's explain is lame today and should instead include the explanation from its children
- Query parsers are confusing when a clause has only stop words
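For readers wondering why a sort key can stand in for the full haversine distance in the LatLonPoint distance query item above, here is a small Java sketch of the underlying math. The haversine distance is 2R·asin(√h), and since asin and the square root both increase monotonically on [0, 1], comparing the intermediate term h orders points exactly as comparing distances would. The class name, method names and the Earth radius constant are assumptions made for this sketch; it is not Lucene's code.

```java
/** Hypothetical sketch: ordering points by distance without computing the distance itself. */
final class HaversineSketch {
    // Mean Earth radius in meters; an assumption of this sketch.
    private static final double EARTH_RADIUS_METERS = 6_371_008.8;

    /** Cheap key that grows monotonically with great-circle distance: the haversine term h in [0, 1]. */
    static double sortKey(double lat1, double lon1, double lat2, double lon2) {
        double phi1 = lat1 * Math.PI / 180.0;
        double phi2 = lat2 * Math.PI / 180.0;
        double sinHalfDLat = Math.sin((phi2 - phi1) / 2);
        double sinHalfDLon = Math.sin((lon2 - lon1) * Math.PI / 180.0 / 2);
        return sinHalfDLat * sinHalfDLat
            + Math.cos(phi1) * Math.cos(phi2) * sinHalfDLon * sinHalfDLon;
    }

    /** Full distance in meters; only needed when an actual distance has to be reported. */
    static double distanceMeters(double key) {
        return 2 * EARTH_RADIUS_METERS * Math.asin(Math.min(1.0, Math.sqrt(key)));
    }

    /** Converts a query radius (at most half the Earth's circumference) into a key, once per query. */
    static double keyForRadius(double radiusMeters) {
        double s = Math.sin(radiusMeters / (2 * EARTH_RADIUS_METERS));
        return s * s;
    }
}
```

Under these assumptions, a distance filter would call `keyForRadius` once and then compare `sortKey` values per document, keeping `asin` and `sqrt` out of the hot loop.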
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!