This Week in Elasticsearch and Apache Lucene - 2016-04-05
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
#Elasticsearch 5.0.0-alpha1 release w/#Lucene 6, ingest node, & more!
Details: go.es.io/1MQddXD
— elastic (@elastic) April 5, 2016
Elasticsearch Core
Changes in 2.3:
- Check that a translog is still open when asking for a new view on it.
- Some columns in the cat APIs had duplicate column aliases.
- Fixed an ArrayOutOfBounds exception when running aggregations on shards without values.
Changes in master:
- The new /_cluster/allocation/explain API explains why a shard can or cannot be allocated to nodes in the cluster.
- Type filters no longer impact query time when there is only one type.
- Dynamically added string fields now add a main "text" field and a sub "keyword" field. Text fields have fielddata disabled by default.
- New dynamically settable soft limits added to protect unaware users from dangerous practices:
  - Limit the number of fields that can be added to an index.
  - Limit the maximum depth of mappings.
  - Limit the number of shards that can be searched.
- Node attributes must now be specified with `node.attr.xxx` instead of `node.xxx`.
- The `node.client` setting has been removed in favour of `node.master|data|ingest`.
- Throttling of an in-flight reindex request can now be updated dynamically.
- The task management API can now return tasks grouped by parent task.
- Explain on percolator queries now only runs on queries which could match.
- The percolator query now supports scoring.
- Fixed a bug allowing OOMs when recovering from the translog.
- Removed the deprecated "reverse" option from sorting.
- Don't hide stack traces when throwing exceptions.
- Translog configuration is now immutable.
- Cluster health checks should wait for the state to be applied, not ignore in-flight requests.
- Inner hits has been refactored, which means that the search refactoring is now complete, bar some minor cleanups.
- The convert ingest processor now supports an auto option to auto-detect date, boolean, and numeric types.
- The IndexOperationListener now reports whether a document was created or not.
- The Painless code has been cleaned up: all Java code has been moved out of the ANTLR grammars, error messages have been improved, and access to _score has been optimized.
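The dynamic string mapping change listed above can be pictured with a small sketch. This is only an illustration of the shape such a mapping might take, not the exact output of any release: the helper names `dynamic_string_mapping` and `sub_field_path` are made up for this example, and the `ignore_above` value of 256 is an assumption.

```python
import json


def dynamic_string_mapping(ignore_above=256):
    """Sketch of the mapping a dynamically added string field might receive.

    The main field is analyzed "text"; the "keyword" sub field keeps the
    exact value. The ignore_above default here is an assumption.
    """
    return {
        "type": "text",
        "fields": {
            "keyword": {"type": "keyword", "ignore_above": ignore_above},
        },
    }


def sub_field_path(field_name, sub_field="keyword"):
    """Return the dotted path used to address a sub field (hypothetical helper)."""
    return f"{field_name}.{sub_field}"


if __name__ == "__main__":
    # Print the mapping a hypothetical "title" field would end up with.
    mapping = {"properties": {"title": dynamic_string_mapping()}}
    print(json.dumps(mapping, indent=2))
    print(sub_field_path("title"))
```

With a mapping of this shape, full-text queries would target `title`, while sorting and aggregations would target `title.keyword`, since fielddata is disabled on the text field by default.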
Ongoing changes:
- Work continues on removing PROTOTYPE from our code base.
- Adding index deletion tombstones to the cluster state to prevent old indices from popping back into existence.
- The task management API should indicate which tasks can be cancelled.
- The function_score query will learn how to combine scores from multiple queries.
Apache Lucene
- It looks like we will release Lucene 6.1.0 before 6.0.0!
- The second release candidate for 6.0.0 is out! Go test it and vote!
- Distance queries get much faster with a better test for whether a BKD cell overlaps a circle on the earth's surface, but required this cool whole-earth debugger to help understand the tricky cases
- We now have much better `Polygon` support, including multi-polygons, optionally containing holes, such that we can run real-world polygons, like Russia, without exhausting a 10 GB Java heap
- The newly created `GeoTestUtil` now has useful APIs for making random surprise-me polygons like these exotic nuclear-warfare-like shapes, and the base test class is now simpler
- Spatial tests now use `SerialMergeScheduler` for better reproducibility
- The bare essential geo spatial utility APIs are moving to core and being consolidated so all spatial modules can share them
- `LatLonPoint` and `GeoPointField` should quantize in exactly the same way
- We now use precisely the same constant for the mean radius of the earth when it's modeled (approximately) as a sphere
- It's tricky to get javadocs working across our spatial modules
- Our release tools still have remnants of subversion, and struggle with how we name our release branches
- Geo3d gets easy-to-use APIs matching our geo2d APIs
- The document classifier confusion matrix had buggy accuracy and precision calculations
- The `spatial-extras` module has cutover to points
- `OfflineSorter` more efficiently handles fixed-width values used by dimensional points
- We are struggling with query-time quantization issues with `LatLonPoint` and `GeoPointField`, including `NaN`s
- Reduce the number of polygon utility methods
- We now sometimes test triangle shapes in our geo tests
- `GeoPointRangeDistanceQuery` does not work with multi-valued documents
- `MoreLikeThisQuery` should keep track of which terms came from which fields, but this seems to cause at least one test failure
- Improve testing for long ordinals in `BKDWriter` without having to index 2.1 billion points
- `OfflineSorter` should not always merge down to one segment in the end
- `GeoPointField` should use the same full 64 bit encoding as `LatLonPoint`
- Geo3d will also support polygons with holes, but handling "sideness" of a polygon is somewhat tricky for `geo3d`
- We can optimize polygon queries with faster checks for whether BKD cells overlap the query polygon
- Sometimes, `BooleanQuery`'s explain method can lie about its score
- Document classifier should also look at numeric fields
- Should Lucene support boolean subset matching?
- The legacy `UninvertingReader` class won't get multi-valued points support
- `SpanNearQuery` can assign the wrong score when inner clauses overlap
- Our web site still embarrassingly shows the latest subversion commits!
- Another randomized geo test failure, this time on a tiny radius (14.3 cm!)
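To see why the item above about sharing one constant for the mean earth radius matters, here is a minimal sketch of a spherical (haversine) distance in Python. It is not Lucene's code: the function name is invented for this example, and the radius value is an assumed, commonly quoted mean radius. Because the distance scales linearly with the radius, two modules using slightly different constants can disagree on whether a point falls inside a small circle.

```python
import math

# Assumed value: a commonly quoted mean earth radius, in meters.
EARTH_MEAN_RADIUS_METERS = 6_371_008.7714


def haversine_meters(lat1, lon1, lat2, lon2, radius=EARTH_MEAN_RADIUS_METERS):
    """Great-circle distance between two lat/lon points (degrees) on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 so rounding error cannot push asin out of its domain.
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))


if __name__ == "__main__":
    # Same two points, two slightly different radius constants.
    d_shared = haversine_meters(48.8566, 2.3522, 51.5074, -0.1278)
    d_other = haversine_meters(48.8566, 2.3522, 51.5074, -0.1278,
                               radius=6_371_000.0)
    print(d_shared, d_other, d_shared - d_other)
```

Running the script prints both distances and their difference, which is small relative to the total but far larger than the centimeter-scale radii that the randomized geo tests exercise.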
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!