This Week in Elasticsearch and Apache Lucene - Core Changes
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
What would life be like without #Elasticsearch? #Elasticon attendees answer: https://t.co/46K48dul4Y pic.twitter.com/3dDINVzDec
— elastic (@elastic) February 19, 2016
Elasticsearch Core
Changes in 2.2:
- Snapshot/restore now verifies that the index being restored is compatible with the version of the node doing the restore.
- The bulk API no longer broadcasts deletes to all shards, and will fail if custom routing is enabled and no routing value is specified.
Changes in 2.3:
- Nodes will only accept transport requests once they are fully initialized.
- Groovy accepted our pull request, which means that the suppressAccessChecks permission is no longer required.
Changes in master:
- Document IDs now have a hard limit of 512 bytes.
- The HTTP address and port are now available in cat-nodes and cat-nodeattrs.
- The Painless scripting language is now a module, which means that it will ship by default.
- Log4J is now the only supported logger wrapper and may yet be removed in favour of java.util.logging.
- Using a custom network.host setting as a proxy for "production cluster" allows us to upgrade soft warnings (in dev mode) to hard exceptions. This change has proved controversial, as configuring the maximum number of open file handles on OS X is overly complex.
- Elasticsearch now checks on startup that all data paths are writable.
- G1GC on early versions of HotSpot v25 is buggy.
- Some hot methods have been refactored so that they can be inlined.
- Various unused/unneeded settings have been removed: es.max-open-files, es.netty.gathering, es.useLinkedTransferQueue, line.separator, action.search.optimize_single_shard
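The document-ID limit above is measured in bytes, not characters, so multi-byte UTF-8 IDs reach it sooner than their character count suggests. A minimal client-side check might look like the sketch below; the 512-byte constant comes from the change above, while the class and method names are ours for illustration.

```java
import java.nio.charset.StandardCharsets;

public class IdLengthCheck {
    // Hard limit on document ID length introduced in master, in UTF-8 bytes.
    static final int MAX_ID_BYTES = 512;

    // Returns true if the given document ID fits within the byte limit.
    static boolean isValidId(String id) {
        return id.getBytes(StandardCharsets.UTF_8).length <= MAX_ID_BYTES;
    }
}
```

Note that a 512-character ID made of two-byte characters occupies 1024 bytes and would be rejected, even though its character count equals the limit.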
Ongoing changes:
- Tasks now have timestamps, making it possible to see how long they have been running.
- Task IDs are now represented as single strings instead of tuples of node ID and task ID.
- Index names will no longer be tied to the name of the index folder on disk.
- Dangling indices will no longer be imported if the cluster UUID of the index is the same as the current cluster UUID (which indicates that the index was deleted while a node was incommunicado).
- The segments API will be able to return disk usage by Lucene file type.
- Work continues on trying to allow dots in field names.
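The single-string task IDs mentioned above combine the node ID and the per-node task ID into one value. A minimal sketch of how such a representation could round-trip is shown below; the colon-separated "nodeId:taskId" layout and all names here are our illustrative assumptions, not the actual Elasticsearch implementation.

```java
public class TaskIdExample {
    // Assumed components of the single-string task ID discussed above.
    final String nodeId;
    final long id;

    TaskIdExample(String nodeId, long id) {
        this.nodeId = nodeId;
        this.id = id;
    }

    // Render the (node ID, task ID) tuple as one string.
    @Override
    public String toString() {
        return nodeId + ":" + id;
    }

    // Parse the single-string form back into its two components.
    static TaskIdExample parse(String s) {
        int sep = s.lastIndexOf(':');
        if (sep < 0) {
            throw new IllegalArgumentException("malformed task id: " + s);
        }
        return new TaskIdExample(s.substring(0, sep),
                                 Long.parseLong(s.substring(sep + 1)));
    }
}
```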
Apache Lucene
- Lucene 5.5.0 was officially released on February 22nd, but whether a 5.6.0 release will happen even after we switch to 6.x stable releases has proven strangely contentious.
- The Lucene 6.0.0 release process will begin early this week, starting with cutting the 6.x branch.
- The CheckIndex tool would sometimes hit an exception-during-exception (an exception thrown while trying to throw another exception about index corruption), because BytesRefBuilder.toString is not allowed.
- Lots of scrutiny and many improvements to the new points queries in preparation for the 6.0.0 release:
  - support for BigInteger and InetAddress (v4 and v6!)
  - better validation of the incoming arguments
  - a new PointInSetQuery, matching any documents that have any of the values in the set of points
  - javadocs improvements
  - improvements to the geo3d APIs
  - removal of the sandbox's PointInRectQuery in favor of the faster core PointRangeQuery
  - moving all encode/decode methods onto the XXXPoint classes
  - a cleaner API, where the XXXPoint classes have static factory methods to generate their matching queries, and additional API improvements
- Even more verbosity for a non-reproducible test failure that only fails on OS X, rarely.
- Another fix in the long tail of our switch from Subversion to git.
- The silly things we must do to silence our overly naggy Java compiler.
- More improvements to MMapDirectory in preparation for Java 9, but we continue to uncover new Java 9 bugs, like this serious bug in method handles, though progress is being made towards a fix.
- The Java 9 bug Lucene's tests uncovered last week has been resolved as a duplicate of another (already fixed but not yet released) bug.
- Creating a hashCode that does not accidentally cause a high collision rate is not easy!
- 800+ new top-level domains have been created since we last fixed StandardTokenizer to detect them!
- Heavy delete-by-query use in Lucene is costly.
- Lucene's range faceting can't yet handle multi-valued fields (patches welcome!).
- The legacy spatial code will move to a new spatial-extras module soon.
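On the hashCode point above, a tiny illustration (not Lucene's actual code) shows how easy it is to get an accidental high collision rate: XOR-combining two fields makes every swapped pair collide and maps equal fields to zero, while the conventional 31-based polynomial combination breaks that symmetry.

```java
public class HashExample {
    // A naive hash that XORs the fields: any pair (x, y) collides with (y, x),
    // and (v, v) always hashes to 0 -- an accidental high collision rate.
    static int naiveHash(int x, int y) {
        return x ^ y;
    }

    // The conventional polynomial combination of the fields, which breaks
    // the symmetry and spreads values far better in practice.
    static int betterHash(int x, int y) {
        return 31 * (31 + x) + y;
    }
}
```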
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!