This Week in Elasticsearch and Apache Lucene - 2017-04-18
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Elasticsearch as an Android Malware Research Platform https://t.co/vtlFKsoR5z pic.twitter.com/rJhrjmWzKi
— Andrea De Pasquale (@a_de_pasquale) April 15, 2017
Changes in 5.4:
- Search hits and aggregations are now reduced in batches to reduce memory usage on the coordinating node. This allows us to remove the 1,000-shard soft limit.
- The `_remote/info` API provides information about remote clusters configured for cross cluster search.
- Remote cluster names can now use wildcards, including `*` to match all configured clusters.
- Fixed the handling of array settings when default settings are also present.
- Remove support for default settings except for `path.data`, `path.conf`, and `path.logs`, which will be removed in a later version.
- The `path.data` environment variable should not be set unless explicitly set by the user.
- A node which detects remnants of the `path.data`/`default.path.data` bug will refuse to start.
- Warn when not enough master eligible nodes are present.
- Load S3 plugin static settings eagerly so that the secure settings keystore can be closed.
- The `_field_stats` API has been deprecated in favour of `_field_caps`.
- There was a race condition when recovering replicas at the same time as relocating the primary.
- Fixed a memory leak when using inner hits inside a nested query by replacing `NestedChildrenQuery` with `ParentChildrenBlockJoinQuery`.
- Duplicate command line settings are no longer allowed.
- The JNA library is now built by Elastic with native libs for all of the platforms we support.
- Deduplicate `warning` headers by hand instead of with a slow regex.
- Reject empty document IDs.
- The context suggester now accepts numeric and boolean contexts, not just strings.
- Closing a `ReleasableBytesStreamOutput` now releases the underlying `BigArray` so that these streams can use `try-with-resources`.
- The secure settings keystore can now store files, needed for the GCS repository.
- Older nodes should be able to parse `TaskInfo` from newer nodes, ignoring any new elements.
- The build will fail if code tries to log before logging is configured.
- Hidden files in the plugin directory are no longer ignored.
- Shadow replicas have been removed.
- It is no longer possible to specify custom `ES_USER` and `ES_GROUP` in packages.
- Sufficient translog generations are now preserved to ensure that shards can recover from their local checkpoint.
- Sequence numbers are now used instead of version numbers to identify out-of-order indexing/delete operations during replication and recovery.
- Removed code to support old Lucene versions that didn't write checksums.
- The Java high level REST client is in the final (and longest) stage: learning to parse aggregation responses.
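To make the first item in the list above more concrete, here is a rough sketch of why reducing in batches bounds memory on the coordinating node. It is not Elasticsearch's implementation: it is a small Python model that assumes each shard returns a list of `(score, doc_id)` hits and that only the top `k` hits are wanted. The names `merge_top_k`, `batched_reduce` and `batch_size` are hypothetical and exist only for this illustration.

```python
import heapq
from typing import Iterable, List, Tuple

Hit = Tuple[float, str]  # (score, doc_id); an assumed shape for this sketch


def merge_top_k(partial: List[Hit], shard_hits: List[Hit], k: int) -> List[Hit]:
    """Merge one shard's hits into the running result, keeping the best k."""
    return heapq.nlargest(k, partial + shard_hits)


def batched_reduce(
    shard_results: Iterable[List[Hit]], k: int, batch_size: int
) -> Tuple[List[Hit], int]:
    """Reduce shard results in batches instead of buffering all of them.

    Returns the top-k hits and the largest number of shard results that
    were buffered at any one time (never more than batch_size).
    """
    partial: List[Hit] = []
    buffer: List[List[Hit]] = []
    max_buffered = 0

    for hits in shard_results:
        buffer.append(hits)
        max_buffered = max(max_buffered, len(buffer))
        if len(buffer) >= batch_size:
            # Partial reduce: fold the buffered shard results into the
            # running top-k and free the buffer.
            for buffered in buffer:
                partial = merge_top_k(partial, buffered, k)
            buffer.clear()

    # Final reduce over whatever is left in the buffer.
    for buffered in buffer:
        partial = merge_top_k(partial, buffered, k)

    return partial, max_buffered
```

Because the top-k of a union can be computed from the top-k of its parts, the batched result matches a single reduce over all shards, while the number of buffered shard responses stays at or below `batch_size` no matter how many shards the search touches.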
Apache Lucene
The 6.5.1 release is delayed
A Solr bug is delaying 6.5.1, which triggered a discussion about whether we should still get 6.5.1 out and work on getting 6.5.2 released shortly afterwards, or whether we should wait for the bug to be fixed before building a new release candidate for 6.5.1. The weak consensus seems to be to wait.
Elasticsearch master is now on Lucene 7
Elasticsearch master has been upgraded to a Lucene 7 snapshot so that we can start verifying what impact it has for us, especially in terms of disk footprint and performance given changes around sparse norms and doc values. The nightly benchmarks should pick up this change as of tomorrow.
Other changes:
- TermInSetQuery should expose a way to know which field it runs on.
- HeatMapFacetCounter should skip segments with no values.
- The KNN classifier and More-like-this are moving to BM25 rather than TF-IDF.
- Could we only run precommit on files that need it?
- We should check that close listeners are not registered on closed readers.
- OfflineSorter should not consume exhausted iterators.
- Can we make RAMDirectory faster by removing unnecessary synchronization?
- An issue about exposing how much memory BKDWriter may use quickly turned into making offline sorting faster.
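For readers who want a feel for what the move from TF-IDF to BM25 in the list above changes, the sketch below compares the two on a single term. It uses the textbook BM25 formula and a simplified TF-IDF variant (square-root term frequency, no length normalization); neither function is Lucene's exact code, and the function names are hypothetical. The values `k1=1.2` and `b=0.75` are commonly used defaults.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document."""
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Documents longer than average are penalized, shorter ones boosted.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # The term-frequency part saturates: it can never exceed k1 + 1.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


def tfidf_term_score(tf, n_docs, doc_freq):
    """Simplified TF-IDF: sqrt(tf) * idf, which keeps growing with tf."""
    idf = 1.0 + math.log(n_docs / (doc_freq + 1.0))
    return math.sqrt(tf) * idf
```

The practical difference is saturation: under BM25, repeating a term many times in a document yields diminishing returns, whereas the TF-IDF variant here keeps rewarding repetition.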
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!