This Week in Elasticsearch and Apache Lucene - 2017-05-29
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Elastic Cloud Enterprise is here. Centrally manage all of your #ElasticStack assets on any infrastructure with ease: https://t.co/8qrnDEz4TL
— elastic (@elastic) May 31, 2017
Live Primary/Replica syncs are off to a good start
It has been a long-standing issue that replicas can fall out of sync when a primary shard fails. The differences are subtle and only relate to write operations that were in flight when the old primary failed. To date, these differences could only be resolved by doing a full file-based sync, which is slow and heavy. That heavy-handed solution meant we had to wait for the next shard relocation to fix it, meaning that shard counts can be off for quite a while. With the introduction of sequence numbers we now have a faster, ops-based synchronisation mechanism. Instead of aligning segment files (which can diverge quite a bit), we can identify and sync only the operations that need to be corrected. Since these are typically just a few operations, we can use this mechanism to re-sync the replicas with the new primary and make sure they are identical. This can be done in real time and without stopping ongoing indexing, which is pretty unique.
Originally we thought that using the new ops-based recovery logic to do a live background sync would only make it somewhere in the 6.x series, but we're making good headway (#24779, #24925, #24841, #24825) and version 6.0 will already have some of this functionality. Any operation that's present on the new primary will be replicated over to the replicas, making sure that everything on the new primary is also on the replica. Of course, replicas can also have operations that are not present on the new primary, and this issue also needs to be addressed by allowing replicas to roll back those unacknowledged changes. This part will be coming during the 6.x lifetime and will not be part of 6.0.
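To make the idea concrete, here is a minimal Python sketch of an ops-based resync. It is a toy model and not Elasticsearch code: the `Shard` class, the `resync` function and the `synced_up_to` marker (a sequence number assumed to be present on every copy) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """Toy shard copy: maps sequence number -> operation payload."""
    name: str
    ops: dict = field(default_factory=dict)


def resync(primary: Shard, replica: Shard, synced_up_to: int) -> list:
    """Replay onto the replica every primary op above `synced_up_to`.

    `synced_up_to` is a hypothetical marker: the highest sequence number
    assumed to be present on all copies. Only ops above it are compared.
    Returns the sequence numbers that were copied over.
    """
    copied = []
    for seq_no in sorted(primary.ops):
        if seq_no > synced_up_to and seq_no not in replica.ops:
            replica.ops[seq_no] = primary.ops[seq_no]
            copied.append(seq_no)
    return copied


# Ops 0-2 reached every copy; later ops were in flight when the old primary failed.
new_primary = Shard("new-primary", {0: "index a", 1: "index b", 2: "index c",
                                    3: "index d", 4: "delete a"})
replica = Shard("replica", {0: "index a", 1: "index b", 2: "index c",
                            4: "delete a", 5: "index e"})

copied = resync(new_primary, replica, synced_up_to=2)
print(copied)               # [3]
print(sorted(replica.ops))  # [0, 1, 2, 3, 4, 5]
```

Only operation 3 has to travel over the wire, rather than whole segment files. Operation 5 exists on the replica but not on the new primary, so it survives this sync; that is the unacknowledged-change case that still needs a rollback.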
Significant Text Aggregation
The new significant_text aggregation has landed in master. This allows users to run significant terms on text fields without using field data, which greatly reduces the heap usage for significant terms on text fields. Instead, the aggregation works by re-analysing the source field for the collected documents (or a sample thereof). This agg also contains a feature for removing duplicated text (such as email signatures or inline ads in articles) from the analysis to stop them from skewing the results.
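As a rough illustration, a search request using the aggregation might be built like this in Python. The field name `content`, the query text and the aggregation names `sample` and `keywords` are placeholders. The `field` and `filter_duplicate_text` parameters are written as they appear in the 6.0 reference documentation, so check the docs for the version you are running.

```python
import json

# Field name "content" and the query text are placeholders for illustration.
request_body = {
    "query": {"match": {"content": "bird flu"}},
    "aggs": {
        "sample": {
            # Re-analysing source text is costly, so only a sample of the
            # top-matching documents on each shard is analysed.
            "sampler": {"shard_size": 100},
            "aggs": {
                "keywords": {
                    "significant_text": {
                        "field": "content",
                        # Skip repeated passages such as signatures or ads.
                        "filter_duplicate_text": True,
                    }
                }
            },
        }
    },
}

print(json.dumps(request_body, indent=2))
```

Wrapping the aggregation in a sampler is optional, but it keeps the re-analysis cost bounded by looking only at the best-matching documents.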
Changes in 5.4:
- Some percolator queries from 2.x indices were not being rewritten in 5.x.
- The indices query could result in an infinite loop.
- The within and contains relationships in range fields gave false positives.
- Use explicit dependency exclusions instead of Gradle-generated exclusions to work around a bug in Apache Ivy.
- A double decrement on the query counter stats could result in unexpected negative stats.
- The Assertions.ENABLED helper should simplify the code to use assertions.
- The _field_caps API should check that a remote cluster is connected before fetching.
- The aggregation parsers for the high level Java REST client have been backported to 5.x, along with support for scrolled search requests.
- Cached queries could return incorrect weights.
- The Lucene version constants for each version should be verified.
Changes in master:
- Using ?preference={session_id} in a search request now takes the shardId into account, instead of just choosing the first shard in the list, to more fairly distribute searches across nodes.
- Version constants no longer need the _UNRELEASED suffix as we now use rules to determine which versions should be considered released, and thus subject to bwc testing.
- A single translog generation file is guaranteed to never have conflicting sequence numbers that need to be resolved by examining the primary term, which will make repairing these conflicts simpler.
- The work to compile a script for use in a particular context is landing in master.
- Available and free disk stats should be protected from going negative on huge file systems.
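The preference change in the first item of this list is easier to see with a small sketch. This is a simplified Python model and not the actual routing code: `pick_copy` is a hypothetical helper, `zlib.crc32` stands in for the real hash function, and the `31 * hash + shard_id` mixing step is an assumption made for illustration.

```python
import zlib


def pick_copy(preference: str, shard_id: int, copies: list,
              use_shard_id: bool = True) -> str:
    """Pick one copy of a shard for a search with a custom preference string.

    zlib.crc32 is a stand-in hash, and the mixing formula is an assumption.
    """
    routing_hash = zlib.crc32(preference.encode("utf-8"))
    if use_shard_id:
        # Mix in the shard id so different shards land on different positions.
        routing_hash = 31 * routing_hash + shard_id
    return copies[routing_hash % len(copies)]


# Four shards, each with one copy on node-a and one on node-b.
copies = ["node-a", "node-b"]
session = "session-1234"

old = [pick_copy(session, s, copies, use_shard_id=False) for s in range(4)]
new = [pick_copy(session, s, copies, use_shard_id=True) for s in range(4)]

print(old)  # the same node for all four shards
print(new)  # alternates between the two nodes
```

In both cases the same session always reaches the same copy of a given shard, which is the point of a custom preference. The difference is that, without the shard id, one node ends up serving every shard for that session.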
Coming soon:
- A search whose sort parameters coincide with the index sort order should terminate as soon as enough hits have been retrieved.
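The intuition behind early termination can be sketched in a few lines of Python. This is a toy model of a single segment and not Lucene's collector code; `top_hits` and the sample documents are made up for illustration.

```python
def top_hits(segment_docs, matches, size):
    """Collect the first `size` matching docs from a segment that is already
    stored in the same order as the requested sort.

    Returns (hits, docs_visited).
    """
    hits = []
    visited = 0
    for doc in segment_docs:
        visited += 1
        if matches(doc):
            hits.append(doc)
            if len(hits) == size:
                break  # every later doc sorts after these, so stop early
    return hits, visited


# A segment whose docs are stored in ascending timestamp order (the index sort).
segment = [{"timestamp": t, "level": "error" if t % 2 == 0 else "info"}
           for t in range(1000)]

hits, visited = top_hits(segment, lambda d: d["level"] == "error", size=3)
print([d["timestamp"] for d in hits])  # [0, 2, 4]
print(visited)                         # 5
```

Without the matching index sort, all 1000 documents would have to be visited and ranked before the top three could be known. The trade-off in this sketch is that stopping early also means the total number of matches is never counted.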
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!