May 29, 2017

This Week in Elasticsearch and Apache Lucene - 2017-05-29

By Clinton Gormley and Adrien Grand

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Live Primary/Replica syncs are off to a good start

It has been a long-standing issue that replicas can fall out of sync when a primary shard fails. The differences are subtle and only relate to write operations that were in flight when the old primary failed. To date, these differences can only be resolved by doing a full file-based sync, which is slow and heavy. That heavy-handed solution means we have to wait for the next shard relocation to fix it, meaning that shard counts can be off for quite a while. With the introduction of sequence numbers we now have a faster, ops-based synchronisation mechanism. Instead of aligning segment files (which can diverge quite a bit) we can identify and sync only the operations that need to be corrected. Since these are typically just a few operations, we can use this mechanism to re-sync the replicas with the new primary and make sure they are identical. This can be done in real time and without stopping ongoing indexing, which is pretty unique.

Originally we thought that using the new ops-based recovery logic to do a live background sync would only make it somewhere in the 6.x series, but we're making good headway (#24779, #24925, #24841, #24825) and version 6.0 will already have some of this functionality. Any operation that's present on the new primary will be replicated over to the replicas, making sure that everything on the new primary is also on the replica. Of course, replicas can also have operations that are not present on the new primary, and this issue also needs to be addressed by allowing replicas to roll back those unacknowledged changes. This part will be coming during the 6.x lifetime and will not be part of 6.0.
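To make the two halves of the problem concrete, here is a toy sketch in Python. It is not how Elasticsearch implements the resync: it assumes each shard copy can be modelled as a plain mapping from sequence number to operation, and the function names (`ops_missing_on_replica`, `ops_only_on_replica`, `resync`) are hypothetical. It only illustrates why comparing sequence numbers is cheaper than comparing segment files.

```python
# Toy model only: each shard copy is a dict of {sequence_number: operation}.
# The real resync in Elasticsearch is considerably more involved.


def ops_missing_on_replica(primary_ops, replica_ops):
    """Operations the new primary has that the replica lacks, in seq-no order."""
    return [
        (seq_no, op)
        for seq_no, op in sorted(primary_ops.items())
        if seq_no not in replica_ops
    ]


def ops_only_on_replica(primary_ops, replica_ops):
    """Sequence numbers the replica holds that the new primary never saw.

    These are the unacknowledged operations a replica would have to roll back.
    """
    return sorted(set(replica_ops) - set(primary_ops))


def resync(primary_ops, replica_ops):
    """Return a copy of the replica with the missing operations applied."""
    synced = dict(replica_ops)
    for seq_no, op in ops_missing_on_replica(primary_ops, replica_ops):
        synced[seq_no] = op
    return synced


# Hypothetical state after the old primary failed mid-flight.
new_primary = {0: "index A", 1: "index B", 2: "delete A", 3: "index C"}
replica = {0: "index A", 1: "index B", 4: "index D"}

missing = ops_missing_on_replica(new_primary, replica)
extra = ops_only_on_replica(new_primary, replica)
synced = resync(new_primary, replica)
```

In this example only operations 2 and 3 need to be sent to the replica, while operation 4 exists only on the replica and is the kind of change the rollback work would have to undo. Note that `resync` deliberately leaves operation 4 in place, mirroring the 6.0 scope described above.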

Significant Text Aggregation

The new significant_text aggregation has landed in master. This allows users to run significant terms on text fields without using field data, which greatly reduces the heap usage for significant terms on text fields. Instead, the aggregation works by re-analysing the source field for the collected documents (or a sample thereof). This agg also contains a feature for removing duplicated text (such as email signatures or inline ads in articles) from the analysis to stop them from skewing the results.
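As a rough illustration, a search request using this aggregation might look like the sketch below, built as a Python dictionary and printed as JSON. The field name `content`, the query text, and the aggregation names `sample` and `keywords` are made up for the example. The `sampler` wrapper and the `filter_duplicate_text` option follow the aggregation as documented in released versions, so details may differ from what was in master at the time of writing.

```python
import json


def significant_text_request(query_text, text_field="content", sample_size=100):
    """Build a search body that runs significant_text over a sample of hits.

    The field name and sample size are illustrative assumptions.
    """
    return {
        "query": {"match": {text_field: query_text}},
        "aggregations": {
            # Re-analysing source text is costly, so restrict the work
            # to the top-scoring documents on each shard.
            "sample": {
                "sampler": {"shard_size": sample_size},
                "aggregations": {
                    "keywords": {
                        "significant_text": {
                            "field": text_field,
                            # Skip repeated passages such as signatures or ads.
                            "filter_duplicate_text": True,
                        }
                    }
                },
            }
        },
    }


body = significant_text_request("bird flu")
print(json.dumps(body, indent=2))
```

The resulting JSON could be sent as the body of a `_search` request against an index whose documents have a text field with that name.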

Changes in 5.4:

Changes in 5.5:

Changes in master:

Coming soon:

Watch This Space

Stay tuned to this blog, where we'll share more on the whole Elastic ecosystem, including news, learning resources and cool use cases!