This Week in Elasticsearch and Apache Lucene - 2017-04-03
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
That time we wired up a ballerina to index her movements into #Elasticsearch & visualize with #Kibana: https://t.co/3BEQZW1vsf #elasticon pic.twitter.com/kcoRNdvkTC
— elastic (@elastic) March 23, 2017
Changes in 5.3:
- Fixed parsing of search.remote.node.attr, which tells Elasticsearch which remote nodes can perform cross-cluster search.
- Cluster stats should not render empty http/transport types.
- Sliced scrolls could partition documents differently on different nodes as the hashCode initialisation depended on current system time.
- Reindex-from-remote stopped working with nodes before 2.0 because older versions require the scroll ID as a plain-text body.
- The update request's timeout parameter was not being passed to the underlying index/delete requests.
- Added infrastructure to mark internal requests as such, so that these requests can be executed as the system user.
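The sliced-scroll fix in the list above rests on a general property: every node has to assign a given document to the same slice. Here is a minimal Python sketch of that idea. It is not Elasticsearch's implementation; the function names, the SHA-1 hash and the `seed` parameter are illustrative assumptions.

```python
import hashlib


def slice_for(doc_id: str, max_slices: int, seed: int = 0) -> int:
    """Assign a document to a slice from a hash of its ID and a seed (hypothetical helper)."""
    digest = hashlib.sha1(f"{seed}:{doc_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % max_slices


def partition(doc_ids, max_slices, seed=0):
    """Group document IDs by slice number."""
    slices = {i: [] for i in range(max_slices)}
    for doc_id in doc_ids:
        slices[slice_for(doc_id, max_slices, seed)].append(doc_id)
    return slices


doc_ids = [f"doc-{i}" for i in range(1000)]

# Two "nodes" that share a fixed seed agree on every assignment.
node_a = partition(doc_ids, 4, seed=0)
node_b = partition(doc_ids, 4, seed=0)

# A node whose seed came from something local (for example its clock)
# would split the same documents differently.
node_c = partition(doc_ids, 4, seed=12345)
```

If the seed differs between nodes, a document can be returned by two slices or by none, which is why the hash initialisation must not depend on the current time.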
Changes in 5.x:
- Lucene upgraded to 6.5.0.
- The field capabilities API provides a lightweight API to return a list of field names deduped across multiple types/indices, and whether each field can be used for search or aggregations.
- Streamlining shard index availability in all SearchPhaseResults allows the removal of AtomicArray.Entry, which greatly reduces the number of objects created in every search phase. A further cleanup removed AtomicArray completely.
- CollapseTopFieldDocs can now do incremental reduction, like the TopDoc variants from Lucene.
- Older nodes should not be able to join a cluster which contains indices created by newer nodes.
- Azure repositories now use a configurable exponential backoff policy, and Azure Storage has been updated to 5.0.0.
- The keyword marker token filter now supports pattern matching, which allows excluding short words from being stemmed.
- The new Boolean Similarity lets a field disable all text scoring features such as IDF, TF, and length norm.
- MGet requests should not accept unknown top-level parameters.
- Improved error message for epoch dates with a non-UTC timezone.
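For readers unfamiliar with the exponential backoff policy mentioned in the Azure repository item above, the following sketch shows the general technique. The parameter names are hypothetical and do not correspond to the plugin's settings or to the Azure Storage SDK.

```python
def backoff_delays(initial_delay: float, max_retries: int, max_delay: float):
    """Yield the wait time before each retry, doubling each attempt up to max_delay."""
    delay = initial_delay
    for _ in range(max_retries):
        yield min(delay, max_delay)
        delay *= 2


print(list(backoff_delays(0.5, 5, 4.0)))  # [0.5, 1.0, 2.0, 4.0, 4.0]
```

Capping the delay keeps a long outage from producing unbounded waits, while the doubling spreads retries out so a struggling service is not hammered.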
Changes in master:
- The global repositories.azure and repositories.s3 settings have been removed.
- Translogs are now automatically rolled over when they reach 64MB, making it easier to keep disk usage under control when multiple generations of translogs are required by sequence numbers.
- It is dangerous to execute delete-by-query if no query has been specified explicitly.
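The translog rollover described in the list above can be pictured with a toy model. This is only a sketch of size-based rollover, assuming the roll happens once the current generation passes the threshold; the class and attribute names are invented for illustration and the threshold is scaled down from 64MB to 64 bytes.

```python
class ToyTranslog:
    """Toy model of a log that starts a new generation once the current one grows too large."""

    def __init__(self, threshold_bytes: int):
        self.threshold_bytes = threshold_bytes
        self.generation = 1
        self.sizes = {1: 0}  # generation number -> bytes written

    def add(self, operation: bytes) -> None:
        self.sizes[self.generation] += len(operation)
        # Roll over after the write that pushes the generation past the threshold.
        if self.sizes[self.generation] > self.threshold_bytes:
            self.generation += 1
            self.sizes[self.generation] = 0


translog = ToyTranslog(threshold_bytes=64)
for _ in range(10):
    translog.add(b"x" * 20)

print(translog.generation)  # 3
print(translog.sizes)       # {1: 80, 2: 80, 3: 40}
```

Because each generation is bounded in size, old generations can be dropped one at a time as soon as nothing needs them, rather than waiting for a single large file to become obsolete.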
Apache Lucene
Lucene 6.5.1
We'll likely start the release process for 6.5.1 next week. There is a chance it will be ready in time for Elasticsearch 5.4.0.
Potential memory leak with parent/child queries
This is similar to another bug with term queries. In short, queries that reference IndexReaders (either directly or transitively) are problematic because queries might stay in the cache long after the index that they were run against has been closed. Such queries then hold references to the in-memory data structures that Lucene maintains on top of the data that sits on disk, which can normally be reclaimed once a segment is removed. A fix for this bug will be in Lucene 6.5.1.
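The retention pattern described above is not specific to Lucene. The sketch below reproduces it in Python with stand-in classes: `Reader`, `Query` and the dictionary cache are illustrative assumptions, not Lucene types.

```python
import gc
import weakref


class Reader:
    """Stands in for an IndexReader and the in-memory structures it owns."""


class Query:
    """A query that keeps a reference to the reader it ran against."""

    def __init__(self, reader):
        self.reader = reader


def reader_alive_after_close(cache_query: bool) -> bool:
    """Return True if the reader is still reachable after the caller drops it."""
    cache = {}
    reader = Reader()
    probe = weakref.ref(reader)  # lets us observe the reader without keeping it alive
    query = Query(reader)
    if cache_query:
        cache["q"] = query
    del query, reader            # the caller is done with both
    gc.collect()
    return probe() is not None


print(reader_alive_after_close(cache_query=True))   # True
print(reader_alive_after_close(cache_query=False))  # False
```

As long as the cache entry exists, the reader cannot be reclaimed, even though nothing else uses it.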
Other changes
- The recent improvements to range queries were actually disabled by the query cache because it does not propagate the new scorerSupplier API. A fix for this bug will hopefully be in 6.5.1.
- We are continuing to make the code better based on Findbugs reports.
- Should we replace all language-specific analyzer impls with constant instances of a CustomAnalyzer?
- The grouping collectors can be simplified.
- Adding the index creation version to segments metadata forced us to refuse to merge indices that had been created by different versions, since there would be no index creation version that would make sense afterwards. So we are looking into only storing the major version that was used at index creation time, and adding metadata to the segments about the oldest Lucene version that contributed to this segment.
- The fact that n-gram token filters do not update offsets is confusing to some users, yet this is the desired behaviour in some cases, and changing offsets in token filters can't be done in a way that is guaranteed correct in all cases.
- The release process should produce sorted indices that we could test backward compatibility against.
- We would like to explore using GPUs to speed up Lucene via the Google Summer of Code program. The expected outcome of this project as well as possible paths for inclusion in Lucene are still under (interesting) discussion.
- ComplexPhraseQueryParser produces queries that the unified highlighter cannot highlight.
Watch This Space
Stay tuned to this blog, where we'll share more on the whole Elastic ecosystem, including news, learning resources and cool use cases!