This Week in Elasticsearch and Apache Lucene - 2017-04-10
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Batched reduction of search results
Elasticsearch has gained the ability to reduce the top N shard results incrementally. This is the next step towards reducing intermediate memory consumption during search execution, which will allow us to relax current limitations on the number of shards involved in a single search request. Ultimately, this will allow searches against a large number of shards (in other words, hitting the entire cluster) with sustainable memory consumption on the coordinating node.
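To illustrate the idea, here is a minimal Python sketch of incremental top-N reduction. It is not Elasticsearch's implementation: the function name `batched_reduce`, the `batch_size` parameter, and the `(score, doc_id)` hit representation are assumptions made for this example. The point it shows is that the coordinating node only ever needs to hold the current top N plus one small batch of shard results, rather than every shard response at once.

```python
import heapq
from itertools import chain
from typing import Iterable, List, Tuple

# Hypothetical hit representation for this sketch: (score, doc_id).
Hit = Tuple[float, str]


def batched_reduce(shard_results: Iterable[List[Hit]], n: int, batch_size: int) -> List[Hit]:
    """Reduce per-shard top hits to a global top n, one batch of shards at a time.

    At most `batch_size` shard results are buffered before they are merged
    into the running top n and discarded.
    """
    reduced: List[Hit] = []        # running global top n
    buffer: List[List[Hit]] = []   # shard results waiting to be merged

    for hits in shard_results:
        buffer.append(hits)
        if len(buffer) >= batch_size:
            # Partial reduce: merge the buffered shards into the running top n.
            reduced = heapq.nlargest(n, chain(reduced, *buffer))
            buffer.clear()

    if buffer:
        # Final reduce for any shards left over after the last full batch.
        reduced = heapq.nlargest(n, chain(reduced, *buffer))
    return reduced
```

Because taking the top N of a set of partial top-N lists gives the same answer as taking the top N of everything at once, the batch size in this sketch only affects peak memory, not the result.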
_field_stats is dead, long live _field_caps
After several PRs, the _field_stats API has been deprecated in favour of a more lightweight API, called the _field_caps API, that only exposes unique field names across indices/types and whether they are aggregatable. One important feature of the _field_stats API was the ability to return the min/max values of fields, alongside filtering capabilities, so that Kibana could figure out which time-based indices a query would match based on its time filter. This was important because including too many indices in a search request could cause memory issues, since Elasticsearch would have to hold all shard responses in memory on the coordinating node. To work around this issue, we introduced incremental reduction of shard responses, which keeps memory usage on the coordinating node under control. As a consequence, Kibana should now be able to query all indices all the time, and Elasticsearch will make sure to execute things efficiently.
A common misunderstanding is that this change will make queries more costly since they now have to go to all shards. However, this was already the case with the _field_stats API. Now, queries that have a time filter that does not match any documents will return instantly, so we do not expect any performance regression.
Non-deterministic distributions in Rally
Rally gained support for modelling non-deterministic distributions of operations. With this change, you can model Poisson processes with Rally out of the box. Poisson processes are often used to model arrival processes of independent actors (think coffee shops, telephone hotlines and, most importantly for us, Web services) with a defined mean "arrival rate" (i.e. throughput). They play a central role in queuing theory.
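For readers unfamiliar with Poisson processes, the following Python sketch shows the underlying technique: in a Poisson process with mean arrival rate λ, the gaps between consecutive arrivals are independent and exponentially distributed with mean 1/λ. This is not Rally's code or configuration; the function name `poisson_arrival_times` and its parameters are hypothetical and exist only for this illustration.

```python
import random
from typing import List, Optional


def poisson_arrival_times(rate: float, count: int, seed: Optional[int] = None) -> List[float]:
    """Return `count` arrival timestamps (in seconds) of a Poisson process.

    `rate` is the mean number of arrivals per second. Each gap between
    arrivals is drawn from an exponential distribution with mean 1 / rate.
    """
    rng = random.Random(seed)  # seeded for reproducible schedules
    now = 0.0
    times: List[float] = []
    for _ in range(count):
        now += rng.expovariate(rate)  # exponential inter-arrival time
        times.append(now)
    return times
```

A load generator following such a schedule would issue requests at these timestamps instead of at fixed intervals, so the target throughput holds on average while individual requests arrive in irregular bursts and lulls.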
Changes in 5.3:
- Disable graph analysis of a shingles or CJK token filter with mixed shingle lengths to prevent OOMs.
- Lucene upgraded to 6.4.2.
- Closing a ReleasableBytesStreamOutput closes and releases the underlying BigArray.
- significant_terms aggs should parse the filter element as a filter, not as a query.
- Custom sorting has been replaced with SortedSetSortField and SortedNumericSortField where possible.
- Only the master node should be validating the
- Response headers (eg deprecation warnings) should be preserved when creating an index and updating index settings.
- A new single_node discovery type disables bootstrap checks so that users can test the transport client against a node started in Docker.
- The process for merging sorted top docs has been simplified to use the same code path regardless of whether there are multiple shards or only a single shard with results.
- The percolator now supports range queries which use
- Settings marked with the Final property cannot be updated.
- Sensitive EC2 settings are now stored in the Elasticsearch keystore.
- InternalEngine's index/delete flow has been refactored for better clarity.
- A partial snapshot that included some successful and some failed indices now reports the snapshot status as failed instead of throwing a NoSuchFileException.
- Disabled retries in the S3 blob store as the Amazon S3 client already does retries.
- Reindex-from-remote should use integer time values for the scroll timeout.
- Skip hidden files/directories in the plugins directory when spawning native controllers - this has been removed in master.
- Upgraded Log4j to 2.8.2.
- Hidden files/directories are no longer allowed in the
- Removed deprecated S3/EC2
- Removed deprecated EC2
- Cross-cluster search remote cluster info API and support for wildcard cluster names.
- Automatically adjust search threadpool size to route around degraded nodes.
Elasticsearch master is being upgraded to a snapshot of Lucene 7.0
With Lucene 7.0 coming soon, we want to make sure that we are not using Lucene in ways that won't be allowed anymore in Lucene 7. This will also give Lucene 7 more integration tests before it gets released. For the record, you can read about what Lucene 7 will bring at http://blog.mikemccandless.com/2017/03/apache-lucene-70-is-coming-soon.html.
- The unified highlighter cannot highlight multi-term queries if they are wrapped in a BoostQuery.
- Being able to use a different stored field with the unified highlighter can be useful when dealing with multi-lingual content.
- Grouping collectors could be simplified.
- SortedDocValues.ordValue is inconsistent with other similar APIs by not throwing an exception.
- Can we enhance IndexUpgradeMergePolicy to not require the IndexWriter to be closed to perform an upgrade?
- Norms could be encoded more accurately now that index-time boosts have been removed but the backward-compatibility layer is not simple.
- Adding the index creation version to segments metadata forced us to refuse to merge indices that had been created by different versions, since there would be no index creation version that would make sense afterwards. So we are now only storing the major version that was used at index creation time, and adding metadata to the segments about the oldest Lucene version that contributed to this segment.
- Can suffix arrays help speed up infix queries?
- Now that two-phase iteration has matured, should we remove Scorer.iterator() and only iterate matching docs with two-phase iteration?
- Javadocs of the maximum token length of StandardAnalyzer are being updated to better describe the effect of this parameter.
- Recent range query improvements will now work with the query cache.
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!