This Week in Elasticsearch and Apache Lucene - 2017-04-10
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
A bit about recent improvements for when #Elasticsearch range queries are used in conjunctions: https://t.co/bTqbEKxl7t
— elastic (@elastic) April 10, 2017
Batched reduction of search results
Elasticsearch has gained the ability to reduce the top N shard results incrementally. This is the next step towards reducing intermediate memory consumption during search execution, which will allow us to relax current limitations on the number of shards involved in a single search request. Ultimately, this will allow searches against a large number of shards (in other words, hitting the entire cluster) with sustainable memory consumption on the coordinating node.
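To make the idea concrete, here is a minimal sketch of an incremental top-N reduction. It is not Elasticsearch's implementation: the `Hit` tuple, the function names and the `batch_size` parameter are all hypothetical, chosen only to show that the coordinating node never needs to hold more than one batch of shard results plus the running top N.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical hit representation: (score, document id).
Hit = Tuple[float, str]


def reduce_batch(partial: List[Hit], batch: List[List[Hit]], n: int) -> List[Hit]:
    """Merge the running top N with one batch of shard results, keeping only N hits."""
    candidates = list(partial)
    for shard_hits in batch:
        candidates.extend(shard_hits)
    # nlargest returns the hits sorted from highest to lowest score.
    return heapq.nlargest(n, candidates)


def batched_top_n(shard_results: Iterable[List[Hit]], n: int, batch_size: int) -> List[Hit]:
    """Consume shard results as they arrive and reduce them every `batch_size` shards."""
    partial: List[Hit] = []
    buffer: List[List[Hit]] = []
    for shard_hits in shard_results:
        buffer.append(shard_hits)
        if len(buffer) == batch_size:
            partial = reduce_batch(partial, buffer, n)
            buffer = []  # the buffered shard responses can now be garbage collected
    if buffer:
        partial = reduce_batch(partial, buffer, n)
    return partial
```

In this sketch, memory use is bounded by `batch_size` shard responses plus N hits, instead of growing with the total number of shards.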
_field_stats is dead, long live _field_caps
After several PRs, the _field_stats API has been deprecated in favour of a more lightweight API that only exposes unique field names across indices/types, and whether they are searchable and/or aggregatable, called the _field_caps API. One important bit of the _field_stats API was the ability to return the min/max values of fields alongside filtering capabilities so that Kibana could figure out which time-based indices a query would match based on its time filter. This was important as including too many indices in a search request could cause memory issues since Elasticsearch would have to hold all shard responses in memory on the coordinating node. In order to work around this issue, we introduced incremental reduction of shard responses in order to keep memory usage on the coordinating node under control. As a consequence, Kibana should now be able to query all indices all the time, and Elasticsearch will make sure to execute things efficiently.
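As a rough illustration of what "unique field names across indices, and whether they are searchable and/or aggregatable" means, the sketch below merges per-index field metadata into one entry per field. The input structure and the function name are hypothetical, and the rule that a field counts as searchable or aggregatable only if it is so in every index is a simplifying assumption made for this example, not a description of the actual API response.

```python
from typing import Dict

# Hypothetical input: index name -> field name -> capability flags.
IndexFields = Dict[str, Dict[str, Dict[str, bool]]]


def merge_field_caps(indices: IndexFields) -> Dict[str, Dict[str, bool]]:
    """Collapse per-index field metadata into one capability entry per field name."""
    merged: Dict[str, Dict[str, bool]] = {}
    for fields in indices.values():
        for name, caps in fields.items():
            entry = merged.setdefault(name, {"searchable": True, "aggregatable": True})
            # Assumption for this sketch: a capability holds only if every index agrees.
            entry["searchable"] = entry["searchable"] and caps["searchable"]
            entry["aggregatable"] = entry["aggregatable"] and caps["aggregatable"]
    return merged
```

Note that no min/max values are tracked here, which is what makes this kind of summary cheaper to compute than field statistics.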
A common misunderstanding is that this change will make queries more costly since they now have to go to all shards. However it was already the case with the _field_stats API. Now, queries that have a time filter that does not match any documents will return instantly, so we do not expect any performance regression.
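One way to picture why a non-matching time filter can be cheap: if a shard knows the smallest and largest timestamp it holds, it can decide whether a range could match without visiting any documents. The sketch below is only a model of that reasoning, under the assumption that such min/max values are available per shard; `ShardStats` and `can_match` are hypothetical names, not Elasticsearch classes.

```python
from dataclasses import dataclass


@dataclass
class ShardStats:
    """Hypothetical per-shard summary of a timestamp field (epoch millis)."""
    min_ts: int
    max_ts: int


def can_match(shard: ShardStats, gte: int, lte: int) -> bool:
    """Return False when the range [gte, lte] cannot overlap the shard's values."""
    return shard.max_ts >= gte and shard.min_ts <= lte
```

A shard for which `can_match` returns False can answer with an empty result immediately, so sending the query to it costs very little.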
Non-deterministic distributions in Rally
Rally gained support for modelling non-deterministic distributions of operations. With this change you can model Poisson processes with Rally out of the box. Poisson processes are often used to model arrival processes of independent actors (think coffee shops, telephone hotlines and most importantly for us: Web services) with a defined mean "arrival rate" (i.e. throughput). They play a central role in queuing theory.
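For readers unfamiliar with Poisson processes, the sketch below generates request arrival times whose gaps are exponentially distributed, which is the defining property of a Poisson process with a given mean rate. It is a standalone illustration rather than Rally's code; the function name and parameters are made up for the example.

```python
import random
from typing import List


def poisson_arrival_times(rate: float, count: int, seed: int = 0) -> List[float]:
    """Return `count` arrival timestamps (seconds) for a mean of `rate` arrivals per second."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    now = 0.0
    times: List[float] = []
    for _ in range(count):
        # Waiting times between arrivals are exponential with mean 1 / rate.
        now += rng.expovariate(rate)
        times.append(now)
    return times
```

With `rate=10.0`, individual gaps vary widely, but over many requests the average gap approaches 0.1 seconds. This is what distinguishes such a schedule from a deterministic one that issues a request exactly every 0.1 seconds.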
Changes in 5.3:
- Disable graph analysis of a shingles or CJK token filter with mixed shingle lengths to prevent OOMs.
- Lucene upgraded to 6.4.2.
- Closing a ReleasableBytesStreamOutput closes and releases the underlying BigArray.
- The `filter` and `significant_terms` aggs should parse the `filter` element as a filter, not as a query.
- Custom sorting has been replaced with SortedSetSortField and SortedNumericSortField where possible.
- Only the master node should be validating the `minimum_master_nodes` setting.
- Response headers (e.g. deprecation warnings) should be preserved when creating an index and updating index settings.
- A new `single_node` discovery type disables bootstrap checks so that users can test the transport client against a node started in Docker.
- The process for merging sorted top docs has been simplified to use the same code path regardless of whether there are multiple shards or only a single shard with results.
- The percolator now supports `range` queries which use `now`.
- Settings marked with the `Final` property cannot be updated.
- Sensitive EC2 settings are now stored in the Elasticsearch keystore.
- InternalEngine's index/delete flow has been refactored for better clarity.
- A `partial` snapshot that included some successful and some failed indices now reports the snapshot status as `failed` instead of throwing a NoSuchFileException.
- Disabled retries in the S3 blob store as the Amazon S3 client already does retries.
- Reindex-from-remote should use integer time values for the scroll timeout.
- Skip hidden files/directories in the `plugins` directory when spawning native controllers - this has been removed in master.
- Upgraded Log4j to 2.8.2.
- Hidden files/directories are no longer allowed in the `plugins` directory.
- Removed deprecated S3/EC2 `signer` settings.
- Removed deprecated EC2 `region` setting.
- Cross-cluster search remote cluster info API and support for wildcard cluster names.
- Automatically adjust search threadpool size to route around degraded nodes.
Apache Lucene
Elasticsearch master is being upgraded to a snapshot of Lucene 7.0
With Lucene 7.0 coming soon, we want to make sure that we are not using Lucene in ways that won't be allowed anymore in Lucene 7. This will also give Lucene 7 more integration tests before it gets released. For the record, you can read about what Lucene 7 will bring at http://blog.mikemccandless.com/2017/03/apache-lucene-70-is-coming-soon.html.
Other changes:
- The unified highlighter cannot highlight multi-term queries if they are wrapped in a BoostQuery.
- Being able to use a different stored field with the unified highlighter can be useful when dealing with multi-lingual content.
- Grouping collectors could be simplified.
- SortedDocValues.ordValue is inconsistent with other similar APIs by not throwing an exception.
- Can we enhance IndexUpgradeMergePolicy to not require the IndexWriter to be closed to perform an upgrade?
- Norms could be encoded more accurately now that index-time boosts have been removed but the backward-compatibility layer is not simple.
- Adding the index creation version to segments metadata forced us to refuse to merge indices that had been created by different versions, since there would be no index creation version that would make sense afterwards. So we are now only storing the major version that was used at index creation time, and adding metadata to the segments about the oldest Lucene version that contributed to this segment.
- Can suffix arrays help speed up infix queries?
- Now that two-phase iteration has matured, should we remove Scorer.iterator() and only iterate matching docs with two-phase iteration?
- Javadocs of the maximum token length of StandardAnalyzer are being updated to better describe the effect of this parameter.
- Recent range query improvements will now work with the query cache.
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!