April 10, 2017

This Week in Elasticsearch and Apache Lucene - 2017-04-10

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

A bit about recent improvements for when #Elasticsearch range queries are used in conjunctions: https://t.co/bTqbEKxl7t
— elastic (@elastic) April 10, 2017

Batched reduction of search results

Elasticsearch has gained the ability to reduce the top N shard results incrementally. This is the next step towards reducing intermediate memory consumption during search execution, which will allow us to relax current limitations on the number of shards involved in a single search request. Ultimately, this will allow searches against a large number of shards, in other words across the entire cluster, with sustainable memory consumption on the coordinating node.
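To make the idea concrete, here is a minimal Python sketch of incremental top-N reduction. It is an illustration only, not the Elasticsearch implementation: the function names, the `(score, doc_id)` hit representation and the `batch_size` parameter are all assumptions made for this example. The point it shows is that the coordinating side only ever holds the running top N plus one batch of shard results, instead of every shard response at once.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical hit representation for this sketch: (score, doc_id).
Hit = Tuple[float, str]


def reduce_batch(partial: List[Hit], batch: List[List[Hit]], n: int) -> List[Hit]:
    """Merge one batch of shard results into the running top n (highest score first)."""
    candidates = list(partial)
    for hits in batch:
        candidates.extend(hits)
    return heapq.nlargest(n, candidates)


def incremental_top_n(shard_results: Iterable[List[Hit]], n: int, batch_size: int) -> List[Hit]:
    """Consume shard results as they arrive, reducing every `batch_size` responses."""
    partial: List[Hit] = []
    buffer: List[List[Hit]] = []
    for hits in shard_results:
        buffer.append(hits)
        if len(buffer) == batch_size:
            # Collapse the buffered responses so they can be discarded.
            partial = reduce_batch(partial, buffer, n)
            buffer = []
    if buffer:
        # Reduce whatever is left once all shards have responded.
        partial = reduce_batch(partial, buffer, n)
    return partial
```

Calling `incremental_top_n(shards, n=10, batch_size=5)` yields the same top hits as merging all shard results in one pass, while the amount of buffered data stays bounded by the batch size rather than the shard count.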

_field_stats is dead, long live _field_caps

After several PRs, the _field_stats API has been deprecated in favour of the _field_caps API, a more lightweight API that only exposes unique field names across indices/types, and whether they are searchable and/or aggregatable. One important feature of the _field_stats API was its ability to return the min/max values of fields, together with filtering capabilities, so that Kibana could figure out which time-based indices a query would match based on its time filter. This mattered because including too many indices in a search request could cause memory issues, since Elasticsearch had to hold all shard responses in memory on the coordinating node.

To address this, we introduced incremental reduction of shard responses, which keeps memory usage on the coordinating node under control. As a consequence, Kibana should now be able to query all indices all the time, and Elasticsearch will make sure to execute things efficiently. A common misunderstanding is that this change will make queries more costly since they now have to go to all shards. However, this was already the case with the _field_stats API. In addition, queries with a time filter that does not match any documents on a shard now return instantly, so we do not expect any performance regression.
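As a rough illustration of why a non-matching time filter can return instantly, the following Python sketch compares a query's time range with the min/max values a shard holds for that field. The names (`Relation`, `relate`) and the integer timestamps are assumptions made for this example and do not mirror Elasticsearch's internal API. A shard whose range is disjoint from the query range can answer without running the query at all.

```python
from enum import Enum


class Relation(Enum):
    DISJOINT = "disjoint"      # no document on the shard can match
    WITHIN = "within"          # every document with a value matches
    INTERSECTS = "intersects"  # the range query has to be executed


def relate(shard_min: int, shard_max: int, query_from: int, query_to: int) -> Relation:
    """Relate a shard's [shard_min, shard_max] values to an inclusive query range."""
    if query_to < shard_min or query_from > shard_max:
        return Relation.DISJOINT
    if query_from <= shard_min and shard_max <= query_to:
        return Relation.WITHIN
    return Relation.INTERSECTS
```

For example, a shard holding only March data would be `DISJOINT` for a query over the last 24 hours in April, so it can respond with no hits immediately.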

Non-deterministic distributions in Rally

Rally gained support for modelling non-deterministic distributions of operations. With this change you can model Poisson processes with Rally out of the box. Poisson processes are often used to model the arrivals of independent actors (think coffee shops, telephone hotlines and, most importantly for us, web services) with a defined mean "arrival rate" (i.e. throughput). They play a central role in queuing theory.
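For readers unfamiliar with Poisson processes, here is a small Python sketch of how such an arrival schedule can be generated. It is not Rally's code or configuration; the function name and parameters are assumptions made for this example. It relies on the standard property that inter-arrival times in a Poisson process are exponentially distributed with a mean of 1 / rate.

```python
import random
from typing import List


def poisson_arrival_times(rate: float, count: int, seed: int = 42) -> List[float]:
    """Return `count` arrival timestamps (in seconds) for a Poisson process.

    `rate` is the mean number of arrivals per second (the target throughput).
    """
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(count):
        # Exponentially distributed gap with mean 1 / rate.
        now += rng.expovariate(rate)
        times.append(now)
    return times
```

With `rate=10.0`, requests are issued on average every 100 ms, but individual gaps vary, which produces the bursts and lulls that a fixed-interval schedule never exercises.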

Changes in 5.3:

Graph analysis is disabled for shingle or CJK token filters with mixed shingle lengths, to prevent OOMs.
Lucene upgraded to 6.4.2.
Closing a ReleasableBytesStreamOutput closes and releases the underlying BigArray.
The filter and significant_terms aggs should parse the filter element as a filter, not as a query.
Custom sorting has been replaced with SortedSetSortField and SortedNumericSortField where possible.
Only the master node should be validating the minimum_master_nodes setting.
Response headers (e.g. deprecation warnings) should be preserved when creating an index and updating index settings.

Changes in 5.x:

A new single_node discovery type disables bootstrap checks so that users can test the transport client against a node started in Docker.
The process for merging sorted top docs has been simplified to use the same code path regardless of whether there are multiple shards or only a single shard with results.
The percolator now supports range queries which use now.
Settings marked with the Final property cannot be updated.
Sensitive EC2 settings are now stored in the Elasticsearch keystore.
InternalEngine's index/delete flow has been refactored for better clarity.
A partial snapshot that included some successful and some failed indices now reports the snapshot status as failed instead of throwing a NoSuchFileException.
Disabled retries in the S3 blob store as the Amazon S3 client already does retries.
Reindex-from-remote should use integer time values for the scroll timeout.
Hidden files/directories in the plugins directory are skipped when spawning native controllers (this leniency has been removed in master).
Upgraded Log4j to 2.8.2.

Changes in master:

Hidden files/directories are no longer allowed in the plugins directory.
Removed deprecated S3/EC2 signer settings.
Removed deprecated EC2 region setting.

Coming soon:

Cross-cluster search remote cluster info API and support for wildcard cluster names.
Automatically adjust search threadpool size to route around degraded nodes.

Apache Lucene

Elasticsearch master is being upgraded to a snapshot of Lucene 7.0.

With Lucene 7.0 coming soon, we want to make sure that we are not using Lucene in ways that won't be allowed anymore in Lucene 7. This will also give Lucene 7 more integration tests before it gets released. For the record, you can read about what Lucene 7 will bring at http://blog.mikemccandless.com/2017/03/apache-lucene-70-is-coming-soon.html.

Other changes:

The unified highlighter cannot highlight multi-term queries if they are wrapped in a BoostQuery.
Being able to use a different stored field with the unified highlighter can be useful when dealing with multi-lingual content.
Grouping collectors could be simplified.
SortedDocValues.ordValue is inconsistent with other similar APIs by not throwing an exception.
Can we enhance IndexUpgradeMergePolicy to not require the IndexWriter to be closed to perform an upgrade?
Norms could be encoded more accurately now that index-time boosts have been removed, but the backward-compatibility layer is not simple.
Adding the index creation version to segments metadata forced us to refuse to merge indices that had been created by different versions, since there would be no index creation version that would make sense afterwards. So we are now only storing the major version that was used at index creation time, and adding metadata to the segments about the oldest Lucene version that contributed to this segment.
Can suffix arrays help speed up infix queries?
Now that two-phase iteration has matured, should we remove Scorer.iterator() and only iterate matching docs with two-phase iteration?
Javadocs of the maximum token length of StandardAnalyzer are being updated to better describe the effect of this parameter.
Recent range query improvements will now work with the query cache.

Watch This Space

Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!
