June 6, 2017

This Week in Elasticsearch and Apache Lucene - 2017-06-05

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Elastic Cloud Enterprise–#ElasticCloud run by you on your own environment, on your boxes, for your users. Learn more https://t.co/UZrNYHOKi8 pic.twitter.com/N6bUyWTHhe
— elastic (@elastic) June 2, 2017

Typeless parent/child support

With the removal of types, we had to come up with a way to support parent/child joins that didn't depend on having multiple mapping types in an index. Good progress has been made in 5.5 with the introduction of the ParentJoinFieldMapper, a field mapper that creates parent/child relations between documents of the same index. Only one join field is allowed per index. The join field supports eager_global_ordinals, which defaults to true. Parent/child queries and aggregations have been updated to work with the new join field. Existing 5.x parent/child indices will continue to work as before in 6.x.
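To make the join field more concrete, here is a minimal sketch of what a mapping and two documents could look like. It only builds the request bodies as Python dicts. The field name my_join_field and the question/answer relation are hypothetical, and the syntax follows the join field as documented in later releases, so the exact form in 5.5 may differ.

```python
import json

# Hypothetical mapping fragment: a single join field that declares
# "question" as the parent relation and "answer" as its child.
mapping_properties = {
    "properties": {
        "my_join_field": {
            "type": "join",
            "relations": {"question": "answer"},
        }
    }
}

# A parent document only names its side of the relation.
parent_doc = {
    "text": "How do typeless joins work?",
    "my_join_field": {"name": "question"},
}

# A child document names its relation and the ID of its parent.
# Assumption: the child is indexed with routing set to the parent ID
# so that both documents land on the same shard.
child_doc = {
    "text": "Through a join field in the same index.",
    "my_join_field": {"name": "answer", "parent": "1"},
}

if __name__ == "__main__":
    for body in (mapping_properties, parent_doc, child_doc):
        print(json.dumps(body, indent=2))
```

Both documents live in the same index and the same mapping, which is the point of the typeless approach: the relation is carried by the field value rather than by the mapping type.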

Changes in 5.4:

The context suggester can now read contexts from keyword fields.
When a terms agg has a sub-agg which can't be deferred, use global_ordinals_hash instead of global_ordinals to reduce memory requirements.
The circuit breaker should be incremented before allocating bytes to a BigArray.
PatternAnalyzer should lowercase wildcard queries when lowercase is true.
The number of processes is now set in the systemd unit file.
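The circuit breaker item in the list above comes down to ordering: account for the bytes first, allocate second. The following is a minimal sketch of that pattern, not Elasticsearch's implementation. SimpleBreaker, CircuitBreakingException and allocate are hypothetical names used only for illustration.

```python
class CircuitBreakingException(Exception):
    """Raised when a reservation would exceed the breaker limit."""


class SimpleBreaker:
    """Hypothetical breaker that tracks reserved bytes against a limit."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def add_estimate(self, num_bytes):
        # Refuse the reservation before any memory is touched.
        if self.used_bytes + num_bytes > self.limit_bytes:
            raise CircuitBreakingException(
                f"would use {self.used_bytes + num_bytes} bytes, "
                f"limit is {self.limit_bytes}"
            )
        self.used_bytes += num_bytes

    def release(self, num_bytes):
        self.used_bytes -= num_bytes


def allocate(breaker, num_bytes):
    # Reserve first: if the breaker trips, nothing has been allocated yet.
    breaker.add_estimate(num_bytes)
    try:
        return bytearray(num_bytes)
    except MemoryError:
        # Give the reservation back if the allocation itself fails.
        breaker.release(num_bytes)
        raise
```

If the order were reversed, a large request could already have consumed the memory by the time the breaker had a chance to reject it.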

Changes in 5.x:

POST requests with missing content no longer throw an NPE.
The kv ingest processor now aligns with Logstash by supporting exclude_keys.
The Ingest date processor used floats instead of doubles, which led to approximate dates.
The moving_avg agg didn't always set the correct doc_count.
Field collapsing search requests did not include the time it took to retrieve inner hits.
Added XContent serialization for search-scroll request, clear scroll request, and clear scroll response.
Include the name of the duplicate JAR causing jarhell.
Eliminate array access in tight loops when profiling is enabled.
The --purge option to elasticsearch-plugin will remove config files when uninstalling a plugin.
Always accumulate transport exceptions to avoid the situation where no response is returned due to failure, but no exceptions are reported.

Changes in master:

[BREAKING] The get-index API no longer allows comma-separated values for _mapping, _settings, _aliases.
[BREAKING] TokenizerFactory#name has been removed in favour of CustomAnalyzer#getTokenizerName.
The new nodes-usage API provides information about how often different features are used. Currently this is limited to counting calls to REST endpoints, but these stats can be expanded as needed.
The Java High Level REST client now supports search.
The primary term is only incremented once all in-flight requests have completed, allowing for a clean transition of primary term.
If the primary throws an exception while handling the response from a replica, the primary should be failed, not the replica.
When a replica is promoted to primary, all gaps in the sequence number history should be filled with noops.
The IndexDeletionPolicy should be internal to the engine, not passed in from IndexShard.
The decision to delete a transaction log is no longer hard coded, but contained in the new TranslogDeletionPolicy.
The matrix_stats aggregation now reports the doc_count, and the significant terms agg reports the superset_size.
Since the Lucene 7 upgrade, the script-sort always returned Double.MAX_VALUE when the script returned a number.
Stored scripts take an optional context parameter which, when present, will cause the script to be compiled and validated at PUT time. Painless scripts are able to use these script contexts.
PainlessScript is now an interface, and the Painless compiler uses an instance per context.
Add StatefulFactoryType as optional intermediate factory in script contexts to support search scripts which create a script per Lucene segment.
Search response times are now tracked as an EWMA, which will be used to auto-resize the search threadpool queue.
Plugins can now register pre-configured character filters.
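For readers unfamiliar with the EWMA (exponentially weighted moving average) mentioned in the list above, here is a minimal sketch of the general technique. The class name Ewma and the smoothing factor are assumptions for illustration; the post does not say which values Elasticsearch uses or how the queue is resized from the result.

```python
class Ewma:
    """Exponentially weighted moving average of a stream of samples."""

    def __init__(self, alpha, initial):
        # alpha close to 1 favors recent samples; close to 0 favors history.
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = float(initial)

    def add(self, sample):
        # new_average = alpha * sample + (1 - alpha) * old_average
        self.value = self.alpha * sample + (1.0 - self.alpha) * self.value
        return self.value


if __name__ == "__main__":
    response_time_ms = Ewma(alpha=0.5, initial=100.0)
    print(response_time_ms.add(200.0))  # 150.0
    print(response_time_ms.add(50.0))   # 100.0
```

Because each update needs only the previous average and the new sample, the tracker uses constant memory regardless of how many requests have been observed.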

Apache Lucene

Lucene 7.0

We expect to cut the branch during or soon after Berlin Buzzwords.

Lucene 6.6.0

After several respins, it looks like the vote for 6.6.0 is about to pass.

Custom term frequencies

Lucene will soon allow you to index custom term frequencies. This is useful if you want to use term frequencies to hold scoring signals. Previously, the only way to do this was with payloads, which come with a higher cost in terms of CPU usage and disk requirements. However, this remains a very expert feature, which only works with the DOCS_AND_FREQS index options.
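The idea of using the frequency slot as a signal can be illustrated with a toy inverted index. This is a conceptual sketch only and does not use the Lucene API. ToyIndex and the popularity term are hypothetical.

```python
from collections import defaultdict


class ToyIndex:
    """Toy postings structure: term -> {doc_id: frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, doc_id, term_freqs):
        # The caller supplies the frequency directly instead of having it
        # derived from how many times the term occurs in the text.
        for term, freq in term_freqs.items():
            if freq < 1:
                raise ValueError("frequency must be at least 1")
            self.postings[term][doc_id] = freq

    def freq(self, term, doc_id):
        # Returns 0 when the term or the document is absent.
        return self.postings.get(term, {}).get(doc_id, 0)


if __name__ == "__main__":
    index = ToyIndex()
    # Store a per-document signal in the frequency of a synthetic term.
    index.add("doc1", {"popularity": 42})
    index.add("doc2", {"popularity": 7})
    print(index.freq("popularity", "doc1"))
```

A scorer can then read the signal at the same cost as reading any other term frequency, without decoding a payload for each posting.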

Module dependencies

The introduction of a dependency from the classification module on the sandbox module triggered a discussion about when it is fine for modules to depend on each other.

Other changes:

Our copyright notices were outdated.
There is disagreement about whether the terms index should be loaded lazily in memory.
Could we detect duplicate postings lists? For instance, this happens by design if you index the same field both normally and with a ReverseStringFilter in order to support suffix queries.
Should BKD trees store the actual bounds in each dimension on the leaves instead of relying on approximate values provided by the BKD index?
IndexSearcher exposes a way to customize how searching a single index can be parallelized, but this feature can be used in such a way that it breaks searchAfter.
Range queries on range fields could get faster by short-circuiting the query on inner nodes when all points either match or do not match.
Support for legacy numerics (postings-based rather than points-based) has been removed from Lucene but will stay in Solr for one additional major version.
The factory of the wikipedia tokenizer does not expose all its options.
Gaps in graph phrase queries were ignored.
Can we optimize norms for the case that values have fewer than 16 terms?
Should we move to doc values to store block-join relations?
There is ongoing work on having a geo bounding box field.
The introspection API of TermInSetQuery is not good.
Join queries had equals/hashcode bugs.
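The short-circuiting idea for range queries on range fields, mentioned in the list above, can be sketched on a tiny tree. This is an illustration of the general technique under simplifying assumptions, not Lucene's BKD code: Node and intersects are hypothetical names, nodes are assumed non-empty, and the bounds are recomputed on each call where a real tree would store them.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Range = Tuple[int, int]  # an indexed value: (min, max), inclusive


@dataclass
class Node:
    ranges: List[Range] = field(default_factory=list)     # leaf values
    children: List["Node"] = field(default_factory=list)  # inner node

    def all_ranges(self) -> List[Range]:
        if not self.children:
            return list(self.ranges)
        out: List[Range] = []
        for child in self.children:
            out.extend(child.all_ranges())
        return out

    def bounds(self):
        # (smallest min, largest min, smallest max, largest max)
        rs = self.all_ranges()
        mins = [a for a, _ in rs]
        maxs = [b for _, b in rs]
        return min(mins), max(mins), min(maxs), max(maxs)


def intersects(node: Node, qlo: int, qhi: int, stats: dict) -> List[Range]:
    """Return all indexed ranges that intersect the query [qlo, qhi]."""
    stats["visited"] += 1
    min_lo, min_hi, max_lo, max_hi = node.bounds()
    if min_lo > qhi or max_hi < qlo:
        return []  # nothing below this node can match: stop here
    if min_hi <= qhi and max_lo >= qlo:
        return node.all_ranges()  # everything below matches: no per-value checks
    if not node.children:
        return [(a, b) for a, b in node.ranges if a <= qhi and b >= qlo]
    out: List[Range] = []
    for child in node.children:
        out.extend(intersects(child, qlo, qhi, stats))
    return out
```

With a wide query the root alone decides the result, so the leaves are never compared value by value. With a narrow query only the subtrees whose bounds straddle the query need individual checks.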

Watch This Space

Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!
