May 25, 2018

This Week in Elasticsearch and Apache Lucene - 2018-05-25

Elasticsearch

HTTP Pipelining always enabled in 7.0

HTTP pipelining is enabled by default today. However, users can disable HTTP pipelining in Elasticsearch and then send multiple requests to Elasticsearch at the same time from the client (i.e., behave like a pipeline-enabled client). In this situation, Elasticsearch’s behavior is undefined. To prevent confusion and errors, we merged a PR that removes the setting, so HTTP pipelining is no longer configurable and is therefore always enabled.
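As a rough illustration of why the server has to take part in pipelining, here is a minimal Python sketch of response ordering: requests on one connection may finish out of order, but responses must be written back in request order. The class and method names are hypothetical and this is not the Elasticsearch implementation.

```python
import heapq


class PipelinedResponseOrderer:
    """Buffers finished responses and releases them in request order (illustrative only)."""

    def __init__(self):
        self._next_to_send = 0  # sequence number of the next response to write
        self._pending = []      # min-heap of (sequence, response)

    def complete(self, sequence, response):
        """Record a finished response and return those that can be written now."""
        heapq.heappush(self._pending, (sequence, response))
        ready = []
        # Release responses only while the head of the heap is the next one expected.
        while self._pending and self._pending[0][0] == self._next_to_send:
            ready.append(heapq.heappop(self._pending)[1])
            self._next_to_send += 1
        return ready
```

Without this kind of bookkeeping, a client that pipelines requests could receive responses in completion order and match them to the wrong requests.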

Making context mandatory in the context suggester in 7.0

Querying a context-enabled completion field without a context is slow. While this is documented, it is also dangerous. Accordingly, we have decided to deprecate this behavior in 6.x and remove it in 7.0. If querying all contexts is necessary, it will still be possible to add a special "match_all" category context to all the suggestions at index time. This approach requires reindexing, but it is efficient compared to the previous approach, which was not designed to handle the speed required for completion suggesters.
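The "match_all" workaround can be sketched as plain request bodies built in Python. The field name "suggest", the context name "place_type", the suggestion name "place_suggestion", and the helper functions are all hypothetical; only the general shape of the completion mapping and suggest request follows the context suggester documentation. The mapping is shown as a bare "properties" fragment.

```python
MATCH_ALL = "match_all"  # catch-all category value added to every suggestion

# Mapping fragment for a completion field with one category context (names are assumptions).
mapping = {
    "properties": {
        "suggest": {
            "type": "completion",
            "contexts": [{"name": "place_type", "type": "category"}],
        }
    }
}


def suggestion_doc(text, categories):
    """Build a document whose suggestion also carries the catch-all category."""
    values = list(categories)
    if MATCH_ALL not in values:
        values.append(MATCH_ALL)
    return {"suggest": {"input": [text], "contexts": {"place_type": values}}}


def suggest_request(prefix, categories=None):
    """Build a suggest request; with no categories given, query the catch-all context."""
    contexts = list(categories) if categories else [MATCH_ALL]
    return {
        "suggest": {
            "place_suggestion": {
                "prefix": prefix,
                "completion": {"field": "suggest", "contexts": {"place_type": contexts}},
            }
        }
    }
```

Because every document is indexed with the extra category, a request that would previously have omitted the context can name "match_all" instead and still be served by the context-aware lookup.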

Keyword splitting on whitespace at query time

In 6.x, Elasticsearch moved from splitting query_string queries on whitespace to using the normalizer only. As a result, a simple query like q=keyword_field:(new york) now creates a single term query "new york" targeting the keyword field. While this is intuitive, some users built functionality on top of the old behavior. We have therefore added an option to the keyword mappings that indicates a whitespace tokenizer should be used at query time. This works with all full text query parsers and doesn’t break the multi-word analysis of text fields.
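The difference between the two behaviors can be sketched in a few lines of Python. The post does not name the new option; the keyword field reference documents it as split_queries_on_whitespace, so treat that name as an assumption here. The helper function is hypothetical and ignores the normalizer entirely.

```python
# Keyword field mapping fragment with query-time whitespace splitting turned on
# (option name taken from the keyword field reference, not from this post).
keyword_mapping = {"type": "keyword", "split_queries_on_whitespace": True}


def keyword_query_terms(text, split_on_whitespace=False):
    """Return the terms a keyword field would be queried with (simplified model)."""
    if split_on_whitespace:
        # Old 5.x-style behavior: one term per whitespace-separated token.
        return text.split()
    # 6.x default: the whole input is a single term.
    return [text]
```

With the default, "new york" is looked up as one term; with splitting enabled, it becomes the two terms "new" and "york".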

Plugin Signature Verification

We added support for verifying signatures on official plugins during plugin installation. We already sign our artifacts with our GPG key, which gives users a way to validate the integrity and authenticity of our artifacts after they have downloaded them over the Internet. This week we added the following check: any time a user installs an official plugin (e.g., bin/elasticsearch-plugin install analysis-icu) over the Internet on a release or snapshot build (or an internal staged release build), we check that the downloaded bits have a valid signature from the expected key.

Cross-cluster replication benchmarking

The CCR team has gotten to the point of being able to benchmark our in-development cross-cluster replication feature, transferring 30GB of data between regions in Google Cloud. While much work yet remains, this is an important milestone as it allows the team to iterate on the default parameters that will be critical to out-of-the-box performance.

Improved authentication handling

We made a change to our authentication layer that will prevent nodes from making multiple simultaneous authentication requests to external systems (such as LDAP) for the same user. While we already cached successful authentications, a few scenarios (such as many metricbeat instances connecting to ES at the same time) could still cause periodic spikes in load due to these duplicative authentication requests.
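The idea of collapsing simultaneous authentication requests can be sketched in Python as follows. This is an illustration of the general pattern under stated assumptions, not the code from the PR: the class name and the backend callable are hypothetical, and the sketch keys in-flight requests on the full credentials so that a caller with a different password never shares another caller's result.

```python
import threading
from concurrent.futures import Future


class CoalescingAuthenticator:
    """Lets concurrent identical authentication attempts share one backend call."""

    def __init__(self, backend):
        self._backend = backend   # hypothetical callable(user, password) -> bool, e.g. an LDAP bind
        self._lock = threading.Lock()
        self._in_flight = {}      # (user, password) -> Future for the running attempt

    def authenticate(self, user, password):
        key = (user, password)
        with self._lock:
            future = self._in_flight.get(key)
            owner = future is None
            if owner:
                # First caller for these credentials: register a future others can wait on.
                future = Future()
                self._in_flight[key] = future
        if owner:
            try:
                future.set_result(self._backend(user, password))
            except Exception as exc:
                future.set_exception(exc)
            finally:
                with self._lock:
                    del self._in_flight[key]
        # Owners and waiters alike read the shared outcome (or re-raise its exception).
        return future.result()
```

This sketch does no caching: once an attempt finishes, the next call reaches the backend again. Caching successful authentications is a separate layer.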

Changes

Changes in 5.6:

Use correct cluster state version for node fault detection #30810

Changes in 6.3:

Security: fix dynamic mapping updates with aliases #30787
Move watcher-history version setting to _meta field #30832
[Security] Include an empty json object in an json array when FLS filters out all fields #30709
SQL: Preserve scoring in bool queries #30730
Upgrade to lucene-7.3.1 #30729

Changes in 6.4:

Modify state of VerifyRepositoryResponse for bwc #30762
REST high-level client: add put ingest pipeline API #30793
Use remote client in TransportFieldCapsAction #30838
Limit user to single concurrent auth per realm #30794
[Tests] Move templated _rank_eval tests #30679
Ensure that ip_range aggregations always return bucket keys. #30701
Force stable file modes for built packages #30823
Send client headers from TransportClient #30803
Add support for indexed shape routing in geo_shape query #30760
Add a format option to docvalue_fields. #29639
Only ack cluster state updates successfully applied on all nodes #30672
Replace Request#setHeaders with addHeader #30588
Reduce CLI scripts to one-liners on Windows #30772
[Feature] Adding a char_group tokenizer #24186
Increase the maximum number of filters that may be in the cache. #30655
Enable installing plugins from snapshots.elastic.co #30765
Ignore empty completion input #30713
Add Delete Repository High Level REST API #30666
Reduce CLI scripts to one-liners #30759

Changes in 7.0:

BREAKING: Use geohash cell instead of just a corner in geo_bounding_box #30698
Reintroduce mandatory http pipelining support #30820
Expose Lucene’s FeatureField. #30618
Simplify number of shards setting #30783
Make http pipelining support mandatory #30695
BREAKING: Scripting: Remove getDate methods from ScriptDocValues #30690

Lucene

Impacts for synonym and phrase queries

The fact that codecs expose raw impacts makes it possible to speed up more queries when the total hit count is not tracked. For instance, merging impacts by taking the sum of term frequencies for all involved terms makes it possible to compute upper bounds on the scores of synonym queries, which are typically created by query parsers for terms that occur at the same position. This in turn speeds up synonym queries quite a bit by skipping blocks of documents whose score upper bound is less than the current minimum score required for a hit to be competitive. We explored doing the same for phrase queries by taking the minimum of the term frequencies of all involved terms, which also looks promising even though the speedup is less spectacular than for synonym queries.
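The bounds described above can be sketched in Python under heavy simplification. Each block is represented only by the maximum term frequency of each involved term, length norms are ignored, and the scoring function is a BM25-style saturation chosen for illustration. None of the names below are Lucene APIs.

```python
def tf_score(freq, k1=1.2):
    """Simplified BM25-style saturation; increases monotonically with freq."""
    return freq / (freq + k1)


def synonym_upper_bound(max_freqs):
    """A synonym match scores on the summed frequency, so sum the per-term maxima."""
    return tf_score(sum(max_freqs))


def phrase_upper_bound(max_freqs):
    """A phrase cannot occur more often than its rarest term, so take the minimum."""
    return tf_score(min(max_freqs))


def competitive_blocks(blocks, upper_bound, min_competitive_score):
    """Return indices of blocks that may still contain a competitive hit."""
    return [
        index
        for index, max_freqs in enumerate(blocks)
        if upper_bound(max_freqs) >= min_competitive_score
    ]
```

Both bounds are safe because the score increases with frequency: no document in a block can have a summed frequency above the sum of the per-term maxima, nor a phrase frequency above the smallest per-term maximum. Any block whose bound falls below the current minimum competitive score can be skipped without scoring its documents.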

Other

There was a concurrency issue in the way that we publish deletes for merges.
A user reported that SmartChineseAnalyzer doesn't deal correctly with surrogate pairs. The team helped fix it so that its name is less embarrassing. :)
IndexWriter is quite a beast, with complex dependencies on merge policies, merge schedulers, deletion policies, etc. There is a proposal to clean this up a bit by detaching IndexWriter from merge policies.
TestRandomChains continues to find some corner cases with the new ConditionalTokenFilter.
We wonder whether we could add a multiplexing token filter, in order for a token filter to feed multiple children with the same tokens.
There are ongoing discussions about the way we should expose matching terms for the current position in the new matches API.
Lucene's mock WindowsFS was assuming that the inode of a file would change when the file is moved, but this is not true with HardLinkCopyDirectoryWrapper, which copies files using hard links.
We have a proposal for a new ConcatenateFilter to concatenate the content of all produced tokens.
Unreferenced files may remain in the index after a commit because of recent changes, which boiled down to a missing checkpoint. It is now fixed.
We are discussing the impact of handling more than 2B documents in a single index. There are concerns about making this number unbounded, but maybe we could go with a higher limit such as 16B.