25 May 2018

This Week in Elasticsearch and Apache Lucene - 2018-05-25

By Tom Callahan, Adrien Grand

Elasticsearch

HTTP Pipelining always enabled in 7.0

HTTP pipelining is enabled by default today. However, users can disable HTTP pipelining in Elasticsearch and then still send multiple requests to Elasticsearch at the same time from the client (i.e., behave like a pipeline-enabled client). In this situation, Elasticsearch’s behavior is undefined. To prevent confusion and errors, we merged a PR that removes the setting, so HTTP pipelining is no longer configurable in Elasticsearch and is always enabled.

Making context mandatory in the context suggester in 7.0

Querying a context-enabled completion field without a context is slow. While this is documented, it is also dangerous. Accordingly, we have decided to deprecate this behavior in 6.x and remove it in 7.0. If querying all contexts is necessary, it will still be possible to add a special "match_all" category context to all suggestions at index time. This approach requires reindexing, but it is efficient compared to the previous approach, which was not designed to handle the speed required for completion suggesters.
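The "match_all" workaround can be sketched in a few lines of Python. This is only an illustration under assumptions: the field name `suggest`, the context name `place_type`, and both helper functions are hypothetical, and the request shape follows the completion suggester's category-context format as we understand it.

```python
MATCH_ALL = "match_all"  # the catch-all category added to every suggestion


def with_match_all(suggestion, context_name):
    """Return a copy of a completion-field value with the catch-all category added.

    `suggestion` is the value indexed into the completion field, e.g.
    {"input": [...], "contexts": {"place_type": ["cafe"]}}.
    """
    contexts = dict(suggestion.get("contexts", {}))
    categories = list(contexts.get(context_name, []))
    if MATCH_ALL not in categories:
        categories.append(MATCH_ALL)
    contexts[context_name] = categories
    return {**suggestion, "contexts": contexts}


def suggest_all_body(field, context_name, prefix):
    """Build a suggest request body that reaches every suggestion via the catch-all category."""
    return {
        "suggest": {
            "all_suggestions": {  # hypothetical suggestion name
                "prefix": prefix,
                "completion": {
                    "field": field,
                    "contexts": {context_name: [MATCH_ALL]},
                },
            }
        }
    }
```

Every existing document would need to pass through something like `with_match_all` during the reindex, which is the cost mentioned above.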

Keyword splitting on whitespace at query time

In 6.x, Elasticsearch moved from splitting query_string queries on whitespace to using the normalizer only. Therefore, as of 6.x, a simple query like q=keyword_field:(new york) now creates a single term query "new york" targeting the keyword field. While this is intuitive, some users built functionality on top of the old behavior. Therefore, we have added an option to the keyword mappings that indicates a whitespace tokenizer should be used at query time. This works with all full text query parsers and doesn’t break the multi-word analysis of text fields.
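To make the difference concrete, here is a small Python model of the two query-time behaviors. It is a sketch, not Elasticsearch code: `keyword_query_terms` is a hypothetical helper, the field name `city` is made up, and the mapping option name `split_queries_on_whitespace` is quoted from our recollection of the keyword field documentation rather than from this post, so check it against the docs for your version.

```python
def keyword_query_terms(text, split_on_whitespace=False):
    """Model which terms a keyword field is queried with.

    Default 6.x behavior: the whole input becomes one term.
    With the new option: the input is split on whitespace first.
    """
    if split_on_whitespace:
        return text.split()
    return [text]


# Assumed shape of a keyword mapping that opts in to the old splitting behavior.
mapping = {
    "properties": {
        "city": {
            "type": "keyword",
            "split_queries_on_whitespace": True,  # assumed option name, see note above
        }
    }
}
```

With the default, `new york` is looked up as the single term "new york"; with the option enabled it is looked up as the two terms "new" and "york".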

Plugin Signature Verification

We added support for verifying signatures on official plugins during plugin installation. We already sign our artifacts with our GPG key, which gives users a way to validate the integrity and authenticity of our artifacts after they have downloaded them over the Internet. This week we built that check into the installer: any time a user installs an official plugin (e.g., bin/elasticsearch-plugin install analysis-icu) over the Internet on a release or snapshot build (or an internal staged release build), we check that the downloaded bits have a valid signature from the expected key.

Cross-cluster replication benchmarking

The CCR team is now able to benchmark our in-development cross-cluster replication feature, transferring 30GB of data between regions in Google Cloud. While much work remains, this is an important milestone as it allows the team to iterate on the default parameters that will be critical to out-of-the-box performance.

Improved authentication handling

We made a change to our authentication layer that prevents nodes from making multiple simultaneous authentication requests to external systems (such as LDAP) for the same user. While we already cached successful authentications, a few scenarios (such as many metricbeat instances connecting to ES at the same time) could still cause periodic spikes in load due to these duplicate authentication requests.
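The general idea of collapsing concurrent authentications into one external call can be sketched as follows. This is a minimal Python illustration of the technique, not the actual Elasticsearch implementation: the class names are hypothetical, and keying on the full credentials (rather than the user name alone) is a simplification chosen so that a caller with a wrong password never inherits another caller's result.

```python
import threading


class _Pending:
    """Holds the outcome of one in-flight authentication."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class CoalescingAuthenticator:
    """Lets only one request per credential pair reach the external system at a time."""

    def __init__(self, authenticate):
        self._authenticate = authenticate  # slow call to e.g. an LDAP server
        self._lock = threading.Lock()
        self._in_flight = {}  # (username, password) -> _Pending

    def authenticate(self, username, password):
        key = (username, password)
        with self._lock:
            pending = self._in_flight.get(key)
            leader = pending is None
            if leader:
                pending = _Pending()
                self._in_flight[key] = pending
        if leader:
            # The first caller performs the external request on behalf of everyone.
            try:
                pending.result = self._authenticate(username, password)
            except Exception as exc:
                pending.error = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                pending.done.set()
        else:
            # Later callers wait for the leader's outcome instead of calling out again.
            pending.done.wait()
        if pending.error is not None:
            raise pending.error
        return pending.result
```

A cache of successful results would sit in front of this; the sketch only covers the window in which the first request has not yet returned.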

Changes

Changes in 5.6:

  • Use correct cluster state version for node fault detection #30810

Changes in 6.3:

  • Security: fix dynamic mapping updates with aliases #30787
  • Move watcher-history version setting to _meta field #30832
  • [Security] Include an empty json object in an json array when FLS filters out all fields #30709
  • SQL: Preserve scoring in bool queries #30730
  • Upgrade to lucene-7.3.1 #30729

Changes in 6.4:

  • Modify state of VerifyRepositoryResponse for bwc #30762
  • REST high-level client: add put ingest pipeline API #30793
  • Use remote client in TransportFieldCapsAction #30838
  • Limit user to single concurrent auth per realm #30794
  • [Tests] Move templated _rank_eval tests #30679
  • Ensure that ip_range aggregations always return bucket keys. #30701
  • Force stable file modes for built packages #30823
  • Send client headers from TransportClient #30803
  • Add support for indexed shape routing in geo_shape query #30760
  • Add a format option to docvalue_fields. #29639
  • Only ack cluster state updates successfully applied on all nodes #30672
  • Replace Request#setHeaders with addHeader #30588
  • Reduce CLI scripts to one-liners on Windows #30772
  • [Feature] Adding a char_group tokenizer #24186
  • Increase the maximum number of filters that may be in the cache. #30655
  • Enable installing plugins from snapshots.elastic.co #30765
  • Ignore empty completion input #30713
  • Add Delete Repository High Level REST API #30666
  • Reduce CLI scripts to one-liners #30759

Changes in 7.0:

  • BREAKING: Use geohash cell instead of just a corner in geo_bounding_box #30698
  • Reintroduce mandatory http pipelining support #30820
  • Expose Lucene’s FeatureField. #30618
  • Simplify number of shards setting #30783
  • Make http pipelining support mandatory #30695
  • BREAKING: Scripting: Remove getDate methods from ScriptDocValues #30690

Lucene

Impacts for synonym and phrase queries

The fact that codecs expose raw impacts makes it possible to speed up more queries when the total hit count is not tracked. For instance, merging impacts by taking the sum of term frequencies for all involved terms allows us to compute upper bounds on the scores of synonym queries, which are typically created by query parsers for terms that occur at the same position. This in turn speeds up synonym queries quite a bit by skipping blocks of documents whose score upper bound is less than the current minimum score required for a hit to be competitive. We explored doing the same for phrase queries by taking the minimum of the term frequencies of all involved terms, which also looks promising even though the speedup is less spectacular than for synonym queries.
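The block-skipping idea can be illustrated with a toy Python model. This is a sketch under simplifying assumptions, not Lucene code: each block is reduced to the maximum term frequency of each involved term, length normalization is ignored, and the scoring function is a BM25-like saturation curve picked only because it increases with frequency. All function names are hypothetical.

```python
def tf_score(freq, k1=1.2):
    """Saturating term-frequency score; grows monotonically with freq (norms omitted)."""
    return freq * (k1 + 1) / (freq + k1)


def synonym_upper_bound(max_freqs):
    """Synonyms are scored as one term, so per-term frequencies add up."""
    return tf_score(sum(max_freqs))


def phrase_upper_bound(max_freqs):
    """A phrase cannot occur more often than its rarest term."""
    return tf_score(min(max_freqs))


def competitive_blocks(blocks, upper_bound, min_competitive_score):
    """Return indices of blocks that may still contain a competitive hit.

    `blocks` holds, per block, the maximum frequency of each involved term.
    Blocks whose upper bound falls below the minimum score are skipped.
    """
    return [
        i
        for i, max_freqs in enumerate(blocks)
        if upper_bound(max_freqs) >= min_competitive_score
    ]


# Three blocks, two terms each (made-up numbers for illustration).
blocks = [[1, 0], [3, 4], [1, 1]]
synonym_survivors = competitive_blocks(blocks, synonym_upper_bound, 1.2)
phrase_survivors = competitive_blocks(blocks, phrase_upper_bound, 1.2)
```

In this toy model the bounds are safe because the score only increases with frequency, so scoring the merged maximum frequency can never underestimate a real document in the block.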

Other