This Week in Elasticsearch and Apache Lucene - 2019-05-24

Elasticsearch

Highlights

Deprecation info API

As part of an effort to keep the deprecation info API up to date as we go (as opposed to a last minute push),we opened an issue to track breaking changes in 8.0 that need additions to this API (#42404).

Enrich

We have merged the PR that adds a force merge step to the enrich policy execution (#41969).

Work has also started on integrating security with the enrich feature as is described here.

We've started looking into how processor/pipeline can indicate that it needs to run on a different node. This is needed, because in the case when a write request with a pipeline is sent to a node that doesn't have an enrich index shard or that shard isn't ready yet. This should work similarly to how we forward write requests with a pipeline from a non ingest node to an ingest node.

Watcher UI

Work has continued on making improvements to Watcher. Recently this focused on the watch list view) and threshold alert.

watcher ui alert threshold improvements

Snapshot Repositories UI

We have instrumented Snapshot Repositories with UI metrics, and also worked on adding logic to warn users and disable delete for a cloud-managed repository.

cloud snapshot repositories UI warning

Bye Bye Transport Client

From 7.0 the high level REST Client has feature parity with the transport client, so we have started on the daunting task to remove the transport client. This will also move us closer to the removal of the Guice library. As part of this, the migrate tool was deprecated in 7.x and removed in master. This tool was originally created to allow early users of security to move from the file realm to the native realm when it was introduced. At this point, the tool does not add much value and its removal simplified removing the transport client from x-pack. We have started by removing the TestXPackTransportClient and then moved on to removing transport client usage in x-pack.

Performance

We merged Rally#688 resolving a major usability issue (Rally#478) where wrong user specified track parameters, e.g. due to a typo, don't show a warning and may end up running a wrong benchmark. We took the extra mile and display a list of the wrong parameters, the entire list of user configurable parameters as well as possible correct alternatives using Levenshtein style approximate string matching.

Types Removal

Work has begun on types removal for 8.0. We have started to remove typed URL paths in the REST layer and have also done some cleanup to completely remove internal references to types in the search APIs.

Multi-fields

We have deprecated+removed support for chained multi-fields. As of 8.0 it won't be possible to define a multi-field within a multi-field. This closes a long standing issue that was opened to limit multi-fields to relevant settings only.

Search low-level cancellation

A PR was merged that activates low-level cancellation for search by default. The change is minimal but it required a lot of benchmarking in order to ensure that it does not impact search performance. Low-level cancellation is required to improve the responsiveness of search cancellation for long running queries and will benefit to systems (Kibana, ...) that cannot change the default settings of an Elasticsearch cluster.

Removal of node.max_local_storage_nodes setting

We have deprecated and removed the node.max_local_storage_nodes setting. This setting, which prior to Elasticsearch 5 was enabled by default and caused all kinds of confusion, has since been disabled by default and is not recommended for production use. The prefered way going forward is for users to explicitly specify separate data folders for each started node to ensure that each node is consistently assigned to the same data path.

The main step in getting rid of this setting was to move the test infrastructure away. This also changes the behavior of ESIntegTestCases so that starting up nodes will not automatically reuse data folders of previously stopped nodes.

Finally, this also opens up the possibility to remove the "nodes/$nodeOrdinal" folder prefix from the data path in 8.0.

Geo

We've merged the initial implementation of the major serialized tree structure for storing polygons as doc-values (#40810 and #42331).

We revisited the tessellation logic after a user reported a failure in a valid polygon in the discuss forum. This new issue showed clearly that the tessellator logic was failing when a hole had a shared vertex with the polygon.  We decided to change the logic when eliminating holes to detect if a hole has a shared vertices and use that vertex as the hole bridge. In order to do that efficiently the algorithm uses the bounding box of the hole to filter out potential candidates. This changed allowed us to remove most of the complex logic we had previously added.

Lucene

Lucene 8.1.1

A 8.1.1 release has been called in order to fix a NPE when a Solr collection is called via an alias. This release has no changes on the Lucene side however.

How to enforce a maximum clause count?

Currently Lucene enforces that boolean queries have at most 1024 clauses. Some users have been working around this limit by using multiple levels of boolean queries so that they can have more than 1024 clauses in total but each boolean query has no more than 1024 clauses so Lucene doesn't complain about the clause count. However we recently introduced a new rewrite rule that flattens disjunctions within disjunctions, effectively defeating this workaround, which is surprising some of our users. We are considering deactivating this rewrite rule when it would hit the clause count limit, but this also triggered a separate discussion about whether we could count clauses across the entire query tree as opposed to on a per-BooleanQuery basis.

Other

  • There is now a way to sort based on the value of a FeatureField. This means that in the future Elasticsearch users will be able to sort on a rank_feature field without having to add a float field in a multi-field as well.
  • There is a lot of work ongoing around getting a working Gradle build.
  • A bug was identified around concurrent rollbacks and reopens causing leakage of segment readers.
  • Javadocs of our analysis components are getting improved.
  • Range fields are getting doc-value support. Elasticsearch has its own, slightly different implementation of doc-value support for range fields due to the fact that it needs to support multiple ranges per document, which the Lucene implementation doesn't support.
  • StoredFieldsVisitor#stringField will now take a String instead of the UTF8 representation of the string as a byte[].
  • The conjunction scorer that leverages maximum per-clause scores was enhanced so that it works better when sub-clauses have two-phase iterators, such as phrase queries.
  • There are ongoing discussions regarding the introduction of a Collector-based rescorer.
  • Can we make codecs more efficiently handle boolean doc values?

Changes

Changes in Elasticsearch

Changes in 8.0:

  • Remove transport client from xpack #42202
  • BREAKING: Remove deprecated search.remote settings #42381
  • BREAKING: Remove node.max_local_storage_nodes #42428
  • Remove deprecated Repository methods #42359
  • Fix TopHitsAggregationBuilder adding duplicate _score sort clauses #42179
  • add 7.1.1 and 6.8.1 versions #42253
  • BREAKING: Remove the migrate tool #42174
  • Add ChaCha20 TLS ciphers on Java 12+ #42155

Changes in 7.3:

  • Bug fix to allow access to top level params in reduce script #42096
  • Avoid HashMap construction on Grok non-match #42444
  • Improve Close Index Response #39687
  • Allow fields to be set to * #42301
  • Bulk processor concurrent requests #41451
  • Rename SearchRequest#crossClusterSearch #42363
  • Fix alpha build error message when generate version object from version string #40406
  • Search - enable low_level_cancellation by default. #42291
  • Deprecate support for chained multi-fields. #41926

Changes in 7.2:

  • Fix settings prefix for realm truststore password #42336
  • Implement XContentParser.genericMap and XContentParser.genericMapOrdered methods #42059
  • Upgrade to Lucene 8.1.0 #42214
  • Add .code_internal-* index pattern to kibana user #42247
  • Remove IndexShard dependency from Repository #42213
  • Merge claims from userinfo and ID Token correctly #42277
  • Peer recovery should flush at the end #41660
  • Use translog to estimate number of operations in recovery #42211
  • [ML Data Frame] Persist and restore checkpoint and position #41942
  • Add experimental and warnings to vector functions #42205
  • Prevent in-place downgrades and invalid upgrades #41731
  • Validate non-secure settings are not in keystore #42209
  • Update to joda time 2.10.2 #42199
  • Fix FiltersAggregation NPE when filters is empty #41459
  • Update ciphers for TLSv1.3 and JDK11 if available #42082
  • Hash token values for storage #41792
  • Do not refresh realm cache unless required #42169
  • Protect logged exec spooling from no output #42177
  • Use local outputstream reference #42180
  • Deprecate the native realm migration tool #42142
  • Hide bwc build output on success #42102
  • Cacheability improvements for thirdparty audit task #42085
  • SQL: Add initial geo support #42031
  • Don't create tempdir for cli scripts #41913
  • Cleanup plugin bin directories #41907
  • Prevent order being lost for _nodes API filters #42045
  • [ML] Complete the Data Frame task on stop #41752

Changes in 7.1:

  • Move the FIPS configuration back to the build plugin #41989
  • 7.1.0 release notes initial commit #42208
  • Implement factory methods for ValidationException #41993

Changes in 7.0:

  • Remove deprecated _source_exclude and _source_include from get API spec #42188
  • Build local year inside DateFormat lambda #42120

Changes in 6.8:

  • Avoid bubbling up failures from a shard that is recovering #42287
  • Recovery with syncId should verify seqno infos #41265
  • Execute actions under permit in primary mode only #42241
  • Avoid unnecessary persistence of retention leases #42299
  • 6.8 release notes #41802
  • Fix max boundary for rollup jobs that use a delay #42158
  • SQL: Fix issue regarding INTERVAL * number #42014
  • Enforce transport TLS on Basic with Security #42150

Changes in Elasticsearch Hadoop Plugin

Changes in 7.3:

  • [DOCS] Updates version attribute file #1296

Changes in 7.1:

  • Fix the mapping client call on ES 5.x and earlier. #1284

Changes in 6.8:

  • Fix incorrect scroll termination for ES 2.x #1287

Changes in Elasticsearch Management UI

Changes in 7.2:

  • Snapshot Repositories UI #34407

Changes in Elasticsearch SQL ODBC Driver

Changes in 7.2:

  • Fix: add pyodbc as instance member to testing class #156
  • DSN parameter: support for "index_include_frozen" #151
  • Introduce geo support #154

Changes in Rally

Changes in 1.2.0:

  • Assume UTC timezone if not specified #693

  • Measure execution time of bulk ingest pipeline #694

  • Fail Rally early if there are unused variables in track-params #688