This Week in Elasticsearch and Apache Lucene - 2019-03-15

Elasticsearch

Zen 2

We published a blog post about Zen2. As the blog title suggests, it is truly a new era for cluster coordination in Elasticsearch.

We also published docs for the "elasticsearch-node" command line tool, which allows you to unsafely recover a cluster after a majority of master-eligible nodes have been permanently lost.

Snapshot Lifecycle Management

We now have a feature branch for snapshot lifecycle management, named snapshot-lifecycle-management. It's the first step towards having it as a feature and includes the framework for put/get/delete snapshot lifecycles.

SQL

We fixed a bug that made SQL report an incorrect error when a field in a hierarchy of sub-fields contained an array of values (as opposed to a single value). This was more challenging than just reporting the correct error type, because the entire idea of a sub-field containing an array of values or other sub-fields had been ignored. The bug was discovered while testing the ODBC driver with Kibana's default sample data. The PR is not yet merged due to an infra bug that makes integration testing on Windows impossible.

We implemented support for Cloud ID. With this, a user will be able to simply use the string copied from the cloud admin interface to configure the connection to the cloud; the driver will then fill in the address of the Elasticsearch cluster, the port and connection security settings.
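
To illustrate what "filling in the address, port and security settings" could look like, here is a minimal Python sketch. It assumes a Cloud ID has the form "<label>:<base64 payload>" and that the payload decodes to "<domain>[:port]$<elasticsearch id>$<kibana id>". The names decode_cloud_id and CloudEndpoint, the default port 443 and the example values are hypothetical; this is not the ODBC driver's actual code.

```python
import base64
from typing import NamedTuple


class CloudEndpoint(NamedTuple):
    """Connection settings derived from a Cloud ID (hypothetical type)."""
    host: str
    port: int
    use_tls: bool


def decode_cloud_id(cloud_id: str) -> CloudEndpoint:
    """Decode a Cloud ID of the assumed form '<label>:<base64 payload>'."""
    # The label before the first colon is only a human-readable name.
    _, _, payload = cloud_id.partition(":")
    if not payload:
        raise ValueError("Cloud ID must look like '<label>:<base64>'")
    decoded = base64.b64decode(payload).decode("utf-8")
    # Assumed payload layout: '<domain>[:port]$<es id>$<kibana id>'.
    parts = decoded.split("$")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError("Cloud ID payload is missing the domain or cluster id")
    domain, _, port = parts[0].partition(":")
    return CloudEndpoint(
        host=f"{parts[1]}.{domain}",
        port=int(port) if port else 443,  # assumption: HTTPS on 443 by default
        use_tls=True,                     # assumption: cloud endpoints use TLS
    )


# Build a made-up Cloud ID so the example is self-contained.
example_payload = base64.b64encode(b"example.test:9243$abc123$def456").decode("ascii")
example = decode_cloud_id(f"my-deployment:{example_payload}")
```

With the made-up input above, example.host is "abc123.example.test" and example.port is 9243.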

We fixed two issues regarding scripts, namely wrapping of ZonedDateTime and handling of NamedWriteable.

We worked on introducing the SQL TIME data type, which eliminates the date info from a timestamp and enables users to do various calculations with it, e.g.:

SELECT * FROM t WHERE CAST(my_date AS TIME) < CAST('10:00:00' AS TIME)

SELECT count(*), CAST(my_date AS TIME) as time FROM t GROUP BY time
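
For readers who want to see the semantics of the two queries outside of SQL, the following Python sketch mimics them on a small in-memory table. The rows and the helper as_time are made up for illustration; the sketch only shows the idea of dropping the date part before comparing or grouping, not how Elasticsearch SQL implements TIME.

```python
from collections import Counter
from datetime import datetime, time

# Made-up rows standing in for table t with a timestamp column my_date.
rows = [
    {"my_date": datetime(2019, 3, 11, 9, 30, 0)},
    {"my_date": datetime(2019, 3, 12, 9, 30, 0)},
    {"my_date": datetime(2019, 3, 13, 14, 15, 0)},
]


def as_time(timestamp: datetime) -> time:
    """Rough analogue of CAST(my_date AS TIME): keep only the time of day."""
    return timestamp.time()


# Analogue of: WHERE CAST(my_date AS TIME) < CAST('10:00:00' AS TIME)
cutoff = time.fromisoformat("10:00:00")
before_ten = [row for row in rows if as_time(row["my_date"]) < cutoff]

# Analogue of: SELECT count(*), CAST(my_date AS TIME) AS time ... GROUP BY time
counts_by_time = Counter(as_time(row["my_date"]) for row in rows)
```

Two of the three rows fall before 10:00:00, and the two 09:30:00 rows end up in the same group even though they are on different days.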

While implementing this, we also discovered and fixed a bug regarding timezones and java.sql.Date objects in the JDBC driver.

We opened a PR for adding the ST_Distance function (e.g. the distance between two points). We also fixed a bug in handling timestamps in aggregations that we stumbled upon while working on ST_Distance.
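
As a rough idea of what a distance between two geo points involves, here is a Python sketch using the haversine formula. This is a swapped-in textbook technique, not necessarily what the SQL function computes; the function name haversine_distance and the Earth radius constant are assumptions of this example.

```python
import math

# Mean Earth radius in meters; an assumption of this sketch.
EARTH_RADIUS_METERS = 6_371_008.8


def haversine_distance(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding errors for near-antipodal points.
    return 2 * EARTH_RADIUS_METERS * math.asin(math.sqrt(min(1.0, a)))


# One degree of longitude along the equator.
one_degree = haversine_distance(0.0, 0.0, 0.0, 1.0)
```

The haversine formula treats the Earth as a sphere, so results differ slightly from ellipsoidal calculations.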

Test Test Test!!

We continued work to add tests to our apps. This week the effort was focused on adding API integration tests for rollups. We also have a PR open that moves testbed utils to a common folder, x-pack/test_utils, so that other apps can use them. We've also been doing a lot of work testing upgrades of the stack to 7.0.

Cross-Cluster Search

We opened a PR that concludes the work on minimizing round-trips for CCS requests. This is the last step mentioned in the meta issue, about comparing output from the two "execution modes". The new integration tests already caught a couple of issues: the error upper bound may be wrong when performing incremental reductions, and dfs_query_then_fetch gives different scoring when minimizing round-trips. Also, top_hits does different tie-breaking, hence results are sorted differently when top hits have the same score. Luckily these are all minor things, yet good to know and to think about addressing.
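
The tie-breaking difference can be pictured with a toy Python model. It assumes, purely for illustration, that a single-step reduction breaks ties by shard and doc id, while a two-step reduction first reduces per cluster and then breaks ties by cluster alias. The Hit type, both functions and the tie-break keys are hypothetical and do not reflect Elasticsearch's actual rules.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    score: float
    cluster: str
    shard: int
    doc: int


def reduce_all_at_once(hits: List[Hit], size: int) -> List[Hit]:
    """Coordinator merges every shard result in one step (toy tie-break: shard, doc)."""
    return sorted(hits, key=lambda h: (-h.score, h.shard, h.doc))[:size]


def reduce_per_cluster(hits: List[Hit], size: int) -> List[Hit]:
    """Each cluster reduces locally, then the coordinator merges the partial results."""
    partial: List[Hit] = []
    for cluster in sorted({h.cluster for h in hits}):
        local = [h for h in hits if h.cluster == cluster]
        partial.extend(sorted(local, key=lambda h: (-h.score, h.shard, h.doc))[:size])
    # Toy tie-break for the final merge: cluster alias first, then shard and doc.
    return sorted(partial, key=lambda h: (-h.score, h.cluster, h.shard, h.doc))[:size]


hits = [
    Hit(score=1.0, cluster="a", shard=1, doc=5),
    Hit(score=1.0, cluster="b", shard=0, doc=7),
    Hit(score=2.0, cluster="a", shard=0, doc=3),
]
one_step = reduce_all_at_once(hits, size=3)
two_step = reduce_per_cluster(hits, size=3)
```

Both modes return the same set of hits and agree on the top hit, but the two hits with score 1.0 come back in a different order.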

Doc Values for Geo Shapes

Now that BKD-based geo shapes are in, we want to start working on adding doc values to them. This will, in short, allow shapes to be aggregated upon (and there are lots of neat aggregations you could do on shapes). We started working on a POC of doc value support. The POC uses existing Lucene data structures, but the final implementation will use more custom and serialization-friendly data structures for ES. Now that the POC stage is done, the next step is to work on the actual implementation. We intend to start a new feature branch to contain the first release, which will bring parity with the geo-point aggregations. This will start with the new serializable edge-tree.

Cross-Cluster Replication

We added documentation for the CCR index following lifecycle, which for example includes instructions on how to fix a follower index that has fallen too far behind its leader index.

We fixed a rolling upgrade issue between 5.6 and 6.7 where the retention leases background sync was returning an illegal checkpoint.

Nightly Benchmarks for ML

We started working on setting up ML jobs and alerts for our nightly results to automatically help us detect anomalies. This important automation will make it much easier to catch any ML-related performance regressions.

Types Removal

We worked on two changes related to the types removal effort:

  • The monitoring bulk API no longer emits deprecation warnings when a _type is provided. This might look surprising, but the monitoring bulk API reuses the parsing logic of the bulk API while using the _type in a way that has nothing to do with mapping types, so the warning was not relevant.
  • The percolate query no longer requires a type when percolating an existing document.
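
As an illustration of the second change, this Python sketch builds a percolate request body that points at an existing document without naming a type. The parameter names field, index and id follow our reading of the percolate query documentation; the helper name and the example values "query", "my-index" and "1" are made up.

```python
import json


def percolate_existing_document(field: str, index: str, doc_id: str) -> dict:
    """Build a search body that percolates a document already stored in an index."""
    return {
        "query": {
            "percolate": {
                "field": field,   # the percolator field holding the stored queries
                "index": index,   # index containing the document to percolate
                "id": doc_id,     # id of that document; no "type" entry is given
            }
        }
    }


body = percolate_existing_document("query", "my-index", "1")
body_json = json.dumps(body, indent=2)
```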

Queryable Object Fields

We resumed the work on the queryable object field. We added some short-circuiting logic to prevent slow lookups on non-existent fields. We also started to work on adding doc values support for this field so that we can support sorting and simple aggregations like "terms" that make sense for keyword-style fields.

Lucene

Major Release!

The Lucene release vote for 8.0.0 has passed and we are working on the next steps to wrap up this major release. The artifacts are already published and are currently replicating to the different mirrors. We are planning to publicly announce the release on Friday, March 15th. Elasticsearch 7 and higher have already been upgraded to the release artifacts.

Analysis

We merged a patch that adds the ability to de-boost specific terms in the SynonymQuery. We plan to use this new feature in Elasticsearch to ensure that the original term gets a better score than its synonym expansions.

We merged a community PR that changes how we match words in the Korean user dictionary. Instead of adding all the matches at each offset, we now select the longest match and discard the others. This allows adding compound terms to the user dictionary along with their de-compounded forms, while ensuring that the compound term is prioritized.

We've worked on a possible solution to stop-words in the token graph, which surfaced a bug in FixedShingleFilter.
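
The longest-match selection for the Korean user dictionary can be sketched in a few lines of Python. This is a deliberately simplified toy: the function names are hypothetical, the dictionary is a plain set of strings, and the greedy loop in segment stands in for the real tokenizer's lattice, which this sketch does not model.

```python
from typing import List, Optional, Set


def longest_match_at(text: str, offset: int, user_dictionary: Set[str]) -> Optional[str]:
    """Return the longest dictionary entry starting at offset, or None if there is none."""
    best: Optional[str] = None
    for entry in user_dictionary:
        # Skip empty entries; keep only the longest entry that matches here.
        if entry and text.startswith(entry, offset):
            if best is None or len(entry) > len(best):
                best = entry
    return best


def segment(text: str, user_dictionary: Set[str]) -> List[str]:
    """Greedy toy segmentation that keeps one (the longest) match per offset."""
    tokens: List[str] = []
    offset = 0
    while offset < len(text):
        match = longest_match_at(text, offset, user_dictionary)
        if match is None:
            offset += 1  # a real tokenizer would fall back to its system dictionary
        else:
            tokens.append(match)
            offset += len(match)
    return tokens


# A compound term listed together with its de-compounded forms.
with_compound = segment("세종시", {"세종", "시", "세종시"})
without_compound = segment("세종시", {"세종", "시"})
```

With the compound in the dictionary the result is the single token "세종시"; without it the text is split into "세종" and "시".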

Query Visitor API

A group of committers are iterating on the final API to allow efficiently walking query trees. We seem to be close and expect this to land in the next couple of days. Now the discussion about back-porting and backwards compatibility has started.
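
For readers unfamiliar with the pattern, here is a small Python sketch of walking a query tree with a visitor. Since the Lucene API was still under discussion, every name here (TermQuery, BooleanQuery, visit, visit_leaf, TermCollector) is a hypothetical stand-in chosen for the example, not the API that will land.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TermQuery:
    """Leaf of the toy query tree."""
    field: str
    text: str

    def visit(self, visitor) -> None:
        visitor.visit_leaf(self)


@dataclass
class BooleanQuery:
    """Inner node that simply forwards the visitor to its children."""
    clauses: list

    def visit(self, visitor) -> None:
        for child in self.clauses:
            child.visit(visitor)


class TermCollector:
    """Visitor that gathers every (field, text) pair found in the tree."""

    def __init__(self) -> None:
        self.terms: List[Tuple[str, str]] = []

    def visit_leaf(self, query: TermQuery) -> None:
        self.terms.append((query.field, query.text))


tree = BooleanQuery([
    TermQuery("title", "lucene"),
    BooleanQuery([TermQuery("body", "search"), TermQuery("body", "index")]),
])
collector = TermCollector()
tree.visit(collector)
```

The benefit of the pattern is that callers can extract information such as terms without each query type needing a dedicated extraction method.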

Geo

We fixed a bug where we sometimes missed the intersection between shapes when the crossing happens at a segment point. We also added tests to the Line2D implementation, which is the object containing the spatial logic for lines. The support for Contains seems to be in good shape and, given the progress, we expect it to land by the end of next week.
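
To show why crossings at a segment point are easy to miss, here is a Python sketch of the classic orientation-based segment intersection test, with the collinear and touching cases handled explicitly. It is a textbook technique on plain coordinate tuples, ignores floating-point robustness, and is not the Line2D implementation.

```python
from typing import Tuple

Point = Tuple[float, float]


def orientation(p: Point, q: Point, r: Point) -> int:
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0 if collinear."""
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (cross > 0) - (cross < 0)


def on_segment(p: Point, q: Point, r: Point) -> bool:
    """Assuming r is collinear with p and q, check that r lies between them."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(a: Point, b: Point, c: Point, d: Point) -> bool:
    """True if segment ab and segment cd share at least one point."""
    o1, o2 = orientation(a, b, c), orientation(a, b, d)
    o3, o4 = orientation(c, d, a), orientation(c, d, b)
    # General case: the endpoints of each segment lie on different sides of the other.
    if o1 != o2 and o3 != o4:
        return True
    # Special cases: an endpoint lies exactly on the other segment.
    if o1 == 0 and on_segment(a, b, c):
        return True
    if o2 == 0 and on_segment(a, b, d):
        return True
    if o3 == 0 and on_segment(c, d, a):
        return True
    if o4 == 0 and on_segment(c, d, b):
        return True
    return False


# The second segment starts exactly on the first one.
touching = segments_intersect((0, 0), (2, 0), (1, 0), (1, 5))
```

A test that only checks for strictly opposite orientations would report no intersection for the touching case above.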

Other Changes

We opened a PR to allow more fine-grained control over term dictionaries being RAM-resident or read off-heap. The change seems controversial but hasn't seen many comments or iterations yet.

We are working on fixing a bug in ValueSource that can cause IndexSearcher references to leak.

We are still trying to figure out what we should do in case of an exception during IndexWriter#commit and friends.

Blog

Toke Eskildsen, a long-term member of the Lucene community, published a blog post about the changes he made to the codec in order to speed up advancing on sparse doc values.

Changes

Changes in Elasticsearch

Changes in 7.1:

  • Add a Painless Context REST API #39382
  • Fix not Recognizing Disabled Object Mapper #39862
  • Handle UTF-8 values in the keystore #39496
  • GEO: Add support for z values to libs/geo classes #38921
  • Ensure sendBatch not called recursively #39988
  • Add flag to declare token filters as updateable #36103
  • Avoid copying the field alias lookup structure unnecessarily. #39726
  • Only connect to new nodes on new cluster state #39629
  • Log missing file exception when failing to read metadata snapshot #32920
  • Do not swallow exceptions in TimedRunnable #39856

Changes in 7.0:

  • CCS: Disable minimizing round-trips when dfs is requested #40044
  • Some elasticsearch-cli tools could not be run not from ES_HOME #39937
  • Fix node tool cleanup #39389
  • Upgrade to Lucene 8.0.0 #39992
  • Do not log unsuccessful join attempt each time #39756
  • Make the type parameter optional when percolating existing documents. #39987
  • Add client jar for transport-nio #39860
  • Change zone formatting for all printers #39568
  • Remove types from internal monitoring templates and bump to api 7 #39888
  • [DOCS] Fixes breaking changes for low level client #39930

Changes in 6.7:

  • BREAKING: Stop returning cluster state size by default #40016
  • Upgrade the bouncycastle dependency to 1.61 #40017
  • Deprecation check for indices with very large numbers of fields #39869
  • Enforce retention leases require soft deletes #39922
  • plugins/repository-gcs: Update google-cloud-storage/core to 1.59.0 #39748

Changes in 6.6:

  • SQL: Fix bug with JDBC timezone setting and DATE type #39978
  • SQL: Extend the multi dot field notation extraction to lists of values #39823
  • SQL: Wrap ZonedDateTime parameters inside scripts #39911
  • SQL: ConstantProcessor can now handle NamedWriteable #39876

Changes in Elasticsearch Management UI

Changes in 7.1:

  • [Rollup] Redux integration test job creation #32223
  • [Rollup] Api integration test wildcard + search #32884
  • [Rollup] API integration tests #32780
  • Instrument Index Management with user action telemetry #32595
  • Remove license restrictions from License Management. #33046
  • Instrument Rollups CRUD app with user action telemetry #32534
  • Fix 'Create Rollup Index Pattern' button badge color error #32954
  • remove rollup section in advanced settings for OSS #32814

Changes in Elasticsearch SQL ODBC Driver

Changes in 7.1:

  • Add support for Cloud ID #125

Changes in 6.6:

  • Add dependencies reporting batch #126