This Week in Elasticsearch and Apache Lucene - 2018-12-09

Elasticsearch

Elasticsearch Docker announcement

Starting with version 6.6.0 of Elasticsearch, our Docker image build will be located within the main product repository (i.e., github.com/elastic/elasticsearch). This means that bug reports and feature requests for the Elasticsearch Docker image will also need to go to the main Elasticsearch repository. Shortly after the cutover, the existing repository will be retired.

Remote Clusters UI

We have finished the full CRUD UI to support managing remote clusters. This will allow users to configure remote clusters to be available for both cross-cluster search and cross-cluster replication. In doing so, we also updated the internal navigation to the new K7 breadcrumb style.

Better validation of analyzer with synonyms

We added a way to fail early when an analysis chain contains a filter that is not compatible with synonyms. The synonym filter that we use cannot handle stacked tokens (tokens that appear at the same position), so any filter that produces such tokens throws an error when it is used to build synonyms. With the new validation we can tell users which filter is responsible for the failure, and we report an error that explains why the filter should be moved. This is an important step towards making synonyms easier to use in our analysis chain, but we have a lot more to do to improve the overall experience.
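To make the idea of "stacked tokens" concrete, here is a minimal sketch in Python. It is not how Elasticsearch implements the check: the `Token` class, the toy filters, and the function names are all hypothetical and only model the behaviour described above. A token with a position increment of 0 sits at the same position as the previous token, and the validator names the first filter in the chain that produces one.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Token:
    term: str
    # 1 = next position, 0 = stacked on the previous token's position
    position_increment: int = 1


# Hypothetical model: a filter is a named function from tokens to tokens.
Filter = Callable[[List[Token]], List[Token]]


def lowercase(tokens: List[Token]) -> List[Token]:
    return [Token(t.term.lower(), t.position_increment) for t in tokens]


def keep_original_and_stem(tokens: List[Token]) -> List[Token]:
    # Toy filter: emits a crude "stem" stacked on top of the original token.
    out: List[Token] = []
    for t in tokens:
        out.append(t)
        if t.term.endswith("s"):
            out.append(Token(t.term[:-1], position_increment=0))
    return out


def first_stacking_filter(
    chain: List[Tuple[str, Filter]], sample: List[Token]
) -> Optional[str]:
    """Return the name of the first filter that emits stacked tokens, if any."""
    tokens = list(sample)
    for name, token_filter in chain:
        tokens = token_filter(tokens)
        if any(t.position_increment == 0 for t in tokens):
            return name
    return None


def validate_chain_for_synonyms(
    chain: List[Tuple[str, Filter]], sample: List[Token]
) -> None:
    """Fail early, naming the offending filter, instead of failing later."""
    culprit = first_stacking_filter(chain, sample)
    if culprit is not None:
        raise ValueError(
            f"Token filter [{culprit}] produces stacked tokens and cannot be "
            "used to build synonyms; move it after the synonym filter"
        )
```

The useful property is that the error names the filter, so the user knows what to move rather than only that the chain is invalid.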

Deprecation Info API

We are upgrading the deprecation info API to cover all 7.0 breaking changes, and we keep a running list of the breaking changes that the API will address. The deprecation info API shows users which of the features they currently use will be removed or changed in the next major version. The API evolves over time as we make breaking changes for the next major version, so to get complete coverage of the 7.0 breaking changes, users will need to upgrade to the final 6.x minor release.
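As a rough illustration of how the output of this API might be consumed, the sketch below summarizes an already-parsed response body. It assumes the response groups issues under `cluster_settings`, `node_settings`, and a per-index `index_settings` map, with each issue carrying a `level` and a `message`; verify that shape against the documentation for your version. The function name is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def summarize_deprecations(
    response: Dict[str, Any]
) -> Tuple[Dict[str, int], List[Tuple[str, str]]]:
    """Count issues per level and collect the critical ones.

    Assumes (not guaranteed) a response shaped like:
      {"cluster_settings": [issue, ...],
       "node_settings": [issue, ...],
       "index_settings": {"<index name>": [issue, ...]}}
    where each issue has at least "level" and "message".
    """
    counts: Counter = Counter()
    critical: List[Tuple[str, str]] = []

    def record(scope: str, issue: Dict[str, Any]) -> None:
        level = issue.get("level", "unknown")
        counts[level] += 1
        if level == "critical":
            critical.append((scope, issue.get("message", "")))

    for section in ("cluster_settings", "node_settings"):
        for issue in response.get(section, []):
            record(section, issue)

    for index_name, issues in response.get("index_settings", {}).items():
        for issue in issues:
            record(index_name, issue)

    return dict(counts), critical
```

Running a summary like this before an upgrade gives a quick view of which issues block the move to the next major version and which are only warnings.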

Hadoop + Kerberos

We are making great progress on supporting Kerberos in Elasticsearch-Hadoop, this week reaching the milestone of the ES-Hadoop test harness fully supporting Kerberos (except for Pig). This has been a massive undertaking, as evidenced by the current diff of the feature branch.

Zen2

Zen2, our new cluster coordination and consensus layer, was just merged to the master branch of Elasticsearch -- a huge milestone in this feature’s development. In order to get to this point, significant work had to be done to ensure this would not affect the stability of the build. Further development for Zen2 will occur directly in master. Many of our tests already run with Zen2, and we are quickly moving to migrate all tests to Zen2.

We integrated the new cluster state persistence layer into the Zen2 state recovery logic, allowing full-cluster restart tests to properly work now as well. This completes the work on the new persistence layer, with just a little bit more follow-up work to do on failure handling.

Snapshot/Restore reliability

We are working on improving the resiliency of snapshots, which will provide some relief for the case where snapshots get stuck. An initial fix, merged for 6.5.3, solves this for the most common scenarios, but a more complete fix is still some way off because it requires more complex structural changes to the snapshot/restore code. We also made the repository creation logic more resilient to failures that happen during repository creation.

Changelog

Changes in 6.5:

  • Remove license state listeners on closables #36308
  • add missing error type mapping for apm-server #36178
  • add missing error type mapping for apm-server monitoring #36273
  • Always configure soft-deletes field of IndexWriterConfig #36196
  • [CCR] AutoFollowCoordinator should tolerate that auto follow patterns may be removed #35945
  • Fix deprecation of audit log settings #36175
  • SNAPSHOT: Improve Resilience SnapshotShardService #36113
  • Fix error message when package install fails due to missing Java #36077

Changes in 6.6:

  • [ILM] fix ilm.remove_policy rest-spec #36165
  • initial cleanup of deprecation checks for 6.x #35326
  • BREAKING: [CCR] Change get autofollow patterns API response format #36203
  • Introduce Docker images build #36246
  • Build: Use explicit deps on test tasks for check #36325
  • ingest: support default pipeline through an alias #36231
  • Make credentials mandatory when launching xpack/migrate #36197
  • Correct doc reference tag #36262
  • HLRC: execute watch API #35868
  • Refactor AutoFollowCoordinator to track leader indices per remote cluster #36031
  • [HLRC] Added support for CCR Stats API #36213
  • Deprecation check for : in Cluster/Index name #36185
  • SNAPSHOT: Repo Creation out of ClusterStateTask #36157
  • Deprecate setting boost on inner span queries #36191
  • Combine the execution of an exclusive replica operation with primary term update #36116
  • Add DEBUG/TRACE logs for LDAP bind #36028
  • [HLRC] Added support for CCR Get Auto Follow Pattern apis #36049
  • BREAKING: Remove the distinction between query and filter context in QueryBuilders #35354
  • Added soft limit to open scroll contexts #25244 #36009
  • Enforce max_buckets limit only in the final reduction phase #36152
  • Fix smb-docker fixture when running with aufs #36105

Changes in 7.0:

  • [Zen2] Storage layer WriteStateException propagation #36052
  • Deprecate types in termvector and mtermvector requests. #36182
  • Introduce zen2 discovery type #36298
  • BREAKING: Make hits.total an object in the search response #35849
  • Make typeless APIs usable with indices whose type name is different from _doc #35790
  • Set Lucene version upon index creation. #36038
  • BREAKING: Remove the deprecated _termvector endpoint. #36131
  • Deprecate /_xpack/monitoring/* in favor of /_monitoring/* #36130
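Of the 7.0 changes above, the hits.total change is the one most likely to affect client code that reads search responses. Below is a minimal sketch of a helper that tolerates both shapes, assuming the new object form carries `value` and `relation` keys. The helper name is hypothetical and works on an already-parsed response body.

```python
from typing import Any, Dict, Tuple


def total_hits(search_response: Dict[str, Any]) -> Tuple[int, str]:
    """Return (count, relation) for both the old and the new response shape.

    Old shape:  {"hits": {"total": 42}}
    New shape:  {"hits": {"total": {"value": 42, "relation": "eq"}}}
    """
    total = search_response["hits"]["total"]
    if isinstance(total, dict):
        # New object form; the relation says whether the count is exact
        # ("eq") or a lower bound ("gte").
        return total["value"], total.get("relation", "eq")
    # Old form: a plain number, which was always an exact count.
    return total, "eq"
```

Returning the relation alongside the count keeps callers from treating a lower bound as an exact total.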

Lucene

Lucene 7.6

We’ve been working towards making the Solr build pass by marking the remaining failing tests as bad apples, and we hope to move forward with this soon.

Improved LatLongShapes

We are working on a new encoding for LatLongShapes, a new geo solution, which will bring important improvements for this feature. We are looking forward to getting this new encoding in next week.

Optimized DocValues updates data-structures

We added an optimized data structure that reduces the memory consumption of doc values updates by 70%-80% in the common case. In contrast to the previous implementation, the new data structure uses non-object-based storage. We also optimized the case where all updates share the same value, which is always the case when soft-deletes are updated.
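The two ideas in this paragraph can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not Lucene's implementation: the class and method names are hypothetical. Doc IDs live in a packed primitive array instead of one object per update, and the per-update value array is only allocated once two updates actually differ.

```python
from array import array
from typing import Optional


class NumericUpdates:
    """Toy buffer of (doc, value) updates backed by primitive arrays."""

    def __init__(self) -> None:
        self.docs = array("i")                 # packed 32-bit doc IDs
        self.values: Optional[array] = None    # allocated lazily
        self.shared_value: Optional[int] = None

    def add(self, doc: int, value: int) -> None:
        if len(self.docs) == 0:
            # First update: remember its value as the shared one.
            self.shared_value = value
        elif self.values is None and value != self.shared_value:
            # First differing value: materialize one slot per earlier update.
            self.values = array("q", [self.shared_value] * len(self.docs))
        self.docs.append(doc)
        if self.values is not None:
            self.values.append(value)

    def value_at(self, i: int) -> int:
        if self.values is None:
            return self.shared_value
        return self.values[i]

    def uses_shared_value(self) -> bool:
        return self.values is None
```

When every update carries the same value, as the paragraph notes is the case for soft-deletes, this buffer never allocates the value array at all.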

Other

  • We added a new GraphTokenFilter abstraction, which makes it easier for token filters that read ahead in the token stream (e.g., shingles or multi-word-matching synonyms) to follow graphs rather than assuming a linear token stream. FixedShingleFilter has been reworked to use this.
  • We also added a switch to WordDelimiterGraphFilter that allows you to leave the offsets of internal tokens unaltered. This is important if you are using non-whitespace-based tokenizers, as in certain cases WDGF can get confused about where prior token boundaries are and emit tokens with offsets that move backwards. Backwards offsets are forbidden in Lucene to allow effective compression of the postings lists.
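To illustrate the backwards-offsets constraint mentioned in the last item, here is a minimal Python sketch. It assumes the rule is that each token's start offset must not be smaller than the previous token's start offset, and that the end offset must not precede the start. The function names are hypothetical and this is not Lucene code. The second helper shows why the rule helps compression: start offsets that never decrease can be stored as small non-negative deltas.

```python
from typing import List, Optional, Tuple

Offsets = List[Tuple[int, int]]  # (start_offset, end_offset) per token


def first_backwards_offset(offsets: Offsets) -> Optional[int]:
    """Return the index of the first token whose offsets go backwards."""
    last_start = 0
    for i, (start, end) in enumerate(offsets):
        if start < last_start or end < start:
            return i
        last_start = start
    return None


def start_deltas(offsets: Offsets) -> List[int]:
    """Delta-encode start offsets; valid streams yield only values >= 0."""
    deltas = []
    previous = 0
    for start, _ in offsets:
        deltas.append(start - previous)
        previous = start
    return deltas
```

For example, a stream such as `[(0, 4), (5, 10), (5, 7), (3, 4)]` is rejected at index 3, because the last token starts before the one preceding it.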

Fixed Bugs

  • LUCENE-8586: Fixes a bug in the implementation of Intervals.or() that could lead to infinite looping