This Week in Elasticsearch and Apache Lucene - 2018-06-29

Elasticsearch

Highlights

Recovery issue when upgrading from 5.x to 6.3.0

After releasing 6.3.0 we received reports that upgrades from 5.x to 6.3.0 failed because replicas would not recover. The problem was caused by a refactoring that simplified our Engine. One affected item was a BWC bridge that upgrades the Lucene index to include 6.x markers, specifically the history uuid that is required for ops-based recovery. The BWC bridge was moved to the part of recovery that copies over files from another shard: as part of index preparation, we check and place the markers in the newly built Lucene index. However, shards that have been prepared for the upgrade using a Synced Flush command do not copy any files but use the existing files as is. This bypassed our BWC bridge, causing recovery to fail. It also uncovered a gap in our testing, as using synced flush is the recommended way to do upgrades. We fixed it for the coming 6.3.1 release.
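
The failure mode can be illustrated with a toy model. This is only a sketch of the reasoning above, not Elasticsearch's recovery code: `Shard`, `recover_replica`, and the `bridge_in_file_copy_only` flag are hypothetical names, and the "fixed" branch shows just one way the gap could be closed in this model.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Shard:
    """Toy stand-in for a 5.x replica shard copy (hypothetical, not an Elasticsearch class)."""
    synced_flushed: bool                 # prepared with a synced flush before the upgrade
    history_uuid: Optional[str] = None   # 6.x marker; absent in an index written by 5.x


def recover_replica(shard: Shard, bridge_in_file_copy_only: bool) -> bool:
    """Return True if the toy recovery succeeds."""
    if not shard.synced_flushed:
        # File-based recovery: files are copied from another shard, and the
        # BWC bridge stamps the marker while preparing the new index.
        shard.history_uuid = str(uuid.uuid4())
    elif not bridge_in_file_copy_only:
        # One possible fix in this model: also place the marker when the
        # existing files are reused as is.
        shard.history_uuid = str(uuid.uuid4())
    # Ops-based recovery needs the history uuid to be present.
    return shard.history_uuid is not None


# Bridge only in the file-copy phase: the synced-flushed shard fails to recover.
print(recover_replica(Shard(synced_flushed=True), bridge_in_file_copy_only=True))    # False
print(recover_replica(Shard(synced_flushed=False), bridge_in_file_copy_only=True))   # True
# Bridge applied on both paths: the synced-flushed shard recovers too.
print(recover_replica(Shard(synced_flushed=True), bridge_in_file_copy_only=False))   # True
```

The point of the model is that the recommended upgrade path (synced flush) is exactly the one that skipped the bridge, which is why tests exercising only file-based recovery did not catch it.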

JDK 11

    With JDK 11 GA around the corner (mid-to-late September), we are actively working on getting JDK 11 into CI. This has been an onerous effort for a few reasons:

      • no version of Gradle prior to Gradle 4.8 can run on JDK 11
      • Gradle 4.8 introduced several breaking changes in a minor release
      • dealing with JDK 11 breaking changes and test failures

      Our target is to get this into CI for the 6.x branch after the 6.4 branch is cut, in preparation for supporting 6.5 on JDK 11. The Gradle 4.8 work is complete, we have JDK 11 in CI on a feature branch, and the PR to enable JDK 11 is almost ready.

      Hadoop

      We finalized the error handling API, a heavily demanded feature. It lets the connector handle errors that occur while serializing records, as well as situations such as fields being missing or of the wrong type when extracting document metadata. Real-world data is messy, and this lets applications built on top of the connector be more selective in how they handle such situations.

      x-opaque-id header now appears in the query slowlog

      We have changed the query slowlog so that it includes the value of the x-opaque-id header if it is present. Users can add the header to identify queries in the slowlog by an id that means something to them and their application. This should help users and support more easily identify where slow queries originate.
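
      As a minimal sketch of how a client might tag its searches, the snippet below builds a search request carrying the header using only Python's standard library. The host, index name, and id value are made-up placeholders, and the request is built but not sent.

```python
import json
import urllib.request


def build_search_request(base_url: str, index: str, query: dict, opaque_id: str) -> urllib.request.Request:
    """Build (but do not send) a _search request tagged with an X-Opaque-Id header."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Any string that is meaningful to the calling application.
            "X-Opaque-Id": opaque_id,
        },
        method="POST",
    )


request = build_search_request(
    "http://localhost:9200",            # placeholder host
    "orders",                           # placeholder index name
    {"match_all": {}},
    opaque_id="checkout-service/request-1234",
)

# urllib stores header names capitalized as "X-opaque-id"; HTTP header
# names are case-insensitive, so the server sees the same header.
print(request.get_header("X-opaque-id"))
```

      Passing the request to `urllib.request.urlopen` would send it. If the query is slow enough to be logged, the id chosen here is what would appear in the slowlog entry.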

      FIPS

      We have merged configurable password hashing.

      Pull Requests

      Changes in 5.6:
      • Ingest Attachment: Upgrade Tika to 1.18 #31252
      Changes in 6.3:
      • Remove item listed in 6.3 notes that’s not in 6.3 #31623
      • Preserve thread context when connecting to remote cluster #31574
      • Add package pre-install check for java binary #31343
      • Watcher: Fix put watch action #31524
      • Fix missing historyUUID in peer recovery when rolling upgrade 5.x to 6.3 #31506
      Changes in 6.4:
      • Add Get Snapshots High Level REST API #31537
      • Add Create Snapshot to High-Level Rest Client #31215
      • Fix CreateSnapshotRequestTests Failure #31630
      • BREAKING Java: Configurable password hashing algorithm/cost #31234
      • Add rest highlevel explain API #31387
      • Add MultiSearchTemplate support to High Level Rest client #30836
      • Do not check for Azure container existence everytime an Azure object is accessed or modified #31617
      • Node selector per client rather than per request #31471
      • JDBC driver prepared statement set* methods #31494
      • Improve robustness of geo shape parser for malformed shapes #31449
      • Get Mapping API to honour allow_no_indices and ignore_unavailable #31507
      • REST high-level client: add simulate pipeline API #31158
      • Fix Mockito trying to mock IOException that isn’t thrown by method (#31433) #31527
      • fix writeIndex evaluation for aliases #31562
      • Add x-opaque-id to search slow logs #31539
      • turn GetFieldMappingsResponse to ToXContentObject #31544
      • Add get field mappings to High Level REST API Client #31423
      • IndexShard should not return null stats #31528
      • fix repository update with the same settings but different type #31458
      • Upgrade to Lucene 7.4.0. #31529
      • BREAKING Java: Allow multiple unicast host providers #31509
      • [PkiRealm] Invalidate cache on role mappings change #31510
      • [Security] Check auth scheme case insensitively #31490
      Changes in 7.0:
      • ingest: Add ignore_missing property to foreach filter (#22147) #31578

      Lucene

      Lucene 7.4

      Lucene 7.4 is now out and features a Korean analyzer, a new matches API, support for soft deletes, and detection of emoji sequences in ICUTokenizer. This is the Lucene version that Elasticsearch 6.4 will be based on.

      Force-merge now honors the maximum segment size

      Force-merge has caused a lot of trouble in the past. Say you have a 50GB index and merge it down to one 50GB segment with a force-merge. If your index then receives updates, you are in a terrible situation: because the merge policy avoids creating segments that are larger than the maximum segment size (5GB by default) through natural merges, this segment won't be considered for merging until at least 90% of the documents that it contains are marked as deleted. This is a lot of disk overhead, and there is no easy way to recover from it.

      To avoid this issue, TieredMergePolicy, the default merge policy, has been refactored so that force-merges may not create segments that are larger than the maximum segment size, just like natural merges. Force-merging read-write indices is still discouraged, but the consequences of doing so will be much less painful.
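
      The numbers in the example above can be worked through with a little arithmetic. This is a back-of-the-envelope sketch, not Lucene code: the helper names are made up, documents are assumed to be of uniform size, and the 90% figure is taken directly from the text rather than derived from the merge policy.

```python
GB = 1024 ** 3


def dead_bytes(segment_bytes: int, deleted_percent: int) -> int:
    """Bytes held by deleted documents, assuming uniformly sized documents."""
    return segment_bytes * deleted_percent // 100


def min_segments_after_force_merge(index_bytes: int, max_segment_bytes: int) -> int:
    """Fewest segments possible when no segment may exceed max_segment_bytes."""
    return -(-index_bytes // max_segment_bytes)  # integer ceiling division


index_size = 50 * GB
max_segment_size = 5 * GB

# Old behaviour: one 50GB segment that is only reconsidered at 90% deletes,
# so up to 45GB of deleted data can sit on disk before it is reclaimed.
print(dead_bytes(index_size, 90) // GB)                              # 45

# New behaviour: the force-merge stops at the maximum segment size,
# leaving at least 10 segments that natural merges can still pick up.
print(min_segments_after_force_merge(index_size, max_segment_size))  # 10
```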

      Other

      - Can Lucene benefit from GPU acceleration? Given the call overhead of JNI and the fact that GPUs don't like branches, our best guess is BooleanScorer (sometimes referred to as "BS1"), a specialized scorer for top-level disjunctions that scores windows of 2048 documents at once. Most other hot code paths in Lucene involve lots of branches or are called in tight loops, which makes them bad candidates to run on a GPU.

      - LatLonPoint should move to core. LatLonPoint is our best implementation for geo points (used by Elasticsearch since 5.x) but has remained in sandbox due to disagreement about whether it should go to lucene-spatial or lucene-core. We hope to fix this soon. This made David start a discussion about whether we should remove the spatial module, which is nearly empty.

      - The list of English stopwords is moving from StandardAnalyzer to EnglishAnalyzer, making English a language like any other to Lucene.

      - Sparse doc values would benefit from better indexing.
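
      To make the GPU item above more concrete, here is a rough sketch of window-at-a-time disjunction scoring, the general idea attributed to BooleanScorer. It is a simplified illustration and not Lucene's implementation: the postings representation, the function name, and the way scores are summed are all assumptions made for the example.

```python
from typing import List, Tuple

WINDOW_SIZE = 2048  # window size mentioned in the text

# Toy postings list for one clause: (doc_id, score) pairs sorted by doc_id.
Postings = List[Tuple[int, float]]


def score_disjunction(clauses: List[Postings], max_doc: int) -> List[Tuple[int, float]]:
    """Score an OR of clauses one window of doc ids at a time.

    Returns (doc_id, summed score) for every doc matching at least one clause.
    All doc ids are assumed to be smaller than max_doc.
    """
    hits: List[Tuple[int, float]] = []
    positions = [0] * len(clauses)  # read cursor into each clause's postings
    for window_start in range(0, max_doc, WINDOW_SIZE):
        window_end = min(window_start + WINDOW_SIZE, max_doc)
        scores = [0.0] * WINDOW_SIZE
        matched = [False] * WINDOW_SIZE
        # Pass 1: each clause adds its scores into the window's buckets.
        for i, clause in enumerate(clauses):
            pos = positions[i]
            while pos < len(clause) and clause[pos][0] < window_end:
                doc_id, score = clause[pos]
                slot = doc_id - window_start
                scores[slot] += score
                matched[slot] = True
                pos += 1
            positions[i] = pos
        # Pass 2: collect the matching docs of this window in doc id order.
        for slot in range(window_end - window_start):
            if matched[slot]:
                hits.append((window_start + slot, scores[slot]))
    return hits


clauses = [
    [(1, 1.0), (3000, 0.5)],
    [(1, 2.0), (2047, 1.5), (2048, 0.25)],
]
print(score_disjunction(clauses, max_doc=4096))
# [(1, 3.0), (2047, 1.5), (2048, 0.25), (3000, 0.5)]
```

      The accumulation pass works on fixed-size arrays with simple additions, which is the kind of regular, batch-oriented work the item above suggests could amortize JNI call overhead.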