This Week in Elasticsearch and Apache Lucene - 2018-08-10

Elasticsearch

Highlights

Stability: We fixed several leaks caused by third-party code. We upgraded to Netty 4.1.28, which fixes a leak in its SSL implementation, and upgraded Log4J to fix a memory leak. We will ship the Log4J fix with 5.6.11 and 6.4.0.

Watcher: We migrated Watcher to the PagerDuty v2 API. Their v1 events API can still be used, but we want to upgrade early. We also integrated BulkProcessor with Watcher to batch Watcher's document operations when users run lots of watches.

Improvements to Exception information for search failures: We have fixed a few bugs that were uncovered whilst working on Cross Cluster Search. The first fix makes sure that we preserve the cluster alias when we get a failure in a cross cluster search. This means that if failures happen on different remote clusters that share an index name, the user can see which cluster each failure happened on. The second fixes QueryShardException so that we report the index_uuid of the shard that failed alongside the other shard information.
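To illustrate why the cluster alias matters, here is a minimal Python sketch. The `ShardFailure` tuple and the helper names are hypothetical and are not Elasticsearch classes; the only thing borrowed from Elasticsearch is the `cluster_alias:index` naming that cross cluster search uses for remote indices.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class ShardFailure(NamedTuple):
    """Hypothetical, simplified view of a shard failure."""
    cluster_alias: Optional[str]  # None stands for the local cluster
    index: str
    shard: int
    reason: str


def qualified_index(failure):
    # Remote indices are addressed as "cluster_alias:index".
    if failure.cluster_alias:
        return f"{failure.cluster_alias}:{failure.index}"
    return failure.index


def group_failures(failures):
    # Group failures by fully qualified index name.
    grouped = defaultdict(list)
    for failure in failures:
        grouped[qualified_index(failure)].append(failure)
    return dict(grouped)


failures = [
    ShardFailure("cluster_one", "logs", 0, "query_shard_exception"),
    ShardFailure("cluster_two", "logs", 0, "query_shard_exception"),
    ShardFailure(None, "logs", 1, "query_shard_exception"),
]
grouped = group_failures(failures)
print(sorted(grouped))
```

If the alias were dropped, all three failures would be reported against `logs` and could not be told apart.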

Cross Cluster Replication: We’ve added support for enabling Metricbeat (system metrics) collection on all launched nodes. We have updated Rally's ccr-stats telemetry device to comply with the changes in the CCR stats API and set it as the default for our benchmarks.

Mapping, Bulk and retries: Users have reported replica shards with huge translog files. The cause was missing sequence numbers, that is, operations that failed to replicate to the replicas, which made the local checkpoint on these replicas fall behind. Since the translog is supposed to retain all operations above the local checkpoint, it became very big. The source of those missing sequence numbers turned out to be a mistake in how we handled timeouts while requesting the master to update the index mapping. In short, the entire bulk request was re-executed, so the sequence numbers issued up to the point where the dynamic mapping update was required were lost. This was fixed in the 6.3.0 release of Elasticsearch with a targeted solution, but we wanted a more structural way to avoid retries altogether. That refactoring was completed this week. The gist of the refactoring is that we can now pause execution and continue from where we left off.
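To see why a single missing sequence number can make the translog balloon, consider this small Python sketch. It assumes the local checkpoint is the highest sequence number for which every lower sequence number has also been processed, and that everything above it is retained, as described above. The function names are illustrative and this is not how Elasticsearch implements its checkpoint tracking.

```python
def local_checkpoint(processed):
    """Highest seq no such that all seq nos at or below it were processed.

    Returns -1 when sequence number 0 has not been processed yet.
    """
    checkpoint = -1
    while checkpoint + 1 in processed:
        checkpoint += 1
    return checkpoint


def retained_ops(processed):
    # Operations above the local checkpoint stay in the translog.
    checkpoint = local_checkpoint(processed)
    return sorted(s for s in processed if s > checkpoint)


healthy = set(range(10))        # seq nos 0..9 all arrived
gappy = set(range(10)) - {3}    # seq no 3 never reached this replica

print(local_checkpoint(healthy), retained_ops(healthy))
print(local_checkpoint(gappy), retained_ops(gappy))
```

In the second case the checkpoint is stuck at 2 no matter how many later operations arrive, so the set of retained operations keeps growing.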

This refactoring may sound like a very technical internal thing (and it is!), but it is exciting because it opens the door to tackling another issue with mapping updates. When an indexing operation requires a mapping update, it currently has to wait for *all* nodes in the cluster to process that change. This is overkill, as only the nodes holding the primary and its replicas actually need that mapping to complete the operation. It's a shame if an overloaded cold node keeps a hot node from processing an operation. With the above refactoring, we now have the underpinnings to untangle this dependency.
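The "pause and continue" idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `BulkExecution` class, its fields, and the way sequence numbers are handed out are all hypothetical and do not mirror the actual Elasticsearch code.

```python
class BulkExecution:
    """Hypothetical resumable bulk execution (not the Elasticsearch class)."""

    def __init__(self, items, known_fields):
        self.items = items                    # each item is a dict of fields
        self.known_fields = set(known_fields)
        self.cursor = 0                       # index of the next item to run
        self.next_seq_no = 0
        self.assigned = []                    # (seq_no, item) pairs

    def run(self):
        """Process items until done or until one needs a mapping update.

        Returns the set of unmapped fields when paused, or None when done.
        """
        while self.cursor < len(self.items):
            item = self.items[self.cursor]
            missing = set(item) - self.known_fields
            if missing:
                return missing                # pause; cursor stays on this item
            self.assigned.append((self.next_seq_no, item))
            self.next_seq_no += 1
            self.cursor += 1
        return None

    def mapping_updated(self, fields):
        # Called once the mapping change has been applied.
        self.known_fields |= set(fields)


items = [{"a": 1}, {"a": 2}, {"a": 3, "b": 4}, {"a": 5}]
execution = BulkExecution(items, known_fields={"a"})
pending = execution.run()            # pauses at the third item
execution.mapping_updated(pending)
finished = execution.run()           # resumes at the third item
seq_nos = [seq_no for seq_no, _ in execution.assigned]
print(pending, seq_nos)
```

Because the cursor survives the pause, the first two items are not executed again and no sequence number is issued twice or dropped.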

Changes

Changes in 5.6:

  • LOGGING: Upgrade to Log4J 2.11.1 #32675
  • Fix content type detection with leading whitespace #32632
  • LOGGING: Upgrade to Log4J 2.11.1 #32616

Changes in 6.4:

  • LOGGING: Upgrade to Log4J 2.11.1 (#32616) #32668
  • Preserve index_uuid when creating QueryShardException #32677
  • Make sure that field collapsing supports field aliases. #32648
  • Add temporary directory cleanup workarounds #32615
  • Networking+Testing: Fix Netty ByteBuf Leaks in Test Code #32638
  • Cross-cluster search: preserve cluster alias in shard failures #32608
  • [Rollup] Improve ID scheme for rollup documents #32558
  • HLRC: Move commercial clients from XPackClient #32596
  • Fix race between replica reset and primary promotion #32442

Changes in 6.5:

  • Fix role query that can match nested documents #32705
  • Add expected mapping type to MapperException #31564
  • SQL: Bug fix for the optional "start" parameter usage inside LOCATE function #32576
  • SQL: Ignore H2 comparative tests for uppercasing/lowercasing string functions #32604
  • Upgrade to Lucene-7.5.0-snapshot-13b9e28f9d #32730
  • Whitelisting / from Circuit Breaker Exception (#32325) #32666
  • LOGGING: Upgrade to Log4J 2.11.1 (#32616) #32656
  • Core: Fix Java Time DateFormatter printers #32592
  • BREAKING: Switch WritePipelineResponse to AcknowledgedResponse #32722
  • TESTS: Explicitly Fail Http Client Timeouts #32708
  • Prevent cause from being null in ShardOperationFailedException #32640
  • CORE: Upgrade to Jackson 2.8.11 #32670
  • Expose whether or not the global checkpoint updated #32659
  • Include translog path in error message when translog is corrupted #32251
  • Verify primary mode usage with assertions #32667
  • Ignore script fields when size is 0 #31917
  • Tests: Fix Typo Causing Flaky Settings Test #32665
  • Docs: Allow snippets to have line continuation #32649
  • INGEST: Fix ThreadWatchDog Throwing on Shutdown #32578
  • Rest HL client: Add get license action #32438
  • Adds ckb to the list of unsupported languages #32611
  • Suppress LicensingDocumentationIT.testPutLicense in release builds #32613
  • Add cluster UUID to Cluster Stats API response #32206

Changes in 7.0:

  • serialize suggestion responses as named writeables #30284

Apache Lucene

Faster queries when not counting hits

Nightly benchmarks were fixed, so you can now see how much it helps to skip counting total hits when computing top hits. This especially helps term queries and disjunctions, though conjunctions and phrase queries got an interesting speedup as well. On the other hand, disjunctions within conjunctions slowed down, which we are investigating.

We improved collection of top hits with dis-max queries and boolean queries that contain a mix of SHOULD and MUST clauses.
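For intuition on why skipping the total hit count helps, here is a deliberately simplified Python sketch. It assumes postings are split into blocks that each carry a precomputed maximum score, so a block can be skipped once no document in it could enter the top k. This is not Lucene's implementation, which is considerably more elaborate; all names below are made up for the example.

```python
import heapq


def make_blocks(postings, block_size):
    # Pair each chunk of (doc_id, score) postings with its maximum score.
    blocks = []
    for i in range(0, len(postings), block_size):
        chunk = postings[i:i + block_size]
        blocks.append((max(score for _, score in chunk), chunk))
    return blocks


def top_k(blocks, k, count_total):
    heap = []      # min-heap of (score, -doc_id); ties evict the higher doc id
    visited = 0    # number of documents actually looked at
    for block_max, docs in blocks:
        if not count_total and len(heap) == k and block_max <= heap[0][0]:
            continue  # nothing in this block can beat the current top k
        for doc_id, score in docs:
            visited += 1
            if len(heap) < k:
                heapq.heappush(heap, (score, -doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, -doc_id))
    hits = sorted(((-neg_doc, score) for score, neg_doc in heap),
                  key=lambda hit: (-hit[1], hit[0]))
    total = visited if count_total else None
    return hits, total, visited


scores = [5.0, 1.0, 4.0, 0.5, 0.2, 0.1, 0.3, 0.4, 4.5, 0.6, 0.7, 0.8]
blocks = make_blocks(list(enumerate(scores)), block_size=4)

fast_hits, fast_total, fast_visited = top_k(blocks, 2, count_total=False)
exact_hits, exact_total, exact_visited = top_k(blocks, 2, count_total=True)
print(fast_hits, fast_visited, exact_visited)
```

Both calls return the same top hits, but the run without counting visits 8 documents instead of 12 because the middle block is skipped. When the total must be exact, every matching document has to be visited.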

Lucene 8.0

In the coming weeks, we will start an effort to upgrade the master branch of Elasticsearch to a Lucene 8.0 snapshot so that we can validate that changes in Lucene 8 play well with the way that Elasticsearch leverages Lucene.

Other