This Week in Elasticsearch and Apache Lucene - 2019-05-31

Elasticsearch Highlights

Deprecation Info for Joda to Java Time Migration

We have been implementing a deprecation info API for Joda-to-java.time date format changes that are not compatible from 7.0 on. These notifications will inform users that they need to update their mappings and/or pipelines if they contain incompatible patterns.
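
As a rough illustration of the kind of check involved, the sketch below scans a date pattern for letters whose meaning differs between Joda and java.time. It is not the deprecation info API itself: the function name and the letter table are assumptions made for this example, and the table is deliberately incomplete.

```python
# Illustrative table only (an assumption for this sketch, not the API's list):
# pattern letters that mean different things in Joda and in java.time.
JODA_INCOMPATIBLE = {
    "Y": "year-of-era in Joda, week-based-year in java.time",
    "y": "year in Joda, year-of-era in java.time",
    "C": "century-of-era in Joda, no equivalent letter in java.time",
    "x": "weekyear in Joda, zone offset in java.time",
}


def find_incompatible_letters(pattern):
    """Return the flagged letters used in `pattern`, in order of first use.

    Text inside single quotes is a literal in both libraries, so it is skipped.
    """
    found = []
    in_literal = False
    for ch in pattern:
        if ch == "'":
            # A quote opens or closes a literal; an escaped quote ('')
            # toggles twice and leaves the state unchanged.
            in_literal = not in_literal
            continue
        if not in_literal and ch in JODA_INCOMPATIBLE and ch not in found:
            found.append(ch)
    return found


if __name__ == "__main__":
    for letter in find_incompatible_letters("YYYY-MM-dd'T'HH:mm:ss"):
        print(letter, "->", JODA_INCOMPATIBLE[letter])
```

A real check would also have to walk mappings and ingest pipelines to find the patterns in the first place; this sketch only covers the pattern inspection step.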

Enrich

We are working on adding an enrich cluster privilege for managing enrich policies and executing enrich policies. (#42677)

We improved the enrich processor by storing the content used for enrichment in a doc values field, via a special meta field mapper, instead of in the _source stored field. This tradeoff is beneficial for enrich because retrieval speed is more important than optimal storage. (#42423)

We added a PR for performing validation on nested mapping fields (#42452) and one for limiting the number of concurrent policy executions (#42535).

Optimize sort on numeric fields

We have a PR to apply max-score optimizations on queries that sort documents by a numeric field. Currently all documents that match the query must be evaluated when sorting by a field other than score (relevancy) even if the total hits is not tracked. We have optimizations in place if the sort order is congruent with the index sort, but this implies that the index is sorted, which has a cost at indexing time, and the speedup only applies to a single case.

The new approach works for all date and long fields, in both ascending and descending order. It’s available by default as long as the field is indexed and doc values are enabled. Under the hood it uses the LongDistanceFeatureQuery, the query used to boost by proximity or recency when sorting by relevance, so documents that are not competitive are skipped with the same mechanism.

The initial performance tests showed great improvements (6-7x) in some cases and big regressions in others. This was expected, since the optimization depends on the cardinality of the field and the distribution of its values within the segments/shard. For these reasons we are working on a heuristic to detect, based on the values indexed for the field, whether this optimization should be applied. We also discussed ways to randomize the distribution of documents within a shard so that documents with the greatest/smallest values are not all in the same block. That is currently the case for time-based indices, where recent documents appear at the end of each segment/shard, so sorting by most recent is slower when this new optimization is applied.
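
To make the dependence on value distribution concrete, here is a toy model of skipping non-competitive blocks during a descending sort. It is only a sketch: it does not use LongDistanceFeatureQuery or any Lucene structure, and the function name, the fixed-size blocks and the per-block maximum are assumptions made for illustration.

```python
import heapq
import random


def top_k_desc(values, k, block_size):
    """Return (k largest values in descending order, number of blocks visited).

    A block is skipped when its maximum cannot beat the current k-th best
    value. In a real index the per-block maximum would be precomputed
    metadata; here it is computed on the fly to keep the sketch short.
    """
    heap = []  # min-heap holding the current top k
    visited = 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        if len(heap) == k and max(block) <= heap[0]:
            continue  # nothing in this block is competitive
        visited += 1
        for v in block:
            if len(heap) < k:
                heapq.heappush(heap, v)
            elif v > heap[0]:
                heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True), visited


if __name__ == "__main__":
    time_ordered = list(range(10_000))  # newest (largest) values come last
    shuffled = time_ordered[:]
    random.Random(0).shuffle(shuffled)
    print("time-ordered blocks visited:", top_k_desc(time_ordered, 10, 100)[1])
    print("shuffled blocks visited:", top_k_desc(shuffled, 10, 100)[1])
```

With time-ordered input and a descending sort, every block contains values larger than anything seen so far, so no block can be skipped. Shuffling the same values lets the threshold rise early, after which many later blocks are not competitive.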

We also created a rally-tracks PR to benchmark numeric sort for some of our datasets. This will help track the progress of this work, since we're planning multiple iterations on this feature.

Long running queries

We continued to work on improving the support for long running queries, opening a PR to separate frozen and non-frozen indices during search in order to minimize the cost of waiting for the response from slow (frozen) indices. This sparked an idea on how we could segregate searches based on the type of the indices/shards; a shard that didn't receive any updates for some time could be considered read-only during the search.

TLS Error Message Improvements

The default JRE SSL/TLS provider often doesn’t provide the error messages we’d like; we get messages like “PKIX path building failed”. Even when the message is somewhat descriptive, like “no cipher suites in common”, there isn’t enough information to actually solve the problem; listing the configured cipher suites would be a start. Some time ago, we started the Tealess project to try and provide useful diagnostics when TLS connections go bad. We’ve started exploring how we can use some of the ideas from Tealess in Elasticsearch so that the error messages we provide can be more meaningful to end users and guide them towards solutions. It’s early days and we’re not sure whether the current prototype is the direction we want to take, but it’s our current focus for trying to make security simpler.
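
The idea of attaching configuration details to a terse handshake error can be sketched as follows. This is neither Tealess nor the Elasticsearch prototype; the function name, its parameters and the output wording are all hypothetical.

```python
def explain_handshake_failure(message, configured_suites, peer_suites=None):
    """Build a longer diagnostic from a terse TLS error message.

    `configured_suites` is the list of cipher suites enabled locally.
    `peer_suites` is optional, since the peer's list is often unknown.
    """
    lines = [f"TLS handshake failed: {message}"]
    lines.append(
        "Locally configured cipher suites: "
        + (", ".join(configured_suites) or "(none)")
    )
    if peer_suites is not None:
        common = [s for s in configured_suites if s in peer_suites]
        if common:
            lines.append("Suites in common with the peer: " + ", ".join(common))
        else:
            lines.append(
                "No configured suite is offered by the peer; the peer offers: "
                + ", ".join(peer_suites)
            )
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        explain_handshake_failure(
            "no cipher suites in common",
            ["TLS_AES_128_GCM_SHA256"],
            ["TLS_AES_256_GCM_SHA384"],
        )
    )
```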

ReindexV2

We’re discussing how to make reindexing resilient to both data node and coordinator node failures. For data node resiliency, we can query our source data sorted by sequence numbers to allow reindex to resume after a failure. There will be no inter-object consistency guarantees in case of concurrent updates to the source index during reindex; the assumption is that the vast majority of reindex operations are done from indices that are effectively read-only during the operation.
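
A minimal sketch of the resume idea, assuming a single source shard held in memory and a checkpoint that records the last sequence number copied. The data layout, the checkpoint dict and the simulated failure are inventions for this example and do not reflect the design under discussion.

```python
class SimulatedFailure(Exception):
    """Stands in for a data node going away mid-reindex."""


def reindex(source, dest, checkpoint, batch_size=2, fail_after_batches=None):
    """Copy docs from `source` to `dest` in sequence-number order.

    source: list of {"seq_no": int, "id": str, "doc": dict}
    dest: dict mapping id -> doc
    checkpoint: {"last_seq_no": int}, updated after every completed batch
    Returns the number of batches copied by this call.
    """
    pending = sorted(
        (d for d in source if d["seq_no"] > checkpoint["last_seq_no"]),
        key=lambda d: d["seq_no"],
    )
    batches_done = 0
    for i in range(0, len(pending), batch_size):
        if fail_after_batches is not None and batches_done == fail_after_batches:
            raise SimulatedFailure("data node went away")
        batch = pending[i:i + batch_size]
        for d in batch:
            dest[d["id"]] = d["doc"]
        # Persisting this value is what lets a later call resume here.
        checkpoint["last_seq_no"] = batch[-1]["seq_no"]
        batches_done += 1
    return batches_done


if __name__ == "__main__":
    source = [{"seq_no": n, "id": f"doc-{n}", "doc": {"n": n}} for n in range(5)]
    dest, checkpoint = {}, {"last_seq_no": -1}
    try:
        reindex(source, dest, checkpoint, fail_after_batches=1)
    except SimulatedFailure:
        pass
    reindex(source, dest, checkpoint)  # resumes after the checkpoint
    print(len(dest), checkpoint)
```

Because the sketch only ever reads documents above the checkpoint, updates to already-copied documents during the run would be missed unless they receive a new, higher sequence number, which is the consistency caveat noted above.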

We created a meta-issue to track the work on the reindex resiliency-related items.

We fixed a long-standing issue where reindex from remote was not working with external versioning.

Distributed: Investigating test failures

We are working hard at reducing the number of open test failures (~50 at the beginning of May, ~30 last week, 22 now). Addressing some of these failures takes significant effort. To name a few:

  • We found a test bug that was causing a lot of noise in our CI.
  • We investigated a test failure which resulted in an adaptation of the cluster state APIs to always return a cluster state that has a master.
  • We investigated a test failure where the test could hang as our simulated network disruption of type "unresponsive" would silently swallow requests, breaking the premise that all transport requests eventually respond, either successfully or with a failure. We proposed a number of solutions to make the simulated network disruptions more realistic.
  • We investigated a test failure where a cluster would get stuck in a state where state recovery would not be triggered, and fixed a bug which prevented state recovery from running after a very unfortunate series of events.

Apache Lucene

Geo-Land

We are working on several improvements, all in review at this point:

  • Update tessellator logic to label if a triangle edge belongs to the original polygon. This is necessary to implement CONTAINS for LatLonShape.
  • Adding LatLonShape distance query.
  • Refactoring of EdgeTree class to support different types of components. This will allow us to support geometry collections in one query.
  • Fixing some corner cases in the tessellator, in particular some failures when holes share a vertex with the polygon.

Analysis

We collaborated on a patch to add an option in the Korean tokenizer to preserve punctuation tokens.

This option already exists in the Japanese tokenizer to allow a subsequent token filter to normalize Japanese numbers (kansūji) into decimal numbers. Since Korean numbers are similar, we added this new option to the Korean tokenizer and we are now iterating on adding a KoreanNumberFilter similar to the JapaneseNumberFilter.

Performance

We are also working on supporting two-phase iterators in the WANDScorer (boolean disjunctions). Disjunctions of clauses that use two-phase iterators (phrase queries, script, ...) are slower in Lucene 8 so this change should fix the regression.
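
For readers unfamiliar with two-phase iteration, the sketch below shows the general idea for a two-term phrase: a cheap approximation (documents containing both terms) followed by a costlier confirmation (checking positions). It does not model WANDScorer or any Lucene class; the index layout and function name are assumptions for illustration.

```python
def phrase_docs(index, first, second):
    """Find docs where `second` immediately follows `first`.

    index: term -> {doc_id: [positions]}
    Returns (matching doc ids, number of docs whose positions were checked).
    """
    # Phase 1 (approximation): docs that contain both terms at all.
    approx = sorted(set(index.get(first, {})) & set(index.get(second, {})))
    checked = 0
    hits = []
    for doc in approx:
        # Phase 2 (confirmation): only now look at positions.
        checked += 1
        second_positions = set(index[second][doc])
        if any(p + 1 in second_positions for p in index[first][doc]):
            hits.append(doc)
    return hits, checked


if __name__ == "__main__":
    index = {
        "quick": {1: [0], 2: [3], 3: [5]},
        "fox": {1: [1], 2: [0], 4: [2]},
    }
    print(phrase_docs(index, "quick", "fox"))
```

The benefit comes from running the confirmation step on as few documents as possible, which is why a scorer that understands the split can avoid position checks on documents it would discard anyway.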

Bugs

One of our CI builds tripped an assertion that looked critical but turned out to be a rather theoretical case where two consecutive commits are faster than applying an empty delete queue. We fixed the issue and improved correctness when applying deletes to protect against any future issues in this area, since it's crucial for data integrity.

Misc

  • The work on docvalues support for {Int|Long|Float|Double}Range fields continues.

Changes in Elasticsearch

Changes in 8.0:

  • BREAKING: Remove client jar support from build #42640
  • Remove SecurityClient from x-pack #42471
  • Add explicit build flag for experimenting with test execution cacheability #42649
  • Removes types from SearchRequest and QueryShardContext #42112
  • BREAKING: Remove "nodes/0" folder prefix from data path #42489
  • BREAKING: Remove support for chained multi-fields. #42333

Changes in 7.3:

  • Deprecate CommonTermsQuery and cutoff_frequency #42619
  • Remove unused "messy-test" plugin #42684
  • Prevent merging nodes' data paths #42665
  • Remove usage of deprecated compare gradle builds plugin #42687
  • [ML] [Data Frame] add support for weighted_avg agg #42646
  • Fix Class Load Order in Netty4Plugin #42591
  • Log leader and handshake failures by default #42342
  • Detect when security index is closed #42191
  • Allow aggregations using expressions to use _score #42652
  • Lazily compute Java 8 home in reindex configuration #42630
  • Validate routing commands using updated routing state #42066
  • Improve how internal representation of pipelines are updated #42257
  • Upgrade to Netty 4.1.36 #42543
  • Shard CLI tool always check shards #41480
  • Cluster state from API should always have a master #42454

Changes in 7.2:

  • Fix refresh remote JWKS logic #42662
  • Propagate version in reindex from remote search #42412
  • [ML-DataFrame] rewrite start and stop to answer with acknowledged #42589
  • [ML Data Frame] Set DF task state to stopped when stopping #42516
  • Avoid loading retention leases while writing them #42620
  • Reset state recovery after successful recovery #42576
  • [ML-DataFrame] add support for fixed_interval, calendar_interval, remove interval #42427
  • Remove IndexStore and DirectoryService #42446
  • Build cache init script #42484

Changes in 6.8:

  • Percolator: Exclude nested documents #42554
  • Deprecation info for joda-java migration #41956
  • un-mute Watcher rolling upgrade tests and bump up logging #42377
  • Added param ignore_throttled=false when indicesOptions.ignoreThrottle… #42393
  • Only index into "doc" type in security index #42563
  • fixed ignoring name parameter for percolator queries #42598
  • Do not use ifSeqNo for update requests on mixed cluster #42596
  • Address test failures for SmokeTestWatcherWithSecurityIT #42092
  • Fix sorting on nested field with unmapped #42451