This Week in Elasticsearch and Apache Lucene - 2018-08-31

Elasticsearch

Exposing IntervalQuery in the query DSL

Lucene 7.4 adds a new IntervalQuery, intended as a longer-term replacement for the Spans family. The problem with span queries is that they are quite tricky to use:

  • Span queries don't go through analysis, so it's up to the user to create queries that match the exact terms in the index
  • The rules on how different span queries can be combined, and which query types can be used inside which span query, are complex
  • Span queries can have complex hierarchies with queries nested at multiple levels

The interval query aims to provide similar capabilities in an easier-to-use API. It is defined on a single field and uses "interval sources" to express the criteria for matching text in that field. There are multiple types of sources, including:

  • match - takes some text to match, analyses it and produces a term or phrase
  • combine - combines other sources to allow matching sources near each other (either in order or not)
  • relate - similar to "combine" in that it takes multiple sources, but matches sources based on their relation to each other (containing, not containing, contained by, etc.)
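
To make the three kinds of sources more concrete, here is a minimal toy model in Python. It is only a sketch of the idea, not the Lucene or Elasticsearch implementation: the function names (analyse, match, combine_ordered, contained_by) are hypothetical, the "analyzer" is just lowercasing plus whitespace splitting, and every candidate interval is returned rather than whatever subset the real query reports.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) token positions, both inclusive


def analyse(text: str) -> List[str]:
    """Stand-in for a real analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def match(tokens: List[str], text: str) -> List[Interval]:
    """Toy "match" source: analyse the text, then locate the term or phrase."""
    terms = analyse(text)
    if not terms:
        return []
    n = len(terms)
    return [(i, i + n - 1) for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == terms]


def combine_ordered(a: List[Interval], b: List[Interval],
                    max_width: int) -> List[Interval]:
    """Toy "combine" source: an interval from `a` followed by one from `b`,
    with the combined interval spanning at most `max_width` tokens."""
    found = {(s1, e2) for (s1, e1) in a for (s2, e2) in b
             if e1 < s2 and e2 - s1 + 1 <= max_width}
    return sorted(found)


def contained_by(inner: List[Interval], outer: List[Interval]) -> List[Interval]:
    """Toy "relate" source: keep intervals of `inner` that lie inside some
    interval of `outer`."""
    return [(s, e) for (s, e) in inner
            if any(os <= s and e <= oe for (os, oe) in outer)]


tokens = analyse("The quick brown fox jumps over the lazy dog")
phrase = match(tokens, "Quick Brown")                    # [(1, 2)]
near = combine_ordered(phrase, match(tokens, "fox"), 3)  # [(1, 3)]
inside = contained_by(match(tokens, "fox"), near)        # [(3, 3)]
```

Note how the user-supplied text "Quick Brown" goes through the same analysis step as the document, which is exactly what span queries leave to the caller.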

We have been working on a PR which exposes these queries in Elasticsearch. This is the first of a few PRs; later ones will extend the functionality further.

Ingest Node

We merged support for dissect into ingest node! This processor breaks strings into parts as a simpler alternative to grok. Testing has shown that it is 8% faster than grok for the http_logs Rally track. This will be available in 6.5.0. Check out the docs and the specification.
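
As a rough illustration of why splitting on delimiters can be simpler than regular-expression matching, here is a toy dissect-style parser in Python. It is a sketch under stated assumptions, not the processor's implementation: the dissect function below is hypothetical, it only understands plain %{key} fields separated by literal text, and the sample log line and field names are made up for the example.

```python
import re
from typing import Dict, Optional


def dissect(pattern: str, text: str) -> Optional[Dict[str, str]]:
    """Split `text` using the literal delimiters found between %{key} fields.

    Toy limitations: no key modifiers, and two adjacent keys with no
    delimiter between them are not supported.
    """
    # re.split with a capturing group yields [literal, key, literal, key, ...]
    parts = re.split(r"%\{([^}]*)\}", pattern)
    prefix, rest = parts[0], parts[1:]
    if not text.startswith(prefix):
        return None
    pos = len(prefix)
    result: Dict[str, str] = {}
    for key, delim in zip(rest[0::2], rest[1::2]):
        if delim == "":
            end = len(text)  # nothing follows this key: take the remainder
        else:
            end = text.find(delim, pos)
            if end == -1:
                return None  # delimiter missing, the line does not fit
        result[key] = text[pos:end]
        pos = end + len(delim)
    return result


pattern = ('%{clientip} %{ident} %{auth} [%{timestamp}] '
           '"%{verb} %{request} HTTP/%{httpversion}" %{status} %{size}')
line = ('1.2.3.4 - - [30/Aug/2018:12:00:00 +0000] '
        '"GET /index.html HTTP/1.1" 200 512')
fields = dissect(pattern, line)
```

Each field is found with a single forward scan for the next delimiter, with no backtracking, which is the property that makes this style of parsing cheap.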

We merged support for conditionals per processor. This lets users specify a condition that must be fulfilled for the processor to execute for a given document, without having to resort to a script processor. A preview of the documentation is available.

We also added the ability to call another pipeline from within a pipeline, enabling pipeline reuse. This is a feature that was requested by the ingest team.
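
The two features compose naturally, which the following toy model in Python tries to show. It is not the Elasticsearch API: PIPELINES, run_pipeline and set_field are hypothetical names, the processor dictionaries are a made-up shape, and the condition is a plain Python callable used purely as a stand-in.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]
Processor = Dict[str, Any]

# Hypothetical registry of named pipelines (each a list of processors).
PIPELINES: Dict[str, List[Processor]] = {}


def set_field(field: str, value: Any) -> Callable[[Doc], None]:
    """Build a trivial processor action that sets one field."""
    def action(doc: Doc) -> None:
        doc[field] = value
    return action


def run_pipeline(name: str, doc: Doc) -> Doc:
    """Run each processor in order, honouring conditions and nested pipelines."""
    for proc in PIPELINES[name]:
        condition = proc.get("if")
        if condition is not None and not condition(doc):
            continue  # condition not fulfilled: skip this processor
        if "pipeline" in proc:
            run_pipeline(proc["pipeline"], doc)  # reuse another pipeline
        else:
            proc["run"](doc)
    return doc


PIPELINES["add_env"] = [{"run": set_field("env", "prod")}]
PIPELINES["main"] = [
    {"if": lambda d: d.get("level") == "error", "run": set_field("alert", True)},
    {"pipeline": "add_env"},
]

error_doc = run_pipeline("main", {"level": "error"})
info_doc = run_pipeline("main", {"level": "info"})
```

In this sketch only the error document gets the alert field, while both documents pass through the shared add_env pipeline.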


Watcher

We found a critical bug in distributed watch execution that can lead to some watches not being executed and others being executed twice. Watcher monitors the cluster state for shard changes to decide which watches it owns, but it was only taking local shard changes into account. It should react to remote shard changes too, because the distribution of watches changes when a replica is added or removed. The fix will be included in 6.4.1.
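
The following Python sketch illustrates why a remote shard change matters. It is a simplified model with assumptions stated up front: the owns_watch function is hypothetical, and assigning a watch to a shard copy by hashing its ID modulo the number of copies is an illustrative scheme, not a description of Watcher's actual code.

```python
import zlib
from typing import List


def owns_watch(watch_id: str, local_copy: str, all_copies: List[str]) -> bool:
    """Assumed scheme: hash the watch ID onto one of the shard copies."""
    copies = sorted(all_copies)
    slot = zlib.crc32(watch_id.encode("utf-8")) % len(copies)
    return copies[slot] == local_copy


watches = [f"watch-{i}" for i in range(20)]

# Node A holds the only copy, so it owns every watch.
stale_a = [w for w in watches if owns_watch(w, "node-a", ["node-a"])]

# A replica appears on node B. This is a remote change from A's point of view.
fresh_a = [w for w in watches if owns_watch(w, "node-a", ["node-a", "node-b"])]
fresh_b = [w for w in watches if owns_watch(w, "node-b", ["node-a", "node-b"])]

# If A keeps its stale view, everything B now owns runs on both nodes.
executed_twice = sorted(set(stale_a) & set(fresh_b))
```

The reverse case gives the missed executions: if the replica on node B is removed and node A keeps the two-copy view, the watches that were in B's slot are owned by nobody.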

Changes

Changes in 5.6:

  • Apply settings filter to get cluster settings API #33247

Changes in 6.3:

  • Painless: Fix Semicolon Regression #33212

Changes in 6.4:

  • SQL: prevent duplicate generation for repeated aggs #33252
  • Fix serialization of empty field capabilities response #33263
  • Fix nested _source retrieval with includes/excludes #33180
  • Watcher: Reload properly on remote shard change #33167
  • Parse PEM Key files leniantly #33173
  • Core: Add java time xcontent serializers #33120
  • Ensure to generate identical NoOp for the same failure #33141

Changes in 6.5:

  • Painless: Fix Bindings Bug #33274
  • Watcher: Fix race condition when reloading watches #33157
  • Ignore module-info in jar hell checks #33011
  • HLRC: add client side RefreshPolicy #33209
  • Ingest: Add conditional per processor #32398
  • [Rollup] Only allow aggregating on multiples of configured interval #32052
  • ingest: Introduce the dissect processor #32884
  • Painless: Add Bindings #33042
  • HLRC: Use Optional in validation logic #33104
  • HLRC: create base timed request class #33216
  • Remove unsupported group_shard_failures parameter #33208
  • INGEST: Add Pipeline Processor #32473
  • REST high-level client: add reindex API #32679
  • [Rollup] Better error message when trying to set non-rollup index #32965
  • Introduce mapping version to index metadata #33147
  • Token API supports the client_credentials grant #33106
  • SQL: Enable aggregations to create a separate bucket for missing values #32832
  • APM server monitoring #32515
  • Add proxy support to RemoteClusterConnection #33062
  • Security index expands to a single replica #33131

Changes in 7.0:

  • TESTS: Fix overly long lines #33240
  • BREAKING: Remove support for deprecated params._agg/_aggs for scripted metric aggregations #32979
  • BREAKING: Scroll queries asking for rescore are considered invalid #32918
  • Have circuit breaker succeed on unknown mem usage #33125

Apache Lucene

Lucene 8

We are checking out what remains to be done to release Lucene 8.0.

Delete-by-query bug with index sorting

There is a bug with delete-by-query on sorted indices, but it doesn't affect Elasticsearch, which always deletes by term: the delete-by-query endpoint runs deletes by term under the hood. If you have index sorting enabled and documents that are not flushed yet, then a Lucene delete-by-query might delete the wrong documents. This is because Lucene makes sure that a delete-by-query only deletes documents that are already in the indexing buffer (as opposed to documents that get indexed between the delete and the next flush) by recording the number of documents in the buffer at delete time and, at flush time, only deleting documents whose doc ID is less than this number. Sorted indices reorder documents on flush, which breaks this check. We fixed the bug by resolving the doc ID prior to sorting in order to know whether a document is eligible for deletion.
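
The mechanism described above can be reproduced with a small simulation in Python. This is a sketch of the logic only, not Lucene code: the flush function, the document shape and the "fixed" switch are all hypothetical names introduced for the example.

```python
from typing import Callable, Dict, List, Tuple

Doc = Dict[str, int]
# A buffered delete: (query, number of docs in the buffer at delete time)
BufferedDelete = Tuple[Callable[[Doc], bool], int]


def flush(buffer: List[Doc], deletes: List[BufferedDelete],
          sort_key: str, fixed: bool) -> List[Doc]:
    """Sort the buffered docs and apply buffered deletes at flush time.

    fixed=False compares the post-sort doc ID with the recorded limit (the bug);
    fixed=True compares the pre-sort (buffer) doc ID (the fix).
    """
    # order[new_id] is the doc ID the document had before sorting
    order = sorted(range(len(buffer)), key=lambda old_id: buffer[old_id][sort_key])
    survivors = []
    for new_id, old_id in enumerate(order):
        doc = buffer[old_id]
        doc_id = old_id if fixed else new_id
        deleted = any(query(doc) and doc_id < limit for query, limit in deletes)
        if not deleted:
            survivors.append(doc)
    return survivors


doc_a = {"name": 1, "ts": 50, "tag": 7}   # indexed before the delete
doc_b = {"name": 2, "ts": 10, "tag": 7}   # indexed after the delete

buffer = [doc_a]
deletes = [(lambda d: d["tag"] == 7, len(buffer))]  # limit recorded as 1
buffer.append(doc_b)

buggy = flush(buffer, deletes, sort_key="ts", fixed=False)
correct = flush(buffer, deletes, sort_key="ts", fixed=True)
```

Sorting by ts moves doc_b to position 0, so the buggy variant deletes doc_b and keeps doc_a, while the fixed variant deletes doc_a, the only document that was in the buffer when the delete was issued.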

Other