This Week in Elasticsearch and Apache Lucene - 2018-08-31
Exposing IntervalQuery in the query DSL
Lucene 7.4 adds a new IntervalQuery, intended as a longer-term replacement for the Spans family. The problem with span queries is that they are quite tricky to use:
- Span queries don't go through analysis, so it's up to the user to create queries that match the exact terms in the index
- The rules on how different span queries can combine, and which query types can be used with which span queries, are complex
- Span queries can have complex hierarchies with queries nested at multiple levels
The interval query aims to provide similar capabilities through an easier-to-use API. An interval query targets a single field and defines "interval sources" that express the criteria for matching text in that field. There are multiple types of source, including:
- match - takes some text to match, analyses it and produces a term or phrase
- combine - combines other sources to allow matching sources near each other (either in order or not)
- relate - similar to "combine" in that it takes multiple sources, but matches sources based on their relation to each other (containing, not containing, contained by, etc.)
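As an illustrative sketch only (the option names here follow the description above and may well change while the work is under review; "my_text_field" and "max_gaps" are hypothetical), a query matching "hot" and "porridge" near each other, in order, might look roughly like:

```json
{
  "query": {
    "intervals": {
      "my_text_field": {
        "combine": {
          "ordered": true,
          "max_gaps": 5,
          "sources": [
            { "match": { "text": "hot" } },
            { "match": { "text": "porridge" } }
          ]
        }
      }
    }
  }
}
```

Unlike an equivalent span query, the text in each "match" source would go through the field's analyzer automatically.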
We have been working on a PR which exposes these queries in Elasticsearch. This is the first of several PRs that will extend the functionality further.
We merged support for dissect into ingest node! This processor breaks strings into parts as a simpler alternative to grok. Testing has shown that it is 8% faster than grok for the http_logs Rally track. This will be available in 6.5.0. Check out the docs and the specification.
We merged support for conditionals per processor. This enables users to specify conditionals that must be fulfilled for the processor to execute for a given document without having to resort to a script processor to do so. Preview of the documentation.
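Concretely, a processor can now carry an "if" option holding a condition, written as a Painless expression, that gates its execution. The sketch below assumes a document with a hypothetical "network_name" field:

```json
{
  "processors": [
    {
      "set": {
        "if": "ctx.network_name == 'Guest'",
        "field": "tagged",
        "value": true
      }
    }
  ]
}
```

Previously, this kind of per-processor gating required wrapping the logic in a script processor.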
We also added the ability to call another pipeline from within a pipeline, enabling pipeline reuse. This is a feature that was requested by the ingest team.
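A hedged sketch of the new pipeline processor follows; "common-enrichment" is a hypothetical pipeline registered separately, and the exact option key naming the target pipeline may differ from what is shown here, so check the documentation:

```json
{
  "description": "Outer pipeline that delegates to a shared one",
  "processors": [
    {
      "pipeline": {
        "name": "common-enrichment"
      }
    }
  ]
}
```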
We found a critical bug in distributed watch execution that can lead to some watches not being executed and others being executed twice. The underlying problem: when Watcher monitors the cluster state for shard changes to decide which watches a node owns, it was only taking local shard changes into account. It should watch for remote shard changes too, because the distribution of watches changes whenever a replica is added or removed. A fix will be included in 6.4.1.
Changes in 5.6:
- Apply settings filter to get cluster settings API #33247
Changes in 6.3:
- Painless: Fix Semicolon Regression #33212
Changes in 6.4:
- SQL: prevent duplicate generation for repeated aggs #33252
- Fix serialization of empty field capabilities response #33263
- Fix nested _source retrieval with includes/excludes #33180
- Watcher: Reload properly on remote shard change #33167
- Parse PEM Key files leniently #33173
- Core: Add java time xcontent serializers #33120
- Ensure to generate identical NoOp for the same failure #33141
Changes in 6.5:
- Painless: Fix Bindings Bug #33274
- Watcher: Fix race condition when reloading watches #33157
- Ignore module-info in jar hell checks #33011
- HLRC: add client side RefreshPolicy #33209
- Ingest: Add conditional per processor #32398
- [Rollup] Only allow aggregating on multiples of configured interval #32052
- ingest: Introduce the dissect processor #32884
- Painless: Add Bindings #33042
- HLRC: Use Optional in validation logic #33104
- HLRC: create base timed request class #33216
- Remove unsupported group_shard_failures parameter #33208
- INGEST: Add Pipeline Processor #32473
- REST high-level client: add reindex API #32679
- [Rollup] Better error message when trying to set non-rollup index #32965
- Introduce mapping version to index metadata #33147
- Token API supports the client_credentials grant #33106
- SQL: Enable aggregations to create a separate bucket for missing values #32832
- APM server monitoring #32515
- Add proxy support to RemoteClusterConnection #33062
- Security index expands to a single replica #33131
Changes in 7.0:
- TESTS: Fix overly long lines #33240
- BREAKING: Remove support for deprecated params._agg/_aggs for scripted metric aggregations #32979
- BREAKING: Scroll queries asking for rescore are considered invalid #32918
- Have circuit breaker succeed on unknown mem usage #33125
Delete-by-query bug with index sorting
There is a bug in Lucene with deletes-by-query on sorted indices, but it doesn't affect Elasticsearch, which always deletes by term; the delete-by-query endpoint runs deletes by term under the hood. If you have index sorting enabled and documents that are not flushed yet, then delete-by-query might delete the wrong documents. Lucene ensures that a delete-by-query only applies to documents that were already in the indexing buffer at delete time (as opposed to documents indexed between the delete and the next flush) by recording the number of buffered documents when the delete arrives and, at flush time, only deleting documents whose doc ID is less than that number. Sorted indices reorder documents on flush, which breaks this bookkeeping. We fixed the bug by resolving doc IDs prior to sorting, so that we know whether a document is eligible for deletion.
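For reference on the Elasticsearch side: the _delete_by_query endpoint first finds matching documents and then issues per-document, term-based deletes, which is why a request like the following is unaffected by the Lucene bug (the index and field names here are made up for illustration):

```json
POST /my-index/_delete_by_query
{
  "query": {
    "match": {
      "user": "kimchy"
    }
  }
}
```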
- We made sure to rewrite the query that is used to identify soft deletes before running it.
- We are replacing RAMDirectory with a new ByteBuffersDirectory.
- TopFieldCollector should stop calling the comparator when only counting hits.
- We are discussing how to keep the existing functionality of span queries while also fixing their problems.
- We exposed the current number of bytes being flushed in IndexWriter, which will help estimate how many more bytes need to be reclaimed via flushNextBuffer.