This Week in Elasticsearch and Apache Lucene - 2018-08-31
Exposing IntervalQuery in the query DSL
Lucene 7.4 adds a new IntervalQuery, intended as a longer-term replacement for the Spans family. The problem with span queries is that they are quite tricky to use:
- Span queries don't go through analysis, so it's up to the user to create queries that match the exact terms in the index
- The rules on how different span queries can combine, and which query types can be used with which span queries, are complex
- Span queries can have complex hierarchies with queries nested at multiple levels
The interval query aims to provide similar capabilities through an easier-to-use API. An interval query targets a single field and defines "interval sources" that express the criteria for matching text in that field. There are multiple types of source, including:
- match - takes some text to match, analyses it and produces a term or phrase
- combine - combines other sources to allow matching sources near each other (either in order or not)
- relate - similar to "combine" in that it takes multiple sources, but matches sources based on their relation to each other (containing, not containing, contained by, etc.)
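As an illustrative sketch only (the option names here follow the description above and may well change while the work is under review; "my_text_field" and "max_gaps" are hypothetical), a query matching "hot" and "porridge" near each other, in order, might look roughly like:

```json
{
  "query": {
    "intervals": {
      "my_text_field": {
        "combine": {
          "ordered": true,
          "max_gaps": 5,
          "sources": [
            { "match": { "text": "hot" } },
            { "match": { "text": "porridge" } }
          ]
        }
      }
    }
  }
}
```

Unlike an equivalent span query, the text in each "match" source would go through the field's analyzer automatically.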
We have been working on a PR which exposes these queries in Elasticsearch. This is the first of several PRs that will extend the functionality further.
We merged support for dissect into ingest node! This processor breaks strings into parts as a simpler alternative to grok. Testing has shown that it is 8% faster than grok for the http_logs Rally track. This will be available in 6.5.0. Check out the docs and the specification.
We merged support for conditionals per processor. This enables users to specify conditionals that must be fulfilled for the processor to execute for a given document without having to resort to a script processor to do so. Preview of the documentation.
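Concretely, a processor can now carry an "if" option holding a condition, written as a Painless expression, that gates its execution. The sketch below assumes a document with a hypothetical "network_name" field:

```json
{
  "processors": [
    {
      "set": {
        "if": "ctx.network_name == 'Guest'",
        "field": "tagged",
        "value": true
      }
    }
  ]
}
```

Previously, this kind of per-processor gating required wrapping the logic in a script processor.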
We also added the ability to call another pipeline from within a pipeline, enabling pipeline reuse. This is a feature that was requested by the ingest team.
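A hedged sketch of the new pipeline processor follows; "common-enrichment" is a hypothetical pipeline registered separately, and the exact option key naming the target pipeline may differ from what is shown here, so check the documentation:

```json
{
  "description": "Outer pipeline that delegates to a shared one",
  "processors": [
    {
      "pipeline": {
        "name": "common-enrichment"
      }
    }
  ]
}
```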
We found a critical bug in distributed watch execution that can lead to some watches not being executed and others being executed twice. The underlying problem: when Watcher monitors the cluster state for shard changes to decide which watches a node owns, it was only taking local shard changes into account. It should watch for remote shard changes too, because the distribution of watches changes whenever a replica is added or removed. A fix will be included in 6.4.1.
Changes in 5.6:
- Apply settings filter to get cluster settings API #33247
Changes in 6.3:
- Painless: Fix Semicolon Regression #33212
Changes in 6.4:
- SQL: prevent duplicate generation for repeated aggs #33252
- Fix serialization of empty field capabilities response #33263
- Fix nested _source retrieval with includes/excludes #33180
- Watcher: Reload properly on remote shard change #33167
- Parse PEM Key files leniently #33173
- Core: Add java time xcontent serializers #33120
- Ensure to generate identical NoOp for the same failure #33141
Changes in 6.5:
- Painless: Fix Bindings Bug #33274
- Watcher: Fix race condition when reloading watches #33157
- Ignore module-info in jar hell checks #33011
- HLRC: add client side RefreshPolicy #33209
- Ingest: Add conditional per processor #32398
- [Rollup] Only allow aggregating on multiples of configured interval #32052
- ingest: Introduce the dissect processor #32884
- Painless: Add Bindings #33042
- HLRC: Use Optional in validation logic #33104
- HLRC: create base timed request class #33216
- Remove unsupported group_shard_failures parameter #33208
- INGEST: Add Pipeline Processor #32473
- REST high-level client: add reindex API #32679
- [Rollup] Better error message when trying to set non-rollup index #32965
- Introduce mapping version to index metadata #33147
- Token API supports the client_credentials grant #33106
- SQL: Enable aggregations to create a separate bucket for missing values #32832
- APM server monitoring #32515
- Add proxy support to RemoteClusterConnection #33062
- Security index expands to a single replica #33131
Changes in 7.0:
- TESTS: Fix overly long lines #33240
- BREAKING: Remove support for deprecated params._agg/_aggs for scripted metric aggregations #32979
- BREAKING: Scroll queries asking for rescore are considered invalid #32918
- Have circuit breaker succeed on unknown mem usage #33125
Delete-by-query bug with index sorting
There is a bug in Lucene with deletes-by-query on sorted indices, but it doesn't affect Elasticsearch, which always deletes by term; the delete-by-query endpoint runs deletes by term under the hood. If you have index sorting enabled and documents that are not flushed yet, then delete-by-query might delete the wrong documents. Lucene ensures that a delete-by-query only applies to documents that were already in the indexing buffer at delete time (as opposed to documents indexed between the delete and the next flush) by recording the number of buffered documents when the delete arrives and, at flush time, only deleting documents whose doc ID is less than that number. Sorted indices reorder documents on flush, which breaks this bookkeeping. We fixed the bug by resolving doc IDs prior to sorting, so that we know whether a document is eligible for deletion.
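For reference on the Elasticsearch side: the _delete_by_query endpoint first finds matching documents and then issues per-document, term-based deletes, which is why a request like the following is unaffected by the Lucene bug (the index and field names here are made up for illustration):

```json
POST /my-index/_delete_by_query
{
  "query": {
    "match": {
      "user": "kimchy"
    }
  }
}
```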
- We made sure to rewrite the query that is used to identify soft deletes before running it.
- We are replacing RAMDirectory with a new ByteBuffersDirectory.
- TopFieldCollector should stop calling the comparator when only counting hits.
- We are discussing how to keep the existing functionality of span queries while also fixing their problems.
- We exposed the current number of bytes being flushed in IndexWriter, which will help estimate how many more bytes need to be reclaimed via flushNextBuffer.