This Week in Elasticsearch and Apache Lucene - 2018-12-15

Elasticsearch

Index Lifecycle Management

We completed work on the ILM UI, which will ship with 6.6. We also completed a couple of enhancements: a button to refresh the indices list in index management, and a display of step information in the index lifecycle management summary in index management. We also fixed a bug in Console when executing multiple requests.

We added an index setting to indicate that ILM should skip rollover of an index, which enables some Beats workflows as they adopt ILM and is a step towards making ILM and CCR work well together.

Cross Cluster Replication

We continue work on the CCR UI, adding client-side validation for auto-follow patterns and implementing some suggestions from design.

In Elasticsearch, bootstrapping a follower index from a leader index will internally be based on the repository restore functionality. Because multiple indices can be restored at the same time, our snapshot/restore logic needs to handle concurrent restores. We made the necessary changes to run restores in parallel, which will not only benefit CCR but also allow concurrent restores from snapshots.

We implemented the basic infrastructure for kicking off a restore of a follower index, configuring the new index with the correct settings for a follower index. We are wrapping up the work on the auto-follow patterns.

Audit Logging

We continue to improve the audit logging. This week we added the origin address to authentication success events, fixed the origin type for connection granted/denied, and opened a PR to add support for the X-Forwarded-For header.
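
For readers unfamiliar with the header: X-Forwarded-For carries a comma-separated list of addresses, conventionally with the original client first. The sketch below shows one way an origin address could be derived from it. The function name and the fallback behavior are hypothetical and are not taken from the PR.

```python
from typing import Optional


def origin_address(x_forwarded_for: Optional[str], remote_address: str) -> str:
    """Return the left-most X-Forwarded-For entry, else the socket address.

    Hypothetical helper for illustration only; it is not the logic of the
    Elasticsearch PR. The header is client-controlled, so it should only be
    trusted when the request arrives through a known proxy.
    """
    if x_forwarded_for:
        # Entries are comma-separated; the left-most one is the original client.
        first = x_forwarded_for.split(",")[0].strip()
        if first:
            return first
    # No usable header value: fall back to the address of the direct peer.
    return remote_address
```

With a header of "203.0.113.7, 10.0.0.1" this returns "203.0.113.7"; without the header it returns the peer address.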

Community PRs

We helped review a community PR that adds a keyed parameter to the percentiles bucket agg, which makes percentiles bucket consistent with other aggs. The issue was labeled "first-issue", and it is good to see the community using that label!

We helped a community member add pipeline validation for the auto_date_histo agg. Some aggs depend on order (derivative, etc.) and so validate their parent agg, but we forgot to add auto_date_histo to the list of accepted parents. The PR also did some refactoring to make it harder to miss this in the future.
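
The idea behind this kind of parent validation can be sketched as follows. This is a minimal illustration, not the Elasticsearch code: the set contents, the full name auto_date_histogram, and the function name are assumptions made for the example.

```python
from typing import Optional

# Illustrative sets only; the real lists live in the Elasticsearch source.
HISTOGRAM_PARENTS = frozenset({"histogram", "date_histogram", "auto_date_histogram"})
ORDER_DEPENDENT_PIPELINES = frozenset({"derivative"})


def validate_parent(pipeline_type: str, parent_type: Optional[str]) -> None:
    """Raise ValueError if an order-dependent pipeline agg has an unsuitable parent."""
    if pipeline_type not in ORDER_DEPENDENT_PIPELINES:
        return  # this pipeline agg does not care about bucket order
    if parent_type not in HISTOGRAM_PARENTS:
        raise ValueError(
            f"{pipeline_type} aggregation must have a histogram-like parent, "
            f"got {parent_type!r}"
        )
```

Keeping the accepted parents in a single shared set is one way to make a newly added histogram type hard to overlook.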

Vector Field

We merged support for a new vector field type, which supports both dense and sparse vectors. It will be possible to score documents by comparing a query vector against the stored vector using distance measurements (cosine, Manhattan, etc.). There are several use cases for this field in Machine Learning, Natural Language Processing, and Biosciences.
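
As a rough illustration of the measurements mentioned above, here is a minimal sketch of cosine similarity and Manhattan distance for dense vectors, plus cosine similarity for sparse vectors stored as dimension-to-value maps. The function names and the dict representation are assumptions for the example and say nothing about how the field is implemented.

```python
import math
from typing import Dict, Sequence


def _norm(values) -> float:
    """Euclidean length of a collection of components."""
    return math.sqrt(sum(v * v for v in values))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two dense vectors of equal length."""
    denom = _norm(a) * _norm(b)
    if denom == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / denom


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute per-dimension differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def sparse_cosine_similarity(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Cosine similarity where missing dimensions are treated as zero."""
    denom = _norm(a.values()) * _norm(b.values())
    if denom == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    # Only dimensions present in both vectors contribute to the dot product.
    dot = sum(value * b[dim] for dim, value in a.items() if dim in b)
    return dot / denom
```

The sparse variant only touches dimensions that are actually stored, which is the point of a sparse representation when most components are zero.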

SQL

We added two datetime functions, the non-ISO versions of DAY_OF_WEEK and WEEK_OF_YEAR (the ISO versions were already present in SQL), and changed how SQL handles REST requests by moving all parameters into the JSON body. We also wrapped up the histogram PR.
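
To illustrate the difference between ISO and non-ISO numbering, here is a small sketch. ISO 8601 numbers days Monday=1 through Sunday=7. The Sunday=1 numbering used for the non-ISO variant below is an assumption for the example, as are the function names; non-ISO week-of-year conventions vary and are not sketched here.

```python
from datetime import date


def iso_day_of_week(d: date) -> int:
    """ISO 8601 numbering: Monday=1 ... Sunday=7."""
    return d.isoweekday()


def sunday_first_day_of_week(d: date) -> int:
    """Assumed non-ISO numbering for illustration: Sunday=1 ... Saturday=7."""
    return d.isoweekday() % 7 + 1


def iso_week_of_year(d: date) -> int:
    """ISO 8601 week number; week 1 is the week containing the first Thursday."""
    return d.isocalendar()[1]
```

For Sunday 2018-12-16 the ISO function gives 7 while the Sunday-first function gives 1. ISO weeks can also cross year boundaries: Monday 2018-12-31 falls in ISO week 1 of 2019.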

We opened a PR to add a set of lightweight geo primitives for the JDBC driver. This allows the user to get typed geo objects instead of just WKT strings, but doesn't carry the dependency burden of something like JTS.

Changes

Changes in 6.5:

  • SQL: Fix MOD() for long and integer arguments #36599
  • fix MultiValuesSourceFieldConfig toXContent #36525
  • Fix origin.type for connection_* events #36410
  • Scripting: Properly support no-offset date formatting #36316

Changes in 6.6:

  • HLRC: Add get users action #36332
  • Always compress based on the settings (#36522) #36566
  • Warn when using use_dis_max in multi_match #36614
  • Add sequence numbers based optimistic concurrency control support to Engine #36467
  • SQL: be lenient for tests involving comparison to H2 but strict for csv spec tests #36498
  • Core: Add backcompat for joda time formats #36531
  • Geo: Adds a name of the field to geopoint parsing errors #36529
  • deprecation info API: fix value for index.shard.check_on_startup #36458
  • Override the JVM DNS cache policy #36570
  • deprecation info API: negative index.unassigned.node_left.delayed_timeout #36454
  • SQL: do not ignore all fields whose names start with underscore #36214
  • SETTINGS: Correctly Identify Noop Updates #36560
  • Periodically try to reassign unassigned persistent tasks #36069
  • Always compress based on the settings #36522
  • Deprecation check for index threadpool #36520
  • Deprecation check for HTTP pipelining #36521
  • BREAKING: Make node field in JoinRequest private #36405
  • [CCR] Change AutofollowCoordinator to use wait_for_metadata_version #36264
  • plugin install: don’t print download progress in batch mode #36361
  • [CCR] Clean followed leader index UUIDs in auto follow metadata #36408
  • Require soft-deletes when access changes snapshot #36446
  • HLRC: Implement get-user-privileges API #36292
  • SQL: non ISO 8601 versions of DAY_OF_WEEK and WEEK_OF_YEAR functions #36358
  • Modify BigArrays to take name of circuit breaker #36461
  • Modify BigArrays to take name of circuit breaker (#36461) #36501
  • [Painless] Add def to boxed type casts #36506
  • Deprecation check for percolator.map_unmapped_fields_as_string #36460
  • Add default methods to DocValueFormat #36480
  • SQL: move requests' parameters to requests JSON body #36149
  • add missing error type mapping for apm-server monitoring #36319
  • Add non-X-Pack centric rollup endpoints #36383
  • Simplify deprecation issue levels #36326
  • Deprecation check for http.enabled setting #36394
  • SQL: Simplify function registration and resolution #36417
  • Added parent validation for auto date histogram #35670
  • Add origin_address to authentication_success #36409
  • Deprecation check for tribe node #36240
  • [hlrc] add index templates exist API #36132
  • LLRC: Make warning behavior pluggable per request #36345
  • HLRC: Add rollup search #36334
  • [HLRC] Put Role #36209
  • Use delCount of SegmentInfos to calculate numDocs #36323
  • Exposed engine must include all operations below global checkpoint during rollback #36159
  • Option to use endpoints starting with _security #36379
  • [HLRC] Added support for Follow Stats API #36253
  • RestClient: on retry timeout add root exception #25576
  • [HLRC] Add support for put privileges API #35679
  • Undeprecate /_watcher endpoints #36269
  • HLRC: Add delete template API #36320
  • [Painless] Generate Bridge Methods #36097
  • Deprecation check for File Discovery plugin #36190
  • HLRC: Get Deprecation Info API #36279
  • Add support for inlined user dictionary in Nori #36123
  • Inner hits fail to propagate doc-value format. (#36310) #36355

Changes in 7.0:

  • Deprecate uses of _type as a field name in queries #36503
  • Re-deprecate xpack rollup endpoints #36451
  • Enable soft-deletes by default on 7.0.0 or later #36141
  • Deprecate types in explain requests. #35611
  • Deprecate types in update_by_query and delete_by_query #36365
  • [Zen2] Respect the no_master_block setting #36478
  • BREAKING: lower fielddata circuit breaker’s default limit #27162
  • Add discovery types to cluster stats #36442
  • Deprecate /_xpack/security/* in favor of /_security/* #36293
  • Deprecate types in get, exists, and multi get. #35930
  • Cancel GetDiscoveredNodesAction when bootstrapped #36423
  • Deprecate X-Pack centric watcher endpoints #36218
  • [Zen2] Support rolling upgrades from Zen1 #35737
  • For msearch templates, make sure to use the right name for deprecation logging. #36344
  • [Zen2] Add warning if cluster fails to form fast enough #35993
  • BREAKING: [Zen2] Best-effort cluster formation if unconfigured #36215
  • Inner hits fail to propagate doc-value format. #36310
  • Vector field #33022

Apache Lucene

Lucene 7.6

Lucene 7.6 has been released!

Improved LatLonShape

We are working on a new encoding for LatLonShape, a new geo solution, which will bring important improvements to this feature. We are looking forward to having this new encoding soon.

This week we also focused on refactoring the current tests to make sure they are not sensitive to numerical errors. This is very important for the progress of this feature, since stable tests form the basis for significant improvements like this. As part of this effort we pushed a change that separates query logic from spatial logic for bounding boxes. This allows the tests to use the same spatial logic as the query, so the two stay consistent.

Improvements to MatchAllDocsQuery

We started investigating slowdowns in the match_all query that our benchmarking team found in Elasticsearch after we moved master to a Lucene 8 snapshot. This resulted in a change to Lucene that allows MatchAllDocsQuery to shortcut if the total hit count is not needed.

It turned out that MatchAllDocsQuery was not making use of the new mechanisms in Lucene to skip document collection when an exact total hit count is not needed, and that the overhead associated with these mechanisms was in fact slowing things down. After changing it so that it can skip docs, Lucene benchmarks show MatchAllDocsQuery going from 500 QPS to 12,000 QPS, roughly a 24x improvement.
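
The effect of skipping collection can be sketched with a toy collector. This is a conceptual illustration only, not Lucene code: the function and parameter names are hypothetical. Because every document matched by a match-all query has the same score, the first k documents in doc ID order are already a valid top k.

```python
from typing import Iterable, List, Optional, Tuple


def collect_match_all(
    doc_ids: Iterable[int], top_k: int, track_total_hits: bool
) -> Tuple[List[int], Optional[int], int]:
    """Return (top hits, exact total or None, number of docs visited)."""
    hits: List[int] = []
    visited = 0
    for doc_id in doc_ids:
        # Once the top k is full, the remaining docs only matter for counting.
        if len(hits) >= top_k and not track_total_hits:
            break
        visited += 1
        if len(hits) < top_k:
            hits.append(doc_id)
    total = visited if track_total_hits else None
    return hits, total, visited
```

With 10,000 documents and top_k=10, the exact-count mode visits all 10,000 documents, while the shortcut mode visits only 10.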

Performance Improvements to applying DocValues updates

We started to investigate whether a sort algorithm that uses fewer swap operations than InPlaceMergeSorter performs better when accessing an element of the underlying data structure has a significant cost. This cost is particularly prominent when we sort packed ints while applying doc values updates in the IndexWriter. Using an IntroSorter gave us a 2.3x speedup for sorting at the price of a little more transient memory.

We pushed a change to share the term and docs enum more efficiently across terms when applying updates that yields a 2x performance gain on larger term dictionaries. This change also resulted in further cleanups such that we can now share the code on how to apply updates and deletes.

Additionally, we significantly improved RAM consumption and runtime performance when applying updates where all updates share the same value. This is the case, for instance, with soft-deletes.

Using a sparse bitset to store the doc IDs to update yields 10x performance gains and about an 8x reduction in RAM consumption.
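
To show why a sparse bitset saves memory when few doc IDs are set, here is a minimal sketch that stores only the non-zero 64-bit words. It is a conceptual illustration with hypothetical names, not Lucene's implementation, and it assumes non-negative doc IDs.

```python
from typing import Dict, Iterator


class SparseBitSet:
    """Bits are grouped into 64-bit words; only non-zero words are stored."""

    WORD_BITS = 64

    def __init__(self) -> None:
        self._words: Dict[int, int] = {}  # word index -> 64-bit mask

    def set(self, doc_id: int) -> None:
        word, bit = divmod(doc_id, self.WORD_BITS)
        self._words[word] = self._words.get(word, 0) | (1 << bit)

    def get(self, doc_id: int) -> bool:
        word, bit = divmod(doc_id, self.WORD_BITS)
        return bool((self._words.get(word, 0) >> bit) & 1)

    def words_allocated(self) -> int:
        return len(self._words)

    def __iter__(self) -> Iterator[int]:
        """Yield the set doc IDs in increasing order."""
        for word in sorted(self._words):
            bits = self._words[word]
            base = word * self.WORD_BITS
            while bits:
                lowest = bits & -bits  # isolate the lowest set bit
                yield base + lowest.bit_length() - 1
                bits ^= lowest
```

Setting doc IDs 3, 64 and 1,000,000 allocates three words here, whereas a dense bitset covering the same range would need 15,626 words.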

Other improvements

We added a gaps() method to IntervalIterator. This allows us to formulate positional queries such as (x NEAR y) within 4 positions of (z NEAR q), and is important for the API of our new DSL intervals query.
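
One way to picture what a gap count enables is sketched below. This reflects our reading of the idea and is not the Lucene API: the names and the representation of intervals as inclusive (start, end) position pairs are assumptions for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) token positions


def gaps(sub_intervals: List[Interval]) -> int:
    """Count the positions lying between consecutive sub-intervals."""
    ordered = sorted(sub_intervals)
    total = 0
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Overlapping or adjacent sub-intervals contribute no gap.
        total += max(0, next_start - prev_end - 1)
    return total


def within(sub_intervals: List[Interval], max_gaps: int) -> bool:
    """True if the sub-intervals are separated by at most max_gaps positions."""
    return gaps(sub_intervals) <= max_gaps
```

If (x NEAR y) matches positions 2 to 3 and (z NEAR q) matches positions 8 to 9, four positions separate them, so a "within 4 positions" constraint is satisfied; moving the second match to positions 9 to 10 would violate it.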

We worked with a new contributor on fixing explain details on ConstantScoreQuery.