This Week in Elasticsearch and Apache Lucene - 2019-03-29

Elasticsearch

6.7.0 released

We released Elastic Stack 6.7.0 in which we made ILM, CCR and SQL GA features. You can read about these features and all the other awesome things that went into Elasticsearch 6.7.0 in the release blog post. This blog post on CCR is also worth checking out for an overview of the feature.

Snapshot Lifecycle Management

Snapshot Lifecycle Management is a lot like Index Lifecycle Management, but instead of managing indices, it manages snapshots. This past week we have been busy discussing the scope of the work and working on a branch that can now actually take snapshots! Follow the progress at #40383.

Ingest Node Lookup Processor

We have spent time this past week developing various prototype implementations for the ingest node lookup processor. We are narrowing down the technical design and then we will re-visit some of the functional conversations and officially kick off the development.

Snapshot and Restore UI

There is a PR open for creating and editing repositories, and another to refactor our file organisation and approach towards TypeScript.

Packaging

For the 6.7 and 7.0 releases of Elasticsearch, we will use JDK 12 as the base JDK in the Docker images, and as the bundled JDK in our distributions (7.0 only). Additionally, the 7.0 Docker images will now use the bundled JDK.

Performance

We published a blog post on the real memory circuit breaker in Elasticsearch 7.0.0. It is definitely worth checking out to understand more about this alternative that will help avoid OutOfMemoryErrors.

Templated Role Mappings

We merged a change that allows role mappings to use Mustache templates. This is intended to support some edge cases for mapping from SAML attributes to Elasticsearch roles, and also situations where users are authenticated by a builtin realm (such as Active Directory) but their roles need to be defined in a custom provider.
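As a rough illustration, a templated role mapping request body might look like the sketch below. The role_templates field and the tojson helper follow the role mapping API as later documented; the realm name saml1 and the groups attribute are assumptions chosen for the example, not values taken from the change itself.

```python
import json

# Hypothetical body for a create-role-mapping request. Each entry in
# "role_templates" is a Mustache template that is evaluated against the
# authenticated user to produce role names.
role_mapping_body = {
    "role_templates": [
        {
            # Render the user's "groups" attribute as a JSON array of role names.
            "template": {"source": "{{#tojson}}groups{{/tojson}}"},
            "format": "json",
        }
    ],
    # Apply the mapping to users from an assumed SAML realm named "saml1".
    "rules": {"field": {"realm.name": "saml1"}},
    "enabled": True,
}

print(json.dumps(role_mapping_body, indent=2))
```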

Consistent Secure Settings

Elasticsearch has a way to store settings that need to have the same value across the whole cluster (cluster settings) and a way to store settings that are secret and need to be stored securely (keystore settings), but we don't currently have a way to get both of those properties. This feature is important for things like local encryption keys, where all nodes need to be encrypting data using the same set of keys.

We are working on Consistent Secure Settings that validate that, for specific keystore settings, all nodes have been configured with an identical value for that setting.
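One way such a check could work, without ever sharing the secret itself, is to compare salted digests of the value. The sketch below is only an illustration of that idea and is not a description of how Elasticsearch implements it; the salt, the function names and the iteration count are all hypothetical.

```python
import hashlib
import hmac

def salted_digest(secret: str, salt: bytes) -> bytes:
    """Derive a digest that can be published without revealing the secret."""
    return hashlib.pbkdf2_hmac("sha256", secret.encode("utf-8"), salt, 10_000)

def is_consistent(local_secret: str, published_digest: bytes, salt: bytes) -> bool:
    """Return True if the local keystore value matches the published digest."""
    local_digest = salted_digest(local_secret, salt)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(local_digest, published_digest)

# Hypothetical values: one node publishes the digest, every other node verifies.
salt = b"hypothetical-cluster-salt"
published = salted_digest("encryption-key-1", salt)

print(is_consistent("encryption-key-1", published, salt))
print(is_consistent("a-different-key", published, salt))
```

A node whose keystore holds a different value fails the comparison, which is the condition the validation is meant to detect.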

Snapshot resiliency

Snapshot deletions can leave orphaned files behind in cases where the master fails during the deletion process. We are looking for ways of making this process more resilient as well as generally speeding up the deletion process, which currently triggers the file deletions sequentially. One step is moving the repository-internal deletion APIs to an async interface, which will allow some sections of the snapshot deletion logic to be executed concurrently by multiple threads. Another change here is adding support for S3's bulk deletion API, which will dramatically reduce the number of S3 API requests for deletes and has the potential to massively speed up deleting large snapshots from S3.
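To give a feel for the reduction in request count, here is a minimal sketch of grouping blob keys into bulk delete batches. It relies on the S3 limit of 1000 keys per bulk delete request; the helper name and the blob names are made up for the example, and no actual S3 calls are made.

```python
from typing import Iterator, List

S3_BULK_DELETE_LIMIT = 1000  # maximum number of keys in one S3 bulk delete request

def batch_keys(keys: List[str], batch_size: int = S3_BULK_DELETE_LIMIT) -> Iterator[List[str]]:
    """Yield consecutive slices of keys, each small enough for one bulk request."""
    for start in range(0, len(keys), batch_size):
        yield keys[start:start + batch_size]

# Hypothetical blob names belonging to a snapshot that is being deleted.
blob_keys = [f"indices/example/0/__{i}" for i in range(2500)]
batches = list(batch_keys(blob_keys))

print(len(blob_keys), "single deletes vs", len(batches), "bulk requests")
```

With 2500 blobs, deleting one key per request takes 2500 requests, while batching takes 3.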

Replicated closed indices

We investigated how many replicated closed indices can be allocated on a small node instance with 1GB heap and explored ways to reduce the memory overhead. As replicated closed indices can't be indexed into or searched, we can avoid creating a mapper service and skip the initialization of the caching infrastructure that preallocates a fixed amount of memory per index.

Search As You Type

We have merged a new field type called search_as_you_type. It uses the inverted lists to provide an infix, document-based suggester.

Coupled with the new ability to skip non-competitive documents during search, this new field type is a good alternative to the completion suggester:

  • Any fields in the mapping can be used to filter the result so no need to add specific contexts in the suggest inputs.
  • The scoring of documents can be modified at query time using the ranking and boosts DSL like other regular queries.

We also added a new variant of the match query called match_bool_prefix that is particularly suited for this new field type. The match_bool_prefix query analyzes its input and constructs a bool query from the terms, except that the last term is treated as a prefix.
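The construction described above can be sketched in a few lines. This is only an illustration: lowercasing and splitting on whitespace stand in for the field's real analyzer, and the field name title and the helper name are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical mapping with a single search_as_you_type field.
mapping = {"properties": {"title": {"type": "search_as_you_type"}}}

def match_bool_prefix_as_bool(field: str, text: str) -> Dict[str, Any]:
    """Build the bool query that a match_bool_prefix query on `text` corresponds to."""
    terms = text.lower().split()  # stand-in for the field's analyzer
    if not terms:
        return {"match_none": {}}
    # Every term except the last becomes a term clause ...
    clauses: List[Dict[str, Any]] = [{"term": {field: t}} for t in terms[:-1]]
    # ... and the last term is matched as a prefix.
    clauses.append({"prefix": {field: terms[-1]}})
    return {"bool": {"should": clauses}}

query = match_bool_prefix_as_bool("title", "quick brown f")
print(query)
```

For the input "quick brown f", the result contains term clauses for quick and brown and a prefix clause for f.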

Script Score query

We added a random score function to the script_score query. This PR is the last piece of work needed to bring script_score to feature parity with function_score.

Rounding in SQL

We have rewritten the ROUND and TRUNCATE functions in SQL. Both functions take an optional parameter, and the way this parameter was handled in the constructor before the rewrite prevented SQL from correctly comparing two ROUND functions inside the same query. Think of SELECT ROUND(salary, 2) FROM emp GROUP BY ROUND(salary, 2), where each ROUND "instance" in the query is actually a separate Java instance and the two were not considered "equal". Because of this bug, the sample query above would have triggered an exception.

Lucene

Bugfixes

A bug in the bridge code between ValueSource and DoubleValuesSource was fixed.

An intervals issue is being investigated where an unordered search without overlaps can miss hits, but there is no issue for it yet.

We found a pretty serious issue where FilterDirectory didn't delegate pending deletes correctly, causing corrupted indices in certain situations. So far we have only seen this in test scenarios, not in the wild. The same issue was also chased but this fix was simpler.

Enhancements

The constant score query can now terminate queries early if the minimum score is greater than the constant score and the number of hits that match the query is not tracked.

We continued to work on improving the block join query (nested query) to add early termination when the score of children is ignored and to fix a regression when scores are requested.

Lucene can now automatically pull up disjunctions in intervals queries where the internal gaps of an interval are important (i.e. in phrase queries or queries wrapped with a MAXGAPS source). Take the disjunction OR(ORDERED("a", "b"), "a"); given the document string a b c, this will only return the interval 'a', as 'a -> b' gets minimised away. This in turn means that BLOCK(OR(ORDERED("a", "b"), "a"), "c") won't match the document, because the resulting interval 'a -> c' contains a gap. The way round this is to rewrite to OR(BLOCK("a", "b", "c"), BLOCK("a", "c")), which is now done automatically. Because this can end up being less efficient (intervals for "c" get pulled twice in the rewritten source here), there is an option to prevent a disjunction from being rewritten.
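The rewrite can be pictured as distributing BLOCK over its OR children. The sketch below is a toy model and not Lucene's implementation: sources are plain strings or (name, children) tuples, only OR nodes that are direct children of the BLOCK are pulled up, and ORDERED("a", "b") is kept nested rather than flattened into the block.

```python
from itertools import product

def alternatives(source):
    """Return the options a child contributes: each OR branch, or the child itself."""
    if isinstance(source, tuple) and source[0] == "OR":
        return list(source[1])
    return [source]

def pull_up_block(children):
    """Rewrite BLOCK(children) into an OR over BLOCKs with no direct OR children."""
    combos = product(*(alternatives(child) for child in children))
    blocks = [("BLOCK", combo) for combo in combos]
    # A single combination means there was nothing to pull up.
    return blocks[0] if len(blocks) == 1 else ("OR", tuple(blocks))

# BLOCK(OR(ORDERED("a", "b"), "a"), "c") from the example above.
query_children = [("OR", (("ORDERED", ("a", "b")), "a")), "c"]
rewritten = pull_up_block(query_children)
print(rewritten)
```

In the result, "c" appears in both BLOCK branches, which mirrors the efficiency cost mentioned above: its intervals have to be pulled once per branch.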

We are currently working on a small change to WordDelimiterGraphFilter that will make removing the deprecated WordDelimiterFilter easier.

Changes

Changes in Elasticsearch

Changes in 7.1:

  • Updates max dimensions for sparse_vector and dense_vector to 1024. #40597
  • Add start and stop time to cat recovery API #40378
  • Add randomScore function in script_score query #40186
  • Move top-level pipeline aggs out of QuerySearchResult #40319
  • SQL: Polish behavior of SYS TABLES command #40535
  • Improve error message for absence of indices #39789
  • Get node ID from nodes info in REST tests #40052
  • search as you type fieldmapper #35600
  • ignore 409 conflict in reindex responses #39543
  • Fix an off-by-one error in the vector field dimension limit. #40489
  • No mapper service and index caches for replicated closed indices #40423
  • SQL: Adjust the precision and scale for drivers #40467
  • Remove String interning from o.e.index.Index. #40350
  • [ML] Data Frame HLRC Get API #40209
  • Adding a soft limit to the field name length. Closes #33651 #40309
  • Support mustache templates in role mappings #39984
  • Resolve JAVA_HOME at windows service install time #39714
  • SQL: Polish parsing of CAST expression #40428
  • [ML] Data Frame HLRC Get Stats API #40327
  • SQL: Fix classpath discovery on Java 10+ #40420
  • SQL: Spec tests now use classpath discovery #40388
  • Add implicit this for class binding in Painless #40285

Changes in 7.0:

  • Optimise rejection of out-of-range long values #40325
  • Use default discovery implementation for single-node discovery #40036
  • Deprecate types in _graph/explore calls. #40466
  • Remove timeout task after completing cluster state publication #40411

Changes in 6.7:

  • Parse composite patterns using ClassicFormat.parseObject #40100
  • Correct ILM metadata minimum compatibility version #40569
  • Handle null retention leases in WaitForNoFollowersStep #40477
  • SQL: add "fuzziness" option to QUERY and MATCH function predicates #40529
  • Wrap Dockerfile yum operations in retries #40349
  • Only run retention lease actions on active primary #40386
  • Fix major version in 6.7 branch #40507
  • Store Pending Deletions Fix #40345
  • Update feature aware check ASM to 7.1 #40389
  • SQL: Fix RLIKE bug and improve testing for RLIKE statement #40354
  • SQL: CAST supports both SQL and ES types #40365
  • Geo Point parse error fix #40447

Changes in 6.6:

  • SQL: Fix getTime() methods in JDBC #40484
  • SQL: Fix metric aggs on date/time to not return double #40377
  • SQL: Add missing handling of IP field in JDBC #40384
  • SQL: JLine upgrade and polishing #40321

Changes in Elasticsearch Management UI

Changes in 7.1:

  • Instrument ILM with user action telemetry #33089

Changes in Elasticsearch SQL ODBC Driver

Changes in 7.1:

  • Fail quickly if hostname starts with a HTTP(S) scheme #138
  • Enable testing with all Kibana sample data #137
  • Add request params for multi-value fields and timezone #132

Changes in 6.7:

  • Advertise supported current_date and current_timestamp #136