This Week in Elasticsearch and Apache Lucene - 2018-04-27

Elasticsearch

We have merged work replacing hard deletes with soft deletes into the CCR (Cross Cluster Replication) branch. Soft deletes are the first, and most difficult, step in allowing us to read the history of operations from Lucene rather than the translog. Moving to Lucene will allow us to use Lucene's power for sorting and searching and thus avoid building similar capabilities into the translog.
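
As a rough illustration of what soft deletes mean at the Lucene level, here is a minimal sketch assuming the soft-deletes API that Lucene is growing alongside this work (IndexWriterConfig#setSoftDeletesField and IndexWriter#softUpdateDocument); the field name, id and in-memory directory are made up for illustration.

    // Sketch only: assumes the Lucene soft-deletes API that is being added around
    // this work. Instead of removing a document, an update marks the old version
    // with a doc-values field, so the full history of operations stays readable.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericDocValuesField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class SoftDeletesSketch {
      public static void main(String[] args) throws Exception {
        try (Directory dir = new RAMDirectory()) {
          IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
              .setSoftDeletesField("__soft_deletes"); // field marking soft-deleted docs
          try (IndexWriter writer = new IndexWriter(dir, config)) {
            Document doc = new Document();
            doc.add(new StringField("_id", "1", Field.Store.YES));
            writer.addDocument(doc);

            // Updating soft-deletes the previous version instead of hard-deleting it,
            // so readers that ignore the soft-deletes field can still replay history.
            Document updated = new Document();
            updated.add(new StringField("_id", "1", Field.Store.YES));
            writer.softUpdateDocument(new Term("_id", "1"), updated,
                new NumericDocValuesField("__soft_deletes", 1));
            writer.commit();
          }
        }
      }
    }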

Work on zen2 continues, both on the membership module of the new zen2 discovery code and on a formal proof for the new direction of zen2.

We opened a PR for application privileges and are soliciting feedback from the Kibana team to ensure it will meet their needs.

We have updated the plan to remove types so that the Elasticsearch APIs become typeless by default in 8.0 rather than 9.0. In 7.0, we will emit a deprecation warning whenever users do not set include_type_name=false when calling an API, and in 8.0 Elasticsearch will fail requests that set include_type_name to true (https://github.com/elastic/elasticsearch/pull/29586).
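
As a hedged sketch of what the typeless flow could look like for clients once this plan lands (this is not behaviour that ships today), here is an index creation using the low-level Java REST client; the host, index name and mapping are illustrative.

    // Sketch only: creating an index with a typeless mapping and explicitly opting
    // in via include_type_name=false, as described in the plan above. The index
    // name, mapping and host are made up.
    import java.util.Collections;
    import org.apache.http.HttpHost;
    import org.apache.http.entity.ContentType;
    import org.apache.http.nio.entity.NStringEntity;
    import org.elasticsearch.client.Response;
    import org.elasticsearch.client.RestClient;

    public class TypelessIndexCreation {
      public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
            new HttpHost("localhost", 9200, "http")).build()) {
          // Typeless mapping: no type name wrapping the "properties" object.
          NStringEntity mapping = new NStringEntity(
              "{\"mappings\":{\"properties\":{\"title\":{\"type\":\"text\"}}}}",
              ContentType.APPLICATION_JSON);
          Response response = client.performRequest(
              "PUT", "/my-index",
              Collections.singletonMap("include_type_name", "false"),
              mapping);
          System.out.println(response.getStatusLine());
        }
      }
    }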

Changes in 6.2:

  • X-Pack:
    • Execute watcher lifecycle changes in order
    • Watcher: use same value of watcher state (6.x)

Changes in 6.3:

  • SQL: Add BinaryMathProcessor to named writeables list #30127
  • Fix TermsSetQueryBuilder.doEquals() method #29629
  • Deprecate the suggest metrics #29627
  • Implement Iterator#remove for Cache values iter #29633
  • Never leave stale delete tombstones in version map #29619
  • X-Pack:
    • Security: cache users in PKI realm
    • SQL: implement COT, RANDOM, SIGN math functions
    • SQL: Add Atan2 and Power functions
    • [ML] Prevent unnecessary job updates.
    • SQL: Fix version loading in JDBC

Changes in 6.4:

  • In the field capabilities API, deprecate support for providing fields in the request body. #30157
  • Add additional shards routing info in ShardSearchRequest #29533
  • Do not add noop from local translog to translog again #29637

Changes in 7.0:

  • Fix a bug in FieldCapabilitiesRequest#equals and hashCode. #30181
  • BREAKING: Remove the suggest metric from stats APIs #29635
  • Add support for field capabilities to the high-level REST client. #29664
  • X-Pack:
    • Watcher: use same value of watcher state

Lucene

Backward compatibility

A discussion about adding a way to rewrite all segments to ease upgrading turned into a discussion about backward compatibility. Some users merge segments initially created by version (X-1) with version X so that they can be read with version (X+1). While this solves the upgrading issue at the file-format level, there are still issues with everything that is performed on top of codecs and file formats, such as analysis, input validation and the encoding of norms. Lucene 8 introduces a restriction that prevents you from reading indices created before Lucene 7, because Lucene 7.0 removed index boosts and changed the way that norms are encoded.
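
For the "rewrite all segments" part, Lucene already ships IndexUpgrader, which force-merges every segment into the current format; a minimal sketch of its use follows (the index path is hypothetical), though as noted above this only helps at the file-format level.

    // Sketch only: rewrite all segments of an existing index with the current
    // codec using Lucene's IndexUpgrader. The path is made up. This addresses the
    // on-disk formats only; it does not fix analysis differences or how norms
    // were originally encoded.
    import java.nio.file.Paths;
    import org.apache.lucene.index.IndexUpgrader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class UpgradeSegments {
      public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(Paths.get("/path/to/index"))) {
          new IndexUpgrader(dir).upgrade(); // force-merge every segment into the current format
        }
      }
    }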

ConditionalTokenFilter

A new ConditionalTokenFilter has been introduced that makes it possible to implement conditional logic in analysis chains. For instance, you could apply different token filters depending on the character set a token uses.
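
A minimal sketch of what such a filter could look like, assuming the API shape under discussion (a wrapper that takes the filter to apply and a shouldFilter() predicate); the class name and the Latin-script condition are made up for illustration.

    // Sketch only: lower-case only tokens made entirely of Latin characters,
    // leaving tokens in other scripts untouched. The ConditionalTokenFilter
    // constructor signature is assumed from the API being added.
    import java.io.IOException;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.miscellaneous.ConditionalTokenFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class LatinOnlyLowerCaseFilter extends ConditionalTokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

      public LatinOnlyLowerCaseFilter(TokenStream input) {
        super(input, LowerCaseFilter::new); // filter applied when shouldFilter() is true
      }

      @Override
      protected boolean shouldFilter() throws IOException {
        // Only route the current token through LowerCaseFilter if every character is Latin.
        for (int i = 0; i < termAtt.length(); i++) {
          if (Character.UnicodeScript.of(termAtt.charAt(i)) != Character.UnicodeScript.LATIN) {
            return false;
          }
        }
        return true;
      }
    }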

Removing memory codecs

Memory codecs occasionally cause out-of-memory errors in tests that index many documents since they are not memory-efficient. It was suggested that they be removed, but it was argued that they may be useful to users who need fast terms-dictionary lookups. In the end, there seems to be agreement that we should still remove them and that MMap preloading is a better option than forcibly loading terms into memory. The thread also triggered an interesting discussion about the fact that if we want features to be harder to remove, we will also need to make them equally hard to add, or we will just accumulate technical debt.
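
For reference, the MMap-preloading alternative mentioned above is already available on MMapDirectory; a minimal sketch (the index path is hypothetical):

    // Sketch only: ask MMapDirectory to touch all pages of newly mapped files up
    // front, so the terms dictionary (and everything else) sits in the page cache
    // without a RAM-resident codec. The path is made up.
    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.store.MMapDirectory;

    public class PreloadedIndex {
      public static void main(String[] args) throws Exception {
        try (MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"))) {
          dir.setPreload(true); // eagerly load mapped files when they are opened
          try (DirectoryReader reader = DirectoryReader.open(dir)) {
            System.out.println("docs: " + reader.numDocs());
          }
        }
      }
    }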

Other