This Week in Elasticsearch and Apache Lucene - 2018-04-27
Elasticsearch
We have merged work replacing hard deletes with soft deletes into the CCR (Cross-Cluster Replication) branch. Soft deletes are the first, and most difficult, step towards reading the history of operations from Lucene rather than from the translog. Moving to Lucene will let us use Lucene's power for sorting and searching, and thus avoid building similar capabilities into the translog.
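The core idea can be sketched in a few lines of Python (a conceptual illustration only — the class and method names below are invented and this is not Lucene's implementation): a delete flags the document instead of removing it, so regular searches see only live documents while the full history of operations remains readable from the index.

```python
# Conceptual sketch of soft deletes (invented names, not Lucene's code):
# a delete marks the document rather than dropping it, so the complete
# history of operations can still be replayed from the index.

class SoftDeleteIndex:
    def __init__(self):
        self.docs = []  # every operation is retained, in order

    def add(self, doc_id, source):
        self.docs.append({"id": doc_id, "source": source, "deleted": False})

    def delete(self, doc_id):
        # A "soft" delete: flag the latest live copy instead of removing it.
        for doc in reversed(self.docs):
            if doc["id"] == doc_id and not doc["deleted"]:
                doc["deleted"] = True
                return

    def search(self):
        # Regular searches only see live documents...
        return [d for d in self.docs if not d["deleted"]]

    def history(self):
        # ...but the history of operations is still fully recoverable,
        # which is what replication wants to read.
        return list(self.docs)

idx = SoftDeleteIndex()
idx.add(1, "first")
idx.add(2, "second")
idx.delete(1)
print(len(idx.search()))   # → 1 (live docs only)
print(len(idx.history()))  # → 2 (full history retained)
```

With hard deletes, the deleted document would eventually vanish entirely, which is why the history currently has to be reconstructed from the translog instead.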
Work on Zen2 continues, both on the membership model for the new Zen2 discovery code and on a formal proof of the new Zen2 approach.
We opened a PR for application privileges and are soliciting feedback from the Kibana team to ensure it will meet their needs.
We have updated the plan for removing types so that the Elasticsearch APIs become typeless by default in 8.0 rather than 9.0. In 7.0, Elasticsearch will emit a deprecation warning whenever users do not set include_type_name=false when calling an API, and in 8.0 it will fail requests that set include_type_name to true (https://github.com/elastic/elasticsearch/pull/29586).
Changes in 6.2:
Changes in 6.3:
- SQL: Add BinaryMathProcessor to named writeables list #30127
- Fix TermsSetQueryBuilder.doEquals() method #29629
- Deprecate the suggest metrics #29627
- Implement Iterator#remove for Cache values iter #29633
- Never leave stale delete tombstones in version map #29619
- X-Pack:
Changes in 6.4:
- In the field capabilities API, deprecate support for providing fields in the request body. #30157
- Add additional shards routing info in ShardSearchRequest #29533
- Do not add noop from local translog to translog again #29637
Changes in 7.0:
- Fix a bug in FieldCapabilitiesRequest#equals and hashCode. #30181
- BREAKING: Remove the suggest metric from stats APIs #29635
- Add support for field capabilities to the high-level REST client. #29664
- X-Pack:
Lucene
Backward compatibility
A discussion about adding a way to rewrite all segments to ease upgrading turned into a discussion about backward compatibility. Some users merge segments that were initially created by version X-1 with version X so that they can still be read by version X+1. While this solves the upgrade problem at the file-format level, issues remain with everything built on top of codecs and file formats, such as analysis, input validation, and the encoding of norms. Lucene 8 therefore introduces a restriction that prevents reading indices created before Lucene 7, because Lucene 7.0 removed index-time boosts and changed the way norms are encoded.
ConditionalTokenFilter
We introduced a new ConditionalTokenFilter, which makes it possible to implement conditional logic in analysis chains. For instance, you could apply different token filters depending on the character set a token uses.
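The behavior can be illustrated with a small Python sketch (the function names below are invented for illustration; Lucene's actual ConditionalTokenFilter is a Java API): a predicate decides, per token, whether the wrapped filter is applied.

```python
# Illustrative Python sketch of conditional token filtering (invented names;
# Lucene's ConditionalTokenFilter is a Java API). A predicate decides per
# token whether the wrapped filter runs.

def conditional_filter(tokens, predicate, filter_fn):
    """Apply filter_fn only to tokens for which predicate is true."""
    for token in tokens:
        yield filter_fn(token) if predicate(token) else token

def is_ascii(token):
    return all(ord(c) < 128 for c in token)

# Lowercase only tokens that are purely ASCII, leave others untouched --
# analogous to choosing a filter based on a token's character set.
tokens = ["HELLO", "Straße", "WORLD"]
print(list(conditional_filter(tokens, is_ascii, str.lower)))
# → ['hello', 'Straße', 'world']
```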
Removing memory codecs
Memory codecs occasionally cause out-of-memory errors in tests that index many documents, since they are not memory-efficient. It was suggested that they be removed, but it was argued that they may still be useful to users who need fast terms-dictionary lookups. In the end there was agreement that we should remove them, and that MMap preloading is a better option than forcefully loading terms into memory. This triggered an interesting side discussion: if we want features to be harder to remove, we also need to make them equally hard to add, or we will simply accumulate technical debt.
Other
- We are still planning on doing a 7.3.1 release in the near future.
- Currently we blindly merge readers that are passed to IndexWriter.addIndexes(CodecReader). We should validate these readers the way we validate documents at index time, for example by checking that offsets go forward.
- We could better cross-check terms with norms in CheckIndex.
- The check for pending deletes that IndexWriter performs at construction time has been moved to the Directory API, so that it no longer works only with subclasses of FSDirectory.
- When someone searches for a NEAR a, should a document that contains a single occurrence of a match?
- UAX29URLEmailTokenizer fails to recognize some URLs.
- Numeric and binary doc-value updates now share more code.
- Interrupting threads that are calling NIO APIs is unsafe.
- Should WordDelimiterFilter ignore tokens that are marked as keywords? Even though it is not documented, the keyword marker has mostly been used for stemming until now, so this could be a surprising change.
- DocumentsWriterFlushQueue is now simpler.
- The term() method of the new MatchesIterator API proved problematic to implement for phrases, so Alan removed it until we have a better idea of the use-case and how to expose the information.
- Separating IndexWriter from related classes helps simplify locking.
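The "offsets go forward" validation mentioned in the addIndexes item above can be sketched as follows (a simplified Python illustration, not Lucene's actual code): each token's start offset must not move backwards relative to the previous token, and its end offset must not precede its start.

```python
# Simplified illustration of an "offsets go forward" check, in the spirit
# of the validation Lucene applies at index time (not Lucene's code).

def check_offsets_go_forward(offsets):
    """offsets: list of (start, end) pairs for consecutive tokens.
    Raise ValueError if any start offset moves backwards or any
    end offset precedes its start offset."""
    last_start = -1
    for start, end in offsets:
        if start < last_start:
            raise ValueError(
                f"start offset {start} went backwards (last was {last_start})")
        if end < start:
            raise ValueError(
                f"end offset {end} is before start offset {start}")
        last_start = start

check_offsets_go_forward([(0, 5), (6, 10), (11, 15)])  # valid: passes silently
try:
    check_offsets_go_forward([(6, 10), (0, 5)])  # start moved backwards
except ValueError as e:
    print("rejected:", e)
```

Readers passed to addIndexes would be rejected by a check of this kind instead of being merged blindly.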