This Week in Elasticsearch and Apache Lucene - 2018-04-27
Elasticsearch
We have merged work replacing hard deletes with soft deletes into the CCR (Cross-Cluster Replication) branch. Soft deletes are the first, and most difficult, step towards reading the history of operations from Lucene rather than from the translog. Moving to Lucene will let us use Lucene's power for sorting and searching, and thus avoid building similar capabilities into the translog.
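The core idea can be sketched in a few lines of Python (a conceptual illustration only — the class and method names below are invented and this is not Lucene's implementation): a delete flags the document instead of removing it, so regular searches see only live documents while the full history of operations remains readable from the index.

```python
# Conceptual sketch of soft deletes (invented names, not Lucene's code):
# a delete marks the document rather than dropping it, so the complete
# history of operations can still be replayed from the index.

class SoftDeleteIndex:
    def __init__(self):
        self.docs = []  # every operation is retained, in order

    def add(self, doc_id, source):
        self.docs.append({"id": doc_id, "source": source, "deleted": False})

    def delete(self, doc_id):
        # A "soft" delete: flag the latest live copy instead of removing it.
        for doc in reversed(self.docs):
            if doc["id"] == doc_id and not doc["deleted"]:
                doc["deleted"] = True
                return

    def search(self):
        # Regular searches only see live documents...
        return [d for d in self.docs if not d["deleted"]]

    def history(self):
        # ...but the history of operations is still fully recoverable,
        # which is what replication wants to read.
        return list(self.docs)

idx = SoftDeleteIndex()
idx.add(1, "first")
idx.add(2, "second")
idx.delete(1)
print(len(idx.search()))   # → 1 (live docs only)
print(len(idx.history()))  # → 2 (full history retained)
```

With hard deletes, the deleted document would eventually vanish entirely, which is why the history currently has to be reconstructed from the translog instead.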
Work on Zen2 continues, both on the membership model for the new Zen2 discovery code and on a formal proof of the new Zen2 approach.
We opened a PR for application privileges and are soliciting feedback from the Kibana team to ensure it will meet their needs.
We have updated the plan for removing types so that the Elasticsearch APIs become typeless by default in 8.0 rather than 9.0. In 7.0, Elasticsearch will emit a deprecation warning whenever users do not set include_type_name=false when calling an API, and in 8.0 it will fail requests that set include_type_name to true (https://github.com/elastic/elasticsearch/pull/29586).
Changes in 6.2:
Changes in 6.3:
- SQL: Add BinaryMathProcessor to named writeables list #30127
- Fix TermsSetQueryBuilder.doEquals() method #29629
- Deprecate the suggest metrics #29627
- Implement Iterator#remove for Cache values iter #29633
- Never leave stale delete tombstones in version map #29619
- X-Pack:
Changes in 6.4:
- In the field capabilities API, deprecate support for providing fields in the request body. #30157
- Add additional shards routing info in ShardSearchRequest #29533
- Do not add noop from local translog to translog again #29637
Changes in 7.0:
- Fix a bug in FieldCapabilitiesRequest#equals and hashCode. #30181
- BREAKING: Remove the suggest metric from stats APIs #29635
- Add support for field capabilities to the high-level REST client. #29664
- X-Pack:
Lucene
Backward compatibility
A discussion about adding a way to rewrite all segments to ease upgrading turned into a discussion about backward compatibility. Some users merge segments that were initially created by version X-1 with version X so that they can still be read by version X+1. While this solves the upgrade problem at the file-format level, issues remain with everything built on top of codecs and file formats, such as analysis, input validation, and the encoding of norms. Lucene 8 therefore introduces a restriction that prevents reading indices created before Lucene 7, because Lucene 7.0 removed index-time boosts and changed the way norms are encoded.
ConditionalTokenFilter
We introduced a new ConditionalTokenFilter, which makes it possible to implement conditional logic in analysis chains. For instance, you could apply different token filters depending on the character set a token uses.
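The behavior can be illustrated with a small Python sketch (the function names below are invented for illustration; Lucene's actual ConditionalTokenFilter is a Java API): a predicate decides, per token, whether the wrapped filter is applied.

```python
# Illustrative Python sketch of conditional token filtering (invented names;
# Lucene's ConditionalTokenFilter is a Java API). A predicate decides per
# token whether the wrapped filter runs.

def conditional_filter(tokens, predicate, filter_fn):
    """Apply filter_fn only to tokens for which predicate is true."""
    for token in tokens:
        yield filter_fn(token) if predicate(token) else token

def is_ascii(token):
    return all(ord(c) < 128 for c in token)

# Lowercase only tokens that are purely ASCII, leave others untouched --
# analogous to choosing a filter based on a token's character set.
tokens = ["HELLO", "Straße", "WORLD"]
print(list(conditional_filter(tokens, is_ascii, str.lower)))
# → ['hello', 'Straße', 'world']
```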
Removing memory codecs
Memory codecs occasionally cause out-of-memory errors in tests that index many documents, since they are not memory-efficient. It was suggested that they be removed, but it was argued that they may still be useful to users who need fast terms-dictionary lookups. In the end there was agreement that we should remove them, and that MMap preloading is a better option than forcefully loading terms into memory. This triggered an interesting side discussion: if we want features to be harder to remove, we also need to make them equally hard to add, or we will simply accumulate technical debt.
Other
- We are still planning on doing a 7.3.1 release in the near future.
- Currently we blindly merge readers that are passed to IndexWriter.addIndexes(CodecReader). We should validate these readers the way we validate documents at index time, for example by checking that offsets go forward.
- We could better cross-check terms with norms in CheckIndex.
- The check for pending deletes that IndexWriter performs at construction time has been moved to the Directory API, so that it no longer works only with subclasses of FSDirectory.
- When someone searches for a NEAR a, should a document that contains a single occurrence of a match?
- UAX29URLEmailTokenizer fails to recognize some URLs.
- Numeric and binary doc-value updates now share more code.
- Interrupting threads that are calling NIO APIs is unsafe.
- Should WordDelimiterFilter ignore tokens that are marked as keywords? Even though it is not documented, the keyword marker has mostly been used for stemming until now, so this could be a surprising change.
- DocumentsWriterFlushQueue is now simpler.
- The term() method of the new MatchesIterator API proved problematic to implement for phrases, so Alan removed it until we have a better idea of the use-case and how to expose the information.
- Separating IndexWriter from related classes helps simplify locking.
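The "offsets go forward" validation mentioned in the addIndexes item above can be sketched as follows (a simplified Python illustration, not Lucene's actual code): each token's start offset must not move backwards relative to the previous token, and its end offset must not precede its start.

```python
# Simplified illustration of an "offsets go forward" check, in the spirit
# of the validation Lucene applies at index time (not Lucene's code).

def check_offsets_go_forward(offsets):
    """offsets: list of (start, end) pairs for consecutive tokens.
    Raise ValueError if any start offset moves backwards or any
    end offset precedes its start offset."""
    last_start = -1
    for start, end in offsets:
        if start < last_start:
            raise ValueError(
                f"start offset {start} went backwards (last was {last_start})")
        if end < start:
            raise ValueError(
                f"end offset {end} is before start offset {start}")
        last_start = start

check_offsets_go_forward([(0, 5), (6, 10), (11, 15)])  # valid: passes silently
try:
    check_offsets_go_forward([(6, 10), (0, 5)])  # start moved backwards
except ValueError as e:
    print("rejected:", e)
```

Readers passed to addIndexes would be rejected by a check of this kind instead of being merged blindly.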