This Week in Elasticsearch and Apache Lucene: 2020-02-28 | Elastic Blog

This Week in Elasticsearch - 2020-02-28

Elasticsearch

Better query cancellation

When a user cancels a search request, whether automatically by closing the HTTP channel or explicitly through the tasks API, we propagate the cancellation to all nodes and also cancel the child tasks that run per shard to collect and fetch the documents. We do this eagerly so that child tasks running on shards can stop wasting CPU and I/O in the middle of any query if the parent search request is cancelled. This mechanism works well in practice on regular queries, and cancellation is detected quickly. However, there are still some gaps in the following operations:

  • Terms dictionary expansion (fuzzy, regex, wildcard queries)
  • Complex BKD reads (geo and range queries)

We have opened a PR to handle these cases more efficiently by wrapping the original index reader. This allows us to check for cancellation whenever a query accesses the terms dictionary or the BKD reader, and therefore to cancel the operation more promptly.
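The wrapping idea can be illustrated with a minimal, language-agnostic sketch in Python. This is not the code from the PR and not Lucene's API: `cancellable`, `TaskCancelledError` and `check_interval` are hypothetical names chosen for illustration. The wrapper sits between the consumer and an expensive iteration (such as a terms dictionary expansion) and polls a cancellation flag every few items, so the hot loop does not pay for a check on every element.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


class TaskCancelledError(Exception):
    """Raised when the owning search task has been cancelled (hypothetical)."""


def cancellable(items: Iterable[T],
                is_cancelled: Callable[[], bool],
                check_interval: int = 1024) -> Iterator[T]:
    """Yield items unchanged, polling is_cancelled() every check_interval items."""
    for count, item in enumerate(items):
        # Checking on every item would slow down the hot loop,
        # so the flag is only polled periodically.
        if count % check_interval == 0 and is_cancelled():
            raise TaskCancelledError(f"cancelled after {count} items")
        yield item
```

A caller would wrap the underlying iteration once, for example `cancellable(expanded_terms, task_flag.is_set)`, where `expanded_terms` and `task_flag` (a `threading.Event`) are assumed to exist on the caller's side. The interval trades cancellation latency against per-item overhead.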

Mapping Validation

Historically, validation of mappings defined via dynamic templates has been lenient, and misconfigurations are only discovered when indexing a document with an unmapped field. For example:

PUT my_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "my_text": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "analyzer": "not_a_valid_analyzer_why_no_error?"
          }
        }
      }
    ]
  }
}

This request is accepted and index creation succeeds. However, when a document with a string field is subsequently indexed, the system responds that there is a mapping misconfiguration. This gives our users a bad experience, especially as there can be a large gap in time between when the index is created and when the first error appears.

We have recently spent some time addressing the lack of validation. As of 8.0, the above example will be rejected at mapping update time (which also includes index creation). The validation is still best effort, since there are some cases that cannot be validated at update time, but this should improve the experience for our users. See #51233 for additional details.
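To make the idea of validating at update time concrete, here is a minimal sketch in Python. It is not how Elasticsearch implements the check in #51233: the function name, the error message and the `KNOWN_ANALYZERS` set are assumptions for illustration, and only the `analyzer` setting is inspected. It walks the `dynamic_templates` of a mappings body shaped like the example above and fails fast on an analyzer it does not know.

```python
from typing import Any, Dict, Set

# Hypothetical subset of analyzer names, for illustration only.
KNOWN_ANALYZERS: Set[str] = {"standard", "simple", "whitespace", "keyword"}


def validate_dynamic_templates(mappings: Dict[str, Any],
                               known_analyzers: Set[str] = KNOWN_ANALYZERS) -> None:
    """Raise ValueError if a dynamic template references an unknown analyzer."""
    for entry in mappings.get("dynamic_templates", []):
        # Each entry is a single-key object: {template_name: template_body}.
        for name, template in entry.items():
            analyzer = template.get("mapping", {}).get("analyzer")
            if analyzer is not None and analyzer not in known_analyzers:
                raise ValueError(
                    f"dynamic template [{name}] references "
                    f"unknown analyzer [{analyzer}]"
                )
```

Called with the `mappings` object from the request above, this sketch raises immediately instead of waiting for the first document with a string field.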

Java and Joda Time

The migration from Joda to Java time happened in 7.0, but users have been having a difficult time migrating. Although we have been dealing with bugs reported in Java time behavior over the last couple of months, it became apparent that we did not provide the backwards compatibility guarantees that we should have. To address this, we are adding back support in 7.x for pre-7.0 indices that use Joda-style datetime formats, along with improved migration docs.

Lucene

Snowball

Today Lucene uses a pre-compiled version of Snowball to provide machine-generated stemmers and stop word lists in many languages. This mechanism allows us to provide stemming support in Hungarian, Turkish and Arabic, to name a few. However, a user complained that the version we're using is outdated: it was created 12 years ago from a specific commit, so we were missing many new languages and improvements. Snowball released v2.0 at the end of last year, so the necessary changes have been made to create a pre-compiled version based on this new release. As of Lucene 9.0 (Elasticsearch 8.0), we'll be using this new release, which provides support for Hindi, Indonesian, Nepali, Serbian and Tamil out of the box.

CheckIndex Integrity

When opening an index, Lucene performs some lightweight verification that the index is not corrupted. We cannot check everything eagerly because that would require reading large files in their entirety, so we have another tool called CheckIndex that can be run offline to perform a full check. However, an Elasticsearch test revealed that some checks were missing for compressed stored fields. While this is a bug that requires a fix, a PR was also opened to ensure that tests in Lucene can detect when we don't implement these checks correctly. This will help uncover this kind of bug earlier in the future.
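The difference between a lightweight check and a full check can be sketched in Python. The file layout here is an assumption made for illustration (a payload followed by a 4-byte big-endian CRC32) and is not Lucene's actual file format. The function names are hypothetical as well. The point is only that the cheap check looks at metadata, while the full check has to stream every byte.

```python
import os
import zlib

FOOTER_LEN = 4  # assumed layout: payload bytes, then a big-endian CRC32


def write_with_checksum(path: str, payload: bytes) -> None:
    """Write the payload followed by its CRC32 footer."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    with open(path, "wb") as f:
        f.write(payload)
        f.write(crc.to_bytes(FOOTER_LEN, "big"))


def has_footer(path: str) -> bool:
    """Lightweight check: only confirms the file is long enough for a footer."""
    return os.path.getsize(path) >= FOOTER_LEN


def verify_checksum(path: str, chunk_size: int = 1 << 16) -> bool:
    """Full check: reads the whole payload and compares it to the stored CRC32."""
    size = os.path.getsize(path)
    if size < FOOTER_LEN:
        return False
    remaining = size - FOOTER_LEN
    crc = 0
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                return False  # file shrank while being read
            crc = zlib.crc32(chunk, crc)
            remaining -= len(chunk)
        stored = int.from_bytes(f.read(FOOTER_LEN), "big")
    return (crc & 0xFFFFFFFF) == stored
```

A file with a single flipped payload byte still passes `has_footer` but fails `verify_checksum`, which is why the cheap check at open time cannot replace an offline full check.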

Korean Dictionary

The Korean tokenizer uses an internal dictionary that is shipped within the Lucene jar. This prevents users from providing their own dictionary, since we don't allow any override. They can still provide additional terms, but they have no way to replace the main dictionary other than regenerating the jar.

A proposal was made to adapt the mechanism that was added for the Japanese tokenizer. This should ease the experience for users that need to create domain-specific dictionaries in Korean.

Changes

Breaking Changes in Elasticsearch

Breaking Changes in 8.0:

  • Percentiles aggregation: disallow specifying same percentile values twice #52257

Breaking Changes in 7.7:

  • Add validation for dynamic templates #51233

Breaking Changes in Elasticsearch Hadoop Plugin

Breaking Changes in 8.0:

  • Boost default scroll size to 1000 #1429