This Week in Elasticsearch and Apache Lucene - 2018-09-28

Elasticsearch

Highlights

Annotated Text Plugin

We have merged the annotated_text plugin, which allows users to index text that combines free text with special markup. The markup is typically used to identify items of interest such as people or organizations (see NER, or Named Entity Recognition, tools). This new plugin will be available from 6.5.
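
To give a feel for what "free text plus markup" means, here is a minimal Python sketch that separates the two. It assumes a markdown-like form, [covered text](value1&value2) with URL-encoded values; that syntax, the function name parse_annotated and the sample sentence are assumptions made for illustration, and this is not the plugin's own parser.

```python
import re
from urllib.parse import unquote_plus

# Assumed markup: [covered text](value1&value2), values URL-encoded.
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_annotated(text):
    """Split marked-up text into plain text plus annotations.

    Each annotation is (start, end, values), with offsets that point
    into the returned plain text, not into the marked-up input.
    """
    plain_parts = []
    annotations = []
    pos = 0      # read position in the marked-up input
    out_len = 0  # length of the plain text built so far
    for match in ANNOTATION.finditer(text):
        before = text[pos:match.start()]
        plain_parts.append(before)
        out_len += len(before)

        covered = match.group(1)
        values = [unquote_plus(v) for v in match.group(2).split("&")]
        annotations.append((out_len, out_len + len(covered), values))

        plain_parts.append(covered)
        out_len += len(covered)
        pos = match.end()
    plain_parts.append(text[pos:])
    return "".join(plain_parts), annotations


plain, anns = parse_annotated(
    "[Ada](Ada+Lovelace) worked with [Babbage](Charles+Babbage&person)"
)
print(plain)  # Ada worked with Babbage
print(anns)
```

The plain text would be analyzed as usual, while the annotation values could be indexed at the positions of the text they cover.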

Watcher

With the release of 6.4.1 we fixed an important distributed watch execution bug that could lead to watches being distributed twice or not at all. To ensure support is aware, we have written and marketed a knowledge base article.

Some customers push Watcher to its limits, and one area where this shows up is ineffective caching. We have implemented an optimization that detects templates that clearly do not require compilation, so they are returned immediately as is instead of being compiled and put into a cache.
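
The idea can be sketched in a few lines of Python. This is an illustration only, not Watcher's code: compile_template is a hypothetical stand-in for a real mustache compiler, and the check for "{{" is an assumed way of spotting templates that need no compilation.

```python
import functools


@functools.lru_cache(maxsize=100)
def compile_template(source):
    """Hypothetical stand-in for an expensive template compiler."""
    def render(params):
        out = source
        for key, value in params.items():
            out = out.replace("{{" + key + "}}", str(value))
        return out
    return render


def render_template(source, params):
    # No mustache tag: nothing to compile, so return the text as is
    # and leave the cache untouched.
    if "{{" not in source:
        return source
    return compile_template(source)(params)


print(render_template("static subject line", {}))
print(render_template("alert for {{host}}", {"host": "node-1"}))
print(compile_template.cache_info().currsize)  # only the second template was cached
```

Static strings never occupy a cache slot, so they cannot evict templates that are genuinely expensive to compile.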

Questions around removal of types

We have been working on the changes required in 7.0 to support the removal of types from the index APIs. This has raised a number of new discussions:

Replicated Closed Indices

Last week we revived the NoOpEngine (the engine that will be used by closed indices and has 0 memory overhead). We have also opened a PR to add entries in the routing table for shards of closed indices. The PR also adds logic to detect the transition between closed and open indices on the data nodes and uses it to replace existing open shards (i.e., normal engines) with closed ones and vice versa. Moving forward, closing an index will require that all ongoing indexing to that index has finished, that the global checkpoint has fully caught up (something that happens quickly once indexing is stopped), and that all operations in the translogs are committed to Lucene.
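
The three preconditions for closing can be written down as a small check. The Python sketch below is purely illustrative: ShardState, its fields and close_blockers are hypothetical names, not Elasticsearch classes or APIs.

```python
from dataclasses import dataclass


@dataclass
class ShardState:
    """Hypothetical snapshot of a shard's state (illustrative only)."""
    in_flight_operations: int      # indexing operations still running
    max_seq_no: int                # highest sequence number assigned
    global_checkpoint: int         # highest seq no processed on all copies
    uncommitted_translog_ops: int  # translog ops not yet committed to Lucene


def close_blockers(shard):
    """Return the reasons why this shard cannot be closed yet."""
    blockers = []
    if shard.in_flight_operations > 0:
        blockers.append("ongoing indexing")
    if shard.global_checkpoint < shard.max_seq_no:
        blockers.append("global checkpoint not caught up")
    if shard.uncommitted_translog_ops > 0:
        blockers.append("uncommitted translog operations")
    return blockers


busy = ShardState(in_flight_operations=2, max_seq_no=41,
                  global_checkpoint=39, uncommitted_translog_ops=5)
idle = ShardState(in_flight_operations=0, max_seq_no=41,
                  global_checkpoint=41, uncommitted_translog_ops=0)
print(close_blockers(busy))
print(close_blockers(idle))  # an empty list means the shard is safe to close
```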

Zen2

Achieving strong consistency properties for Zen2 requires a reliable storage mechanism for cluster states. The current storage mechanism for cluster states does not atomically write out a full cluster state, providing atomicity guarantees only at the per-index level. We’ve decided to move forward with a simple model that extends the current cluster state storage approach, adding the capability for atomically writing out a full cluster state. Our step-by-step plan is recorded on a GitHub meta issue and we have already started to actively work on the first steps.
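
For readers unfamiliar with what "atomically writing out" a state means in practice, a common general-purpose technique is to write to a temporary file and then rename it over the target. The Python sketch below shows that technique only; it is not the Zen2 design, and write_state_atomically and the JSON format are assumptions made for illustration.

```python
import json
import os
import tempfile


def write_state_atomically(path, state):
    """Write state so that readers see the old file or the new one, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same filesystem),
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # force the bytes to disk before renaming
        os.replace(tmp_path, path)     # atomic swap of old and new file
    except BaseException:
        # Clean up the partial temporary file and re-raise the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as directory:
    target = os.path.join(directory, "cluster_state.json")
    write_state_atomically(target, {"version": 1, "indices": ["logs"]})
    write_state_atomically(target, {"version": 2, "indices": ["logs", "metrics"]})
    with open(target) as handle:
        print(json.load(handle)["version"])
```

A production implementation would also need to sync the containing directory and handle several files at once, which this sketch leaves out.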

Changes

Changes in 5.6:

  • Security: use default scroll keepalive #33639

Changes in 6.4:

  • Add a limit for graph phrase query expansion #34031
  • Fix AutoQueueAdjustingExecutorBuilder settings validation #33922

Changes in 6.5:

  • SQL: Internal refactoring of operators as functions #34097
  • ingest: correctly measure chained pipeline stats #33912
  • XContentBuilder to handle BigInteger and BigDecimal #32888
  • [Monitoring] Update beats version #34060
  • SQL: Prevent StackOverflowError when parsing large statements #33902
  • Bad regex in CORS settings should throw a nicer error #34035
  • [HLRC] Support for role mapper expression dsl #33745
  • Watcher: Reduce script cache churn by checking for mustache tags #33978
  • Fold EngineSearcher into Engine.Searcher #34082
  • Build DocStats from SegmentInfos in ReadOnlyEngine #34079
  • Calculate changed roles on roles.yml reload #33525
  • Unmapped aggs should not run pipelines if they delegate reduction #33528
  • Add nested and object fields to field capabilities response #33803
  • SQL: Fix query translation of GroupBy with Having #34010
  • Add minimal sanity checks to custom/scripted similarities. (backport) #33893
  • Clarify RemoteClusterService#groupIndices behaviour #33899
  • NETWORKING: Upgrade to Netty 4.1.29 #33984
  • [Monitoring] Add cluster metadata to cluster_stats docs #33860
  • SQL: functions docs update #34000
  • SQL: Move away internally from JDBCType to SQLType #33913
  • Propagate auto_id_timestamp in primary-replica resync #33964
  • add elasticsearch-shard tool to 6.x #33848

Changes in 7.0:

  • Scripting: Reflect factory signatures in painless classloader #34088
  • Search: Simply SingleFieldsVisitor #34052

Apache Lucene

Spans are hard

Getting span_near queries right is hard. There might be multiple spans at the same start position, and they might have different lengths, which causes the current span_near implementation to sometimes omit some combinations. This issue has been open for a long time, but a large patch (10k new lines) was recently submitted. This might trigger interesting discussions regarding spans vs. intervals, as intervals only expose minimum intervals, which helps avoid a number of these problems.
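
To make "minimum intervals" concrete: an interval is minimal if it does not contain any other matching interval. The Python sketch below filters a list of (start, end) pairs down to the minimal ones. It is a toy illustration with inclusive positions and a hypothetical function name, not Lucene's implementation.

```python
def minimal_intervals(intervals):
    """Keep only intervals that do not contain another, different interval."""
    unique = sorted(set(intervals))
    result = []
    for candidate in unique:
        contains_other = any(
            other != candidate
            and candidate[0] <= other[0]
            and other[1] <= candidate[1]
            for other in unique
        )
        if not contains_other:
            result.append(candidate)
    return result


# (0, 2) and (0, 5) start at the same position but have different lengths.
matches = [(0, 2), (0, 5), (3, 4), (1, 6)]
print(minimal_intervals(matches))  # [(0, 2), (3, 4)]
```

Because only one interval survives per start position, the case of several matches sharing a start but differing in length does not arise.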

Other

Adoptme

We believe that sorting an index by norm might help the new block-max-WAND optimizations perform better. It would be interesting to introduce the ability to sort by norm at index-time and see whether it improves query performance compared to an index in a random order.