This Week in Elasticsearch and Apache Lucene - 2018-10-26

Elasticsearch

Cross Cluster Replication (CCR)

We added docs for the CCR APIs, simplified some of them, and aligned their terminology more closely with what we use for Cross Cluster Search. We also added a new byte-based limit for the write buffer that’s used on the follower to temporarily hold the documents that need to be pushed to the follower shards. The previous limit was based on the count of documents in the buffer, which made it difficult to come up with a good default because memory usage then depended on the size of the documents. When the new size limit is hit (512MB by default), fetching changes from the leader cluster is paused to give the follower shards a chance to catch up with the documents currently in the buffer.
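The pause-and-resume behaviour described above can be illustrated with a small sketch. This is not the CCR implementation: the class name `ByteBoundedBuffer` and its methods are hypothetical, and the limit is tiny so that the example is easy to follow.

```python
from collections import deque


class ByteBoundedBuffer:
    """Toy follower-side write buffer limited by total bytes, not document count."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._docs = deque()

    def can_fetch(self):
        # Fetching from the leader is paused once the byte limit is reached.
        return self.used_bytes < self.max_bytes

    def add(self, doc):
        # doc is the raw document as bytes; its length is what counts.
        self._docs.append(doc)
        self.used_bytes += len(doc)

    def drain(self, max_docs):
        # Simulates the follower shards writing out up to max_docs documents.
        drained = []
        while self._docs and len(drained) < max_docs:
            doc = self._docs.popleft()
            self.used_bytes -= len(doc)
            drained.append(doc)
        return drained


buf = ByteBoundedBuffer(max_bytes=10)
buf.add(b"aaaa")      # 4 bytes
buf.add(b"bbbbbbbb")  # 8 bytes, total 12: over the limit
paused = not buf.can_fetch()
buf.drain(1)          # the follower writes one document, 8 bytes remain
resumed = buf.can_fetch()
```

Because the limit counts bytes, two large documents can pause fetching just as a thousand small ones would, which is what makes a single default workable across document sizes.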

We continued investigating the effect of concurrent connections on cross-region traffic. We have also measured the read load created by CCR on the leader cluster versus the write load created on the follower cluster, and studied the overhead that CCR has on the indexing rate of the leader cluster. The findings are summarized here. The coming weeks will focus on further benchmarking and testing, so that we can ship CCR with optimal defaults.

Improving indexing performance with selective flushing

Currently, when we perform a flush on a shard (that is, when we write the segments in memory to disk), we flush all segments in memory at the same time. This can lead to situations where some of the segments we write to disk are large (which is good) and others are small (which is not so good, as these will end up needing to be merged). We have raised a PR that changes this behaviour so that a flush only writes the largest segment in memory to disk. This should mean that the segments flushed to disk are generally larger and require fewer merges, which should ultimately increase indexing throughput. We will benchmark this change to ensure it has the effect that we are expecting.
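As a rough illustration of the difference between the two strategies, the sketch below selects which in-memory segments a flush would write. The function name, segment names, and sizes are all hypothetical, and the real selection logic in the PR may take other factors into account.

```python
def select_segments_to_flush(in_memory_segments, selective=True):
    """Return the names of in-memory segments a flush would write to disk.

    in_memory_segments maps a (hypothetical) segment name to its size in bytes.
    """
    if not in_memory_segments:
        return []
    if not selective:
        # Current behaviour as described: everything in memory is written.
        return sorted(in_memory_segments)
    # Proposed behaviour: only the largest segment is written.
    largest = max(in_memory_segments, key=in_memory_segments.get)
    return [largest]


segments = {"seg_a": 48_000_000, "seg_b": 2_000_000, "seg_c": 500_000}
flush_all = select_segments_to_flush(segments, selective=False)
flush_largest = select_segments_to_flush(segments)
```

With the selective strategy, the two small segments stay in memory and can keep growing instead of being written out as small segments that later need merging.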

More functionality for the SQL plugin

We have been working hard to implement new functionality in SQL. We have added support for value IN (v1, v2, v3...) expressions and implemented CONVERT, an alternative to CAST with a different syntax (required for ODBC). We are currently working on handling the NOT operator properly, which has some complex behaviour when mixed with certain other operators.

We have also been working on adding support for NULL values and IP fields, and merged a change that allows queries over multiple indices where the mappings are not identical between the indices (although they still need to be compatible).
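The interaction between IN and NULL is easy to get wrong, so here is a minimal sketch of how standard SQL three-valued logic evaluates an IN expression. It assumes that Elasticsearch SQL follows the standard semantics; the function `sql_in` is hypothetical and uses Python's None to stand for NULL.

```python
def sql_in(value, candidates):
    """Evaluate `value IN (candidates...)` with SQL three-valued logic.

    None stands for SQL NULL. Returns True, False, or None (unknown).
    """
    if value is None:
        # Comparing NULL with anything is unknown.
        return None
    saw_null = False
    for candidate in candidates:
        if candidate is None:
            saw_null = True
        elif candidate == value:
            return True
    # No match: the result is unknown if a NULL was in the list, else false.
    return None if saw_null else False


match = sql_in(2, [1, 2, 3])
no_match = sql_in(4, [1, 2, 3])
unknown = sql_in(4, [1, None])
```

A filter keeps only rows where the condition is true, so a row for which the result is unknown is dropped just like one for which it is false.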

Types Removal for 7.0

We have merged some PRs to make sure that our docs only use a _doc type, and to remove the now obsolete "type exists" query. There is ongoing discussion around how we migrate users from the typed APIs in 6.x to the typeless APIs in 7.0 in a way that is friendly both for existing users upgrading their clusters and for users deploying new clusters.

Changes in 5.6:

  • Dependencies: Upgrade to joda time 2.10 #34760
  • Add a limit for graph phrase query expansion #34061
  • [ML] Add missing return after calling listener #4470

Changes in 6.2:

  • Make accounting circuit breaker settings dynamic #34372

Changes in 6.4:

  • Check self references in metric agg after last doc collection (#33593) #34001

Changes in 6.5:

  • Adding stack_monitoring_agent role #34369
  • SQL: Return error with ORDER BY on non-grouped. #34855
  • Don’t omit default values when updating routing exclusions (#32721) #33638
  • SQL: Allow min/max aggregates on date fields #34699
  • [HLRC] Add support for Delete role mapping API #34531
  • ingest: better support for conditionals with simulate?verbose #34155
  • ingest: processor stats #34724
  • SQL: Fix edge case: <field> IN (null) #34802
  • SQL: handle X-Pack or X-Pack SQL not being available in a more graceful way #34736
  • SQL: Introduce ODBC mode, similar to JDBC #34825
  • Improved IndexNotFoundException’s default error message #34649
  • NETWORKING: Add SSL Handler before other Handlers #34636
  • HLRC: Deactivate Watch API #34192
  • SQL: Introduce support for IP fields #34758
  • SQL: Fix queries with filter resulting in NO_MATCH #34812
  • SQL: Verifier allows aliases aggregates for sorting #34773
  • Fix inner_hits retrieval when stored fields are disabled #34652
  • SQL: Implement null handling for IN(v1, v2, ...) #34750
  • Add cluster-wide shard limit warnings #34021
  • Scripting: Add back params._source access in scripted metric aggs #34777
  • Test: Fix last reference to SearchScript #34731
  • SQL: Support pattern against compatible indices #34718
  • SQL: Implement IN(value1, value2, …) expression. #34581
  • SQL: Implement CONVERT, an alternative to CAST #34660
  • SQL: the SSL default configuration shouldn’t override the https protocol if used #34635
  • Handle missing user in user privilege APIs #34575
  • Security: don’t call prepare index for reads #34568
  • Fill LocalCheckpointTracker with Lucene commit #34474
  • HLRC: Delete role API #34620
  • BREAKING: Geo: Don’t flip longitude of envelopes crossing dateline #34535
  • SQL: Introduce support for NULL values (#34573) #34640
  • Fix completion suggester’s score tie-break #34508
  • BREAKING: HLRC XPack Protocol clean up: Licence, Misc #34469
  • [HLRC] Add Start/Stop Watch Service APIs. #34317
  • Check stemmer language setting early #34601

Changes in 6.6:

  • [Painless] Add instance bindings #34410
  • XContent: Check for bad parsers #34561
  • check for null argument is already done in splitStringByCommaToArray #34268

Changes in 7.0:

  • HLRC API for _termvectors #33447
  • Empty GetAliases authorization fix #34444
  • Core: Move IndexNameExpressionResolver to java time #34507
  • Deprecate type exists requests. #34663
  • SQL: Introduce support for NULL values #34573
  • BREAKING: Fix threshold frequency computation in Suggesters #34312
  • Remove hand-coded XContent duplicate checks #34588
  • Generate non-encrypted license public key #34626

Apache Lucene

Geo

  • We have merged all the building blocks required to handle BKD shapes in Lucene, and are now working on exposing this work to Elasticsearch.
  • We improved the LatLonShape encoding to speed up search at the cost of a bigger index.
  • We are chasing a complex bug in the new BKD-based shape field.

Snowball Arabic stemmer

We have contributed a Snowball Arabic stemmer that will also be available in Lucene 7.6.

The Arabic language needs good stemming for effective information retrieval. Most Arabic dictionaries arrange their entries alphabetically by root, meaning that to find the definition of a word, one must first find its root. This contribution extends the set of Snowball-based stemmers currently supported by Lucene by adding Arabic.

The Arabic Snowball stemmer was generated from https://github.com/snowballstem/snowball/blob/master/algorithms/arabic.sbl. The algorithm is inspired by LarkeyBC02 (https://dl.acm.org/citation.cfm?doid=564376.564425), with extended support for special cases as well as a normalization step before stemming.

Korean tokenization

  • The Korean dictionary used by Nori's tokenizer had some invalid rules with empty tokens and trailing spaces. These entries have been normalized (or removed).
  • The Korean tokenizer now considers the Hangul Letter Araea (interpunct) as a separator.
  • The Korean tokenizer should not split on compatible scripts boundaries.

Other

  • We switched back to a phrase query when building a graph phrase query with a slop greater than 0, in order to ensure consistency with other phrase queries.
  • Should we have a way to build unordered span queries in the QueryBuilder?
  • We are working on a type-safe implementation of the normalization of terms applied in token filters.
  • The word delimiter filter can produce backward offsets when used in combination with other filters.
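To make the backward-offsets issue in the last item more concrete, here is a small sketch of the property involved: the start offsets of the tokens in a stream are expected not to decrease from one token to the next. The helper `first_backward_offset` and the sample tokens are hypothetical and only illustrate the check; they do not reproduce the actual filter chain that triggers the bug.

```python
def first_backward_offset(tokens):
    """Return the index of the first token whose start offset is smaller than
    the previous token's start offset, or None if offsets never go backward.

    tokens is a list of (term, start_offset, end_offset) tuples.
    """
    last_start = 0
    for i, (_term, start, _end) in enumerate(tokens):
        if start < last_start:
            return i
        last_start = start
    return None


well_formed = [("wi", 0, 2), ("fi", 3, 5), ("router", 6, 12)]
broken = [("wi", 0, 2), ("router", 6, 12), ("fi", 3, 5)]

ok_result = first_backward_offset(well_formed)
bad_result = first_backward_offset(broken)
```

In the second stream the token "fi" starts at offset 3 after a token that started at offset 6, so the check reports index 2.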