This Week in Elasticsearch and Apache Lucene - 2018-10-26
Cross Cluster Replication (CCR)
We added docs for the CCR APIs, simplified some of the CCR APIs, and aligned the terminology used in the CCR APIs to be closer to what we have for Cross Cluster Search. We also added a new byte-based limit for the write buffer that’s used on the follower to temporarily hold the documents that need to be pushed to the follower shards. Until now, the limit was based on the number of documents in the buffer, which made it difficult to come up with a good default because memory usage depended on the size of the documents. When the new size limit is hit (512MB by default), fetching changes from the leader cluster is paused to give the follower shards a chance to catch up with the documents already in the buffer.
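The idea behind a byte-based limit can be sketched as follows. This is only an illustration, not the CCR implementation: the class name, method names and the tiny 10-byte limit used in the example are all hypothetical.

```python
from collections import deque


class ByteBoundedWriteBuffer:
    """Hypothetical follower-side buffer limited by total bytes, not document count."""

    def __init__(self, max_bytes=512 * 1024 * 1024):
        self.max_bytes = max_bytes
        self.buffered_bytes = 0
        self._docs = deque()

    def can_fetch(self):
        # Fetching from the leader is paused once the byte limit is reached.
        return self.buffered_bytes < self.max_bytes

    def add(self, doc: bytes):
        # Track memory by document size, so large documents count for more.
        self._docs.append(doc)
        self.buffered_bytes += len(doc)

    def drain(self, max_docs):
        """Hand up to max_docs documents to the follower shard, freeing space."""
        drained = []
        while self._docs and len(drained) < max_docs:
            doc = self._docs.popleft()
            self.buffered_bytes -= len(doc)
            drained.append(doc)
        return drained


# Tiny limit chosen only to make the behaviour visible.
buf = ByteBoundedWriteBuffer(max_bytes=10)
buf.add(b"123456")
fetch_allowed_before = buf.can_fetch()    # below the limit
buf.add(b"7890")
fetch_allowed_at_limit = buf.can_fetch()  # limit reached, fetching pauses
buf.drain(1)
fetch_allowed_after = buf.can_fetch()     # follower caught up, fetching resumes
```

Because the limit tracks bytes rather than a document count, the same default behaves consistently whether documents are small or large.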
We continued investigating the effect of concurrent connections on cross-region traffic. We have also measured the read load created by CCR on the leader cluster vs. the write load created on the follower cluster, and studied the overhead that CCR has on the indexing rate of the leader cluster; the findings are summarized here. The coming weeks will focus on further benchmarking and testing, so that we can ship CCR with optimal defaults.
Improving indexing performance with selective flushing
Currently, when we perform a flush on a shard (that is, when we write the in-memory segments to disk), we flush all segments in memory at the same time. This can lead to situations where some of the segments we write to disk are large (which is good) and others are small (which is not so good, as these will end up needing to be merged). We have raised a PR which will change this behaviour so that a flush only writes the largest in-memory segment to disk. This should mean that the segments flushed to disk are generally larger and require fewer merges, which should ultimately increase indexing throughput. We will benchmark this change to ensure it has the effect that we are expecting.
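As a rough sketch of the selection step described above, the difference between the two strategies looks like this. The function name, the segment names and the sizes are hypothetical, and the real change lives inside the indexing code rather than in a standalone helper.

```python
def select_segments_to_flush(in_memory_segments, selective=True):
    """Return the names of the segments to write to disk on a flush.

    in_memory_segments maps a (hypothetical) segment name to its size in bytes.
    """
    if not in_memory_segments:
        return []
    if not selective:
        # Current behaviour: every in-memory segment is flushed.
        return sorted(in_memory_segments)
    # Proposed behaviour: only the largest in-memory segment is flushed.
    largest = max(in_memory_segments, key=in_memory_segments.get)
    return [largest]


segments = {"seg_a": 48_000_000, "seg_b": 2_000_000, "seg_c": 500_000}
flush_all = select_segments_to_flush(segments, selective=False)
flush_largest = select_segments_to_flush(segments)
```

With selective flushing, the small segments stay in memory and can keep growing before they are written, which is where the reduction in merges is expected to come from.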
More functionality for the SQL plugin
We have been working hard to implement new functionality in SQL. We have added value IN (v1, v2, v3...) expressions and implemented CONVERT, an alternative to CAST with a different syntax (required for ODBC), and we are currently working on handling the NOT operator properly, which has some complex behaviour when mixed with certain other operators.
We have also been working on adding support for NULL values and IP fields, and merged a change that allows queries over multiple indexes where the mappings are not identical between the indexes (although they still need to be compatible).
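To illustrate why NULL handling makes IN and NOT subtle, here is a small sketch of the three-valued logic that standard SQL prescribes, with Python's None standing in for NULL. The helper names are hypothetical, and whether Elasticsearch SQL matches the standard in every edge case is an assumption here, not something this sketch demonstrates.

```python
def sql_in(value, candidates):
    """Evaluate `value IN (candidates...)` with SQL three-valued logic.

    Returns True, False, or None (standing in for NULL / unknown).
    """
    if value is None:
        return None  # NULL compared with anything is unknown
    saw_null = False
    for candidate in candidates:
        if candidate is None:
            saw_null = True
        elif candidate == value:
            return True  # a definite match wins over any NULLs in the list
    # No match: unknown if a NULL could have matched, otherwise definitely false.
    return None if saw_null else False


def sql_not(result):
    """NOT under three-valued logic: NOT unknown is still unknown."""
    return None if result is None else not result


match = sql_in(1, [1, None])
unknown = sql_in(2, [1, None])
no_match = sql_in(2, [1, 3])
negated_unknown = sql_not(unknown)
```

The last case is the surprising one: `2 NOT IN (1, NULL)` is unknown rather than true, so a row filtered on it would not be returned.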
Types Removal for 7.0
We have merged some PRs to make sure that our docs only use a _doc type and also to remove the now obsolete "type exists" query. There is ongoing discussion around how we migrate users from the typed APIs in 6.x to the typeless APIs in 7.0 in a way that is friendly for both existing users upgrading their clusters and users deploying new clusters.
Changes in 5.6:
Dependencies: Upgrade to joda time 2.10 #34760
Add a limit for graph phrase query expansion #34061
[ML] Add missing return after calling listener #4470
Changes in 6.2:
Make accounting circuit breaker settings dynamic #34372
Changes in 6.4:
Check self references in metric agg after last doc collection (#33593) #34001
Changes in 6.5:
Adding stack_monitoring_agent role #34369
SQL: Return error with ORDER BY on non-grouped. #34855
Don’t omit default values when updating routing exclusions (#32721) #33638
SQL: Allow min/max aggregates on date fields #34699
[HLRC] Add support for Delete role mapping API #34531
ingest: better support for conditionals with simulate?verbose #34155
ingest: processor stats #34724
SQL: Fix edge case: <field> IN (null) #34802
SQL: handle X-Pack or X-Pack SQL not being available in a more graceful way #34736
SQL: Introduce ODBC mode, similar to JDBC #34825
Improved IndexNotFoundException’s default error message #34649
NETWORKING: Add SSL Handler before other Handlers #34636
HLRC: Deactivate Watch API #34192
SQL: Introduce support for IP fields #34758
SQL: Fix queries with filter resulting in NO_MATCH #34812
SQL: Verifier allows aliases aggregates for sorting #34773
Fix inner_hits retrieval when stored fields are disabled #34652
SQL: Implement null handling for IN(v1, v2, ...) #34750
Add cluster-wide shard limit warnings #34021
Scripting: Add back params._source access in scripted metric aggs #34777
Test: Fix last reference to SearchScript #34731
SQL: Support pattern against compatible indices #34718
SQL: Implement IN(value1, value2, …) expression. #34581
SQL: Implement CONVERT, an alternative to CAST #34660
SQL: the SSL default configuration shouldn’t override the https protocol if used #34635
Handle missing user in user privilege APIs #34575
Security: don’t call prepare index for reads #34568
Fill LocalCheckpointTracker with Lucene commit #34474
HLRC: Delete role API #34620
BREAKING: Geo: Don’t flip longitude of envelopes crossing dateline #34535
SQL: Introduce support for NULL values (#34573) #34640
Fix completion suggester’s score tie-break #34508
BREAKING: HLRC XPack Protocol clean up: Licence, Misc #34469
[HLRC] Add Start/Stop Watch Service APIs. #34317
Check stemmer language setting early #34601
Changes in 6.6:
[Painless] Add instance bindings #34410
XContent: Check for bad parsers #34561
check for null argument is already done in splitStringByCommaToArray #34268
Changes in 7.0:
HLRC API for _termvectors #33447
Empty GetAliases authorization fix #34444
Core: Move IndexNameExpressionResolver to java time #34507
Deprecate type exists requests. #34663
SQL: Introduce support for NULL values #34573
BREAKING: Fix threshold frequency computation in Suggesters #34312
Remove hand-coded XContent duplicate checks #34588
Generate non-encrypted license public key #34626
- We have merged all the building blocks required to handle BKD shapes in Lucene, and are now working on exposing this work to Elasticsearch.
- We improved the LatLonShape encoding to speed up search at the cost of a bigger index.
- We are chasing a complex bug in the new BKD-based shape field.
Snowball Arabic stemmer
We have contributed a Snowball Arabic stemmer that will also be available in Lucene 7.6.
The Arabic language needs good stemming for effective information retrieval. Most Arabic dictionaries arrange their entries alphabetically by root, meaning that to find the definition of a word, one must find its root first. This contribution extends the set of Snowball-based stemmers currently supported by Lucene by adding Arabic.
The Arabic Snowball algorithm was generated from https://github.com/snowballstem/snowball/blob/master/algorithms/arabic.sbl. The algorithm is inspired by LarkeyBC02 (https://dl.acm.org/citation.cfm?doid=564376.564425), with extended support for special cases as well as a normalization step before stemming.
- The Korean dictionary used by Nori's tokenizer had some invalid rules with empty tokens and trailing spaces. These entries have been normalized (or removed).
- The Korean tokenizer now treats the Hangul Letter Araea (interpunct) as a separator.
- The Korean tokenizer should not split on compatible scripts boundaries.
- We switched back to phrase query when building a graph phrase query with a slop greater than 0 in order to ensure consistency with other phrase queries.
- Should we have a way to build unordered span queries in the QueryBuilder?
- We are working on a type-safe implementation of the normalization of terms applied in token filters.
- The word delimiter filter can produce backward offsets when used in combination with other filters.