Cross Cluster Replication
The primary focus last week was on benchmarking, specifically getting to the point where we can index at full speed into an index in one region while the follower index in another region keeps up. If you remember, in last week's update we reported indexing a 30GB data set that was fully replicated, but the run took ~11 hours to complete. After increasing the number of workers and the batch size, we can now do a full run in 1h10m, which is roughly the time it takes to index the data itself. From here we will proceed to formalize our benchmarking infrastructure. The plan is to introduce multiple data types, multiple platforms, nightly runs, and more.
On the API side, we have merged a PR to convert an existing index to a follower, and also added support for syncing the mappings between the leader and follower index. The latter is important for use cases like logging, where new fields can be introduced during indexing. Without syncing those fields, the follower can't index the new data coming from the leader.
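As a rough illustration, following a leader index via the REST API looks something like the sketch below. This reflects the general shape of the CCR follow API and may differ from the PR as merged; the cluster alias `us-east` and the index names are made up:

```json
PUT /logs-follower/_ccr/follow
{
  "remote_cluster": "us-east",
  "leader_index": "logs-leader"
}
```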
Improvements in Cross Cluster Search
We have merged a PR that ensures we do not use dedicated master nodes as the gateway node for the remote cluster. This is an important enhancement, as it ensures we do not stress remote master nodes by making them coordinate the remote part of a cross cluster search.
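For context, the gateway ("seed") nodes of a remote cluster are configured via cluster settings; a minimal sketch is below. The cluster alias and address are made up, and the `search.remote.*` settings namespace used at the time was later renamed to `cluster.remote.*`:

```json
PUT /_cluster/settings
{
  "persistent": {
    "search.remote.cluster_two.seeds": ["10.0.1.1:9300"]
  }
}
```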
Search Templates now throw 400 (bad request) on a ScriptException
We have merged a PR that changes the status code returned when a ScriptException is thrown from 500 (Internal Server Error) to 400 (Bad Request). This means that if a user has a bug in their script, we report that the request is invalid and should be corrected, rather than returning a 500, which would incorrectly indicate a server-side error.
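For example, a search template whose script is malformed (here, an unclosed mustache variable; the index name and template are hypothetical) would previously have produced a 500 and now produces a 400:

```json
POST /my-index/_search/template
{
  "source": "{ \"query\": { \"match\": { \"message\": \"{{query_string\" } } }",
  "params": { "query_string": "hello" }
}
```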
Work continues on the ODBC driver, which allows connectivity between BI tools like Tableau and Elasticsearch. This week we have been working on data conversion: date, time, and timestamp conversions have been implemented, and interval and GUID types are handled as well (with proper rejection codes provided).
Changes in 5.6:
- Fsync state file before exposing it #30929
Changes in 6.3:
Changes in 6.4:
- Harmonize include_defaults tests #30700
- Refactor Sniffer and make it testable #29638
- Deprecate accepting malformed requests in stored script API #28939
- REST high-level client: add synced flush API (2) #30650
- Reuse expiration date of trial licenses #30950
- Transport client: Don’t validate node in handshake #30737
- HLRest: Allow caller to set per request options #30490
- Deprecates indexing and querying a context completion field without context #30712
- Cross Cluster Search: do not use dedicated masters as gateways #30926
- Fix AliasMetaData#fromXContent parsing #30866
- Add Verify Repository High Level REST API #30934
- BREAKING: Include size of snapshot in snapshot metadata #29602
- Add missing_bucket option in the composite agg #29465
- stable filemode for zip distributions #30854
- Fix IndexTemplateMetaData parsing from xContent #30917
- Limit the scope of BouncyCastle dependency #30358
- Move list tasks API under tasks namespace #30906
- SQL: Remove the last remaining server dependencies from jdbc #30771
- Verify signatures on official plugins #30800
- Fix bad version check writing Repository nodes #30846
Changes in 7.0:
- Remove version read/write logic in Verify Response #30879
- BREAKING: Core: Remove RequestBuilder from Action #30966
- Add “took” timing info to response for _msearch/template API #30961
- Change ScriptException status to 400 (bad request) #30861
- BREAKING: Include size of snapshot in snapshot metadata #30890
- Change BWC version for VerifyRepositoryResponse #30796
Apache Lucene
Soft deletes were initially implemented on top of the Lucene index, but recent changes have tightened the integration with IndexWriter. We've started discussions about preventing changes to the field that is used to record soft deletes, and about recording the number of soft deletes in the segment infos.
- Fixed UAX29URLEmailTokenizer, which failed to detect URLs for top-level domains whose one-letter-shorter prefix is also a TLD.
- A merge simulator has been proposed; its goal is to help assess the efficiency of a merge policy and possibly help tune its parameters.
- Could Lucene efficiently boost based on dynamic features such as recency or geo distance?
- We've added an "unordered distinct" intervals source that forces matches to occur at distinct positions, so that a query for "a NEAR a" won't match a single occurrence of "a".
- We made SegmentReader.getSegmentInfo return a snapshot.
- Work on exposing the CompletionTokenStream as a more general-purpose ConcatenateGraphTokenStream that concatenates terms for all paths in the token graph.
- Explored using a more compact representation of live docs, but it also made query processing slower.
- Cleaned up the ALWAYS_CACHE caching policy, which should only be used for testing.
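To illustrate the live-docs trade-off mentioned above, here is a toy Python sketch (not Lucene's actual implementation) contrasting a dense bitset with a sparse set of deleted doc IDs. The sparse form is much smaller when few documents are deleted, but each liveness check becomes a hash lookup instead of a direct bit test:

```python
class DenseLiveDocs:
    """One bit per document; direct bit test on lookup."""

    def __init__(self, max_doc):
        # Start with every document marked live.
        self.bits = bytearray((max_doc + 7) // 8)
        for i in range(max_doc):
            self.bits[i // 8] |= 1 << (i % 8)

    def delete(self, doc):
        self.bits[doc // 8] &= ~(1 << (doc % 8))

    def is_live(self, doc):
        return bool(self.bits[doc // 8] & (1 << (doc % 8)))


class SparseLiveDocs:
    """Store only deleted doc IDs; lookup is a set membership test."""

    def __init__(self, max_doc):
        self.max_doc = max_doc
        self.deleted = set()

    def delete(self, doc):
        self.deleted.add(doc)

    def is_live(self, doc):
        return doc not in self.deleted


dense = DenseLiveDocs(1000)
sparse = SparseLiveDocs(1000)
for d in (3, 500):
    dense.delete(d)
    sparse.delete(d)

# Both representations agree on liveness.
assert not dense.is_live(3) and not sparse.is_live(3)
assert dense.is_live(4) and sparse.is_live(4)
```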