This Week in Elasticsearch and Apache Lucene - 2019-03-15
We published a blog post about Zen2. As the blog title suggests, it is truly a new era for cluster coordination in Elasticsearch.
We also published docs for the “elasticsearch-node” command line tool, which allows you to unsafely recover a cluster after a majority of master-eligible nodes have been permanently lost.
Snapshot Lifecycle Management
We now have a feature branch for snapshot lifecycle management, named snapshot-lifecycle-management. It’s the first step towards having it as a feature and includes the framework for put/get/delete snapshot lifecycles.
We fixed a bug that made SQL report an incorrect error when a field in a hierarchy of sub-fields contained an array of values (as opposed to a single value). This was more challenging than just reporting the correct error type, because the code ignored the possibility that a sub-field could contain an array of values or other sub-fields. The bug was discovered while testing the ODBC driver with Kibana’s default sample data. The PR is not yet merged due to an infra bug that makes integration testing on Windows impossible.
We implemented support for Cloud ID. With this, a user will be able to simply use the string copied from the cloud admin interface to configure the connection to the cloud; the driver will then fill in the address of the Elasticsearch cluster, the port and connection security settings.
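To make the Cloud ID idea concrete, here is a minimal Python sketch of how such a string could be unpacked. It assumes the commonly described layout `<label>:<base64 of "host$es_id$kibana_id">`; that layout, the default port of 443 and the function name `decode_cloud_id` are assumptions for illustration and are not taken from the ODBC driver's code.

```python
import base64


def decode_cloud_id(cloud_id: str) -> dict:
    """Unpack a Cloud ID, ASSUMING the form '<label>:<base64(host$es_id$kibana_id)>'."""
    # The label is everything before the last colon; base64 text never contains ':'.
    label, _, encoded = cloud_id.rpartition(":")
    decoded = base64.b64decode(encoded).decode("utf-8")
    parts = decoded.split("$")
    if len(parts) < 2:
        raise ValueError("unexpected Cloud ID payload")
    host, es_id = parts[0], parts[1]
    # Assumption: the host part may carry an explicit port, otherwise HTTPS on 443.
    domain, _, port = host.partition(":")
    port = port or "443"
    return {"label": label, "es_url": f"https://{es_id}.{domain}:{port}"}
```

A caller would pass the string copied from the cloud admin interface and use the returned `es_url` as the connection address, which is roughly the convenience the driver now offers.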
We worked on introducing the SQL TIME data type, which strips the date part from a timestamp and lets users do various calculations on the time of day, for example:
SELECT * FROM t WHERE CAST(my_date AS TIME) < CAST('10:00:00' AS TIME)
SELECT count(*), CAST(my_date AS TIME) as time FROM t GROUP BY time
While implementing this, we also discovered and fixed a bug in how the JDBC driver handles timezones and java.sql.Date objects.
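The effect of the two TIME queries above can be mimicked in a few lines of Python. This is only a conceptual sketch: the sample rows and the helper names `before` and `count_by_time` are made up for illustration, and it says nothing about how Elasticsearch SQL evaluates the cast internally.

```python
from collections import Counter
from datetime import datetime, time

# Hypothetical sample values standing in for the my_date column.
rows = [
    datetime(2019, 3, 14, 9, 30, 0),
    datetime(2019, 3, 15, 9, 30, 0),
    datetime(2019, 3, 15, 11, 0, 0),
]


def before(values, cutoff):
    """Like WHERE CAST(my_date AS TIME) < cutoff: compare time of day only."""
    return [v for v in values if v.time() < cutoff]


def count_by_time(values):
    """Like GROUP BY CAST(my_date AS TIME): count rows per time of day."""
    return Counter(v.time() for v in values)
```

Dropping the date means that 09:30 on two different days falls into the same group, which is the point of the new type.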
We opened a PR for adding the ST_Distance function (e.g. the distance between two points). We also fixed a bug in the handling of timestamps in aggregations that we stumbled upon while working on ST_Distance.
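As a rough illustration of what a distance between two geo points means, the sketch below uses the haversine formula on a spherical Earth. The source does not say which formula ST_Distance uses, so the haversine choice, the radius constant and the name `haversine_m` are assumptions for illustration only.

```python
import math

# Assumption: mean Earth radius in meters on a spherical model.
EARTH_RADIUS_M = 6371008.8


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot for antipodal points.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))
```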
Test Test Test!!
We continued work to add tests to our apps. This week the effort was focused on adding API integration tests for rollups. We also have a PR open that moves testbed utils to a common folder, x-pack/test_utils, so that other apps can use them. We’ve also been performing a lot of work testing upgrades of the stack to 7.0.
We opened a PR that concludes the work on minimizing round-trips for CCS requests. This is the last step mentioned in the meta issue about comparing output from the two “execution modes”. The new integration tests have already caught a couple of issues: the error upper bound may be wrong when performing incremental reductions, and dfs_reduce_then_fetch gives different scoring when minimizing round-trips. Also, top_hits breaks ties differently, so results are sorted differently when hits have the same score. Luckily these are all minor, but they are good to know about and worth considering.
Doc Values for Geo Shapes
Now that BKD-based geoshapes are in, we want to start working on adding doc values to them. This will, in short, allow shapes to be aggregated upon (and there are lots of neat aggregations you could do on shapes). We started working on a POC of doc value support. The POC uses existing Lucene data structures, but the final implementation will use more custom and serialization-friendly data structures for ES. Now that the POC stage is done, the next step is to work on the actual implementation. We intend to start a new feature branch to contain the first release, which will bring parity with the geo-point aggregations. This will start with the new serializable edge-tree.
We added documentation for the CCR index following lifecycle, which for example includes instructions on how to fix a follower index that has fallen too far behind its leader index.
We fixed a rolling upgrade issue between 5.6 and 6.7 where the retention leases background sync was returning an illegal checkpoint.
Nightly Benchmarks for ML
We started working on setting up ML jobs and alerts for our nightly results to automatically help us detect anomalies. This important automation will make it much easier to catch any ML-related performance regressions.
We worked on two changes related to the types removal effort:
- The monitoring bulk API no longer emits deprecation warnings when a _type is provided. This might look surprising, but the monitoring bulk API reuses the parsing logic of the bulk API while using the _type in a way that has nothing to do with mapping types, so this warning was not relevant.
- The percolate query no longer requires a type when percolating an existing document.
Queryable Object Fields
We resumed the work on the queryable object field. We added some short-circuiting logic to prevent slow lookups on non-existent fields. We also started to work on adding doc values support for this field so that we can support sorting and simple aggregations like “terms” that make sense for keyword-style fields.
The Lucene release vote for 8.0.0 has passed and we are working on the next steps to wrap up this major release. The artifacts are already published and they are currently replicating on the different mirrors. We are planning to publicly announce the release on Friday March 15th. Elasticsearch 7 and higher has already been upgraded to the release artifacts.
We merged a patch that adds the ability to de-boost specific terms in the SynonymQuery. We plan to use this new feature in Elasticsearch to ensure that the original term gets a better score than its synonym expansions. We merged a community PR that changes how we match words in the Korean user dictionary. Instead of adding all the matches at each offset, we now select the longest match and discard the others. This allows adding compound terms to the user dictionary along with their de-compounded forms while ensuring that the compound term is prioritized. We’ve also worked on a possible solution to stop-words in the token graph, which surfaced a bug in FixedShingleFilter.
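The longest-match rule for the Korean user dictionary can be sketched as follows. This is a simplified linear scan over a plain set of strings, written only to show the selection rule; it is not how the Lucene analyzer stores or looks up its dictionary, and the name `longest_match_at` and the sample entries are hypothetical.

```python
def longest_match_at(text, offset, user_dict):
    """Return the longest dictionary entry that starts at `offset`, or None."""
    best = None
    for entry in user_dict:
        # Keep only the longest entry matching at this offset; shorter ones are discarded.
        if text.startswith(entry, offset) and (best is None or len(entry) > len(best)):
            best = entry
    return best


# Hypothetical dictionary holding a compound term and its de-compounded forms.
user_dict = {"세종", "세종시", "시"}
```

With this rule, a compound entry and its parts can live side by side in the dictionary, and the compound wins wherever it matches in full.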
Query Visitor API
A group of committers are iterating on the final API to allow efficiently walking query trees. We seem to be close and expect this to land in the next couple of days. Now the discussion about back-porting and backwards compatibility has started.
We fixed a bug where we sometimes missed the intersection between shapes when the crossing happens at a segment point, and added tests to the Line2D implementation, which is the object containing the spatial logic for lines. The support for Contains seems to be in good shape and, given the progress, we expect it to land by the end of next week.
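To show why crossings at a segment point are easy to miss, here is a textbook orientation-based segment intersection test in Python that treats touching at an endpoint as an intersection. It is a generic sketch, not the Line2D code; the function names are made up for illustration, and it uses plain floating-point arithmetic with no robustness tricks.

```python
def orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1 left turn, -1 right turn, 0 collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def on_segment(a, b, p):
    """True if p, already known to be collinear with a-b, lies within the segment's bounding box."""
    return (min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))


def segments_intersect(p1, p2, q1, q2):
    """True if segment p1-p2 and segment q1-q2 share at least one point."""
    o1, o2 = orient(p1, p2, q1), orient(p1, p2, q2)
    o3, o4 = orient(q1, q2, p1), orient(q1, q2, p2)
    # General case; it also covers an endpoint lying on the other segment,
    # because an orientation of 0 differs from a non-zero one.
    if o1 != o2 and o3 != o4:
        return True
    # Collinear cases: an endpoint of one segment lies on the other.
    if o1 == 0 and on_segment(p1, p2, q1):
        return True
    if o2 == 0 and on_segment(p1, p2, q2):
        return True
    if o3 == 0 and on_segment(q1, q2, p1):
        return True
    if o4 == 0 and on_segment(q1, q2, p2):
        return True
    return False
```

A stricter test that only accepts opposite non-zero orientations would report no intersection when one segment merely touches the other at a vertex, which is the kind of case the fix above concerns.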
We opened a PR to allow more fine-grained control over whether term dictionaries are RAM-resident or read off-heap. The change seems controversial but hasn’t seen many comments or iterations yet.
We are working on fixing a bug in ValueSource that can cause IndexSearcher references to leak.
We are still trying to figure out what we should do in case of an exception during IndexWriter#commit and friends.
Toke Eskildsen, a long-term member of the Lucene community, published a blog post about the changes that he made to the codec in order to speed up advancing on sparse doc values.
Changes in Elasticsearch
Changes in 7.1:
- Add a Painless Context REST API #39382
- Fix not Recognizing Disabled Object Mapper #39862
- Handle UTF-8 values in the keystore #39496
- GEO: Add support for z values to libs/geo classes #38921
- Ensure sendBatch not called recursively #39988
- Add flag to declare token filters as updateable #36103
- Avoid copying the field alias lookup structure unnecessarily. #39726
- Only connect to new nodes on new cluster state #39629
- Log missing file exception when failing to read metadata snapshot #32920
- Do not swallow exceptions in TimedRunnable #39856
Changes in 7.0:
- CCS: Disable minimizing round-trips when dfs is requested #40044
- Some elasticsearch-cli tools could not be run not from ES_HOME #39937
- Fix node tool cleanup #39389
- Upgrade to Lucene 8.0.0 #39992
- Do not log unsuccessful join attempt each time #39756
- Make the type parameter optional when percolating existing documents. #39987
- Add client jar for transport-nio #39860
- Change zone formatting for all printers #39568
- Remove types from internal monitoring templates and bump to api 7 #39888
- [DOCS] Fixes breaking changes for low level client #39930
Changes in 6.7:
- BREAKING: Stop returning cluster state size by default #40016
- Upgrade the bouncycastle dependency to 1.61 #40017
- Deprecation check for indices with very large numbers of fields #39869
- Enforce retention leases require soft deletes #39922
- plugins/repository-gcs: Update google-cloud-storage/core to 1.59.0 #39748
Changes in 6.6:
- SQL: Fix bug with JDBC timezone setting and DATE type #39978
- SQL: Extend the multi dot field notation extraction to lists of values #39823
- SQL: Wrap ZonedDateTime parameters inside scripts #39911
- SQL: ConstantProcessor can now handle NamedWriteable #39876
Changes in Elasticsearch Management UI
Changes in 7.1:
- [Rollup] Redux integration test job creation #32223
- [Rollup] Api integration test wildcard + search #32884
- [Rollup] API integration tests #32780
- Instrument Index Management with user action telemetry #32595
- Remove license restrictions from License Management. #33046
- Instrument Rollups CRUD app with user action telemetry #32534
- Fix ‘Create Rollup Index Pattern’ button badge color error #32954
- remove rollup section in advanced settings for OSS #32814
Changes in Elasticsearch SQL ODBC Driver
Changes in 7.1:
- Add support for Cloud ID #125
Changes in 6.6:
- Add dependencies reporting batch #126