This Week In Elasticsearch and Apache Lucene - 2019-09-13
Highlights
Rally 1.3.0
Rally 1.3.0 was released. The most notable change is that Rally now stores the track and team revision in the metrics store, making it easier to reproduce benchmark results over longer periods of time when tracks or Elasticsearch configurations change. This release also drops support for Elasticsearch 1.x.
Packaging
The next major release of macOS will require applications distributed outside the Mac App Store to be signed and notarized. As part of our effort to sign and notarize the components of Elasticsearch (the JDK, JNA, ML binaries), we need a signed and notarized JDK. The Oracle OpenJDK build that we are using is not signed and notarized, and it does not appear that it will be for the next major release of the JDK either. AdoptOpenJDK is a newer distribution of OpenJDK that we were already in the process of adding support for starting with 7.4.0, and it turns out that AdoptOpenJDK already signs and notarizes its distributions. We therefore discussed and agreed to switch our bundled JDK to AdoptOpenJDK, and we opened a PR this week.
Snapshots
We enhanced the documentation to note that the S3 repository plugin supports the One Zone-Infrequent Access storage class.
We investigated issues with restoring large snapshots on Cloud. These failures were caused by S3 closing connections mid-download, which the S3 SDK treats as a fatal error and does not retry. We opened a PR to add retries for this scenario on top of the S3 SDK.
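The retry approach can be sketched as follows (illustrative Python, not the actual SDK wrapper): when the connection drops, resume reading from the last successfully received offset, up to a bounded number of attempts. The helper names here are hypothetical.

```python
import time

class ConnectionClosedError(IOError):
    """Stand-in for the SDK error raised when S3 drops a connection mid-download."""

def download_with_retries(read_chunk, max_retries=3, backoff_s=0.0):
    """Download a blob, resuming from the last received offset on dropped connections.

    `read_chunk(offset)` returns the next bytes starting at `offset`, or b"" at EOF.
    """
    data = bytearray()
    attempts = 0
    while True:
        try:
            while True:
                chunk = read_chunk(len(data))  # resume at the current offset
                if not chunk:
                    return bytes(data)
                data.extend(chunk)
        except ConnectionClosedError:
            attempts += 1
            if attempts > max_retries:
                raise  # give up after the configured number of retries
            time.sleep(backoff_s)

def make_flaky_source(blob, fail_at, chunk_size=4):
    """Simulated source that drops the connection once when the offset reaches `fail_at`."""
    state = {"failed": False}
    def read_chunk(offset):
        if not state["failed"] and offset >= fail_at:
            state["failed"] = True
            raise ConnectionClosedError()
        return blob[offset:offset + chunk_size]
    return read_chunk
```

The key point is that already-received bytes are kept, so a retry only re-requests the remainder of the blob.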
We continued work on enhancing our integration tests for Cloud provider backed repositories and opened various PRs to create a mocked GCS endpoint and to enhance other mock HTTP endpoints, allowing us to test multiple failure scenarios and validate the retry-on-failure behaviour of the cloud providers' SDKs. This work was already used to add tests for the handling of broken connections during S3 blob downloads.
We continued work on adjusting the snapshot metadata format to speed up snapshots and limit the negative effects of S3's eventual consistency model on the S3 snapshot repository implementation.
G1GC
We investigated real-memory circuit breaker issues and found that nearly all of the reported issues on the Discuss forum are from deployments using G1GC. We tracked the issue down to sub-optimal G1GC defaults in our default jvm.options and opened a PR to adjust them so that collections run more aggressively, ensuring the real-memory circuit breaker threshold is not exceeded under moderate memory pressure. We also conducted extensive investigations into the performance impact of the proposed JVM defaults to validate that they would not degrade performance.
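For illustration, the kind of change under discussion looks like the following jvm.options fragment; the exact flags and values are those of the proposal as we understand it and may differ in the final PR (the `10-:` prefix applies an option on JDK 10 and later):

```
## G1GC configuration (illustrative; final defaults may differ)
10-:-XX:+UseG1GC
# Reserve more free heap and start concurrent marking earlier, so that
# collections keep heap usage below the real-memory circuit breaker
# threshold under moderate memory pressure
10-:-XX:G1ReservePercent=25
10-:-XX:InitiatingHeapOccupancyPercent=30
```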
Better storage of _source
We built a quick prototype that stores the schema and the data of documents in separate fields to help compression. It already seems to help on geonames, but we want to conduct more tests, especially on adversarial cases such as CSV-style content with random integers. This idea is complementary to storing top-level JSON fields in different stored fields: as we noticed the last time we discussed source compression, the Elastic Common Schema (ECS) stores everything under objects, so storing top-level fields in separate stored fields alone would not help ECS users much. It also relates to a previous discussion on making ingestion faster by letting users send a bulk request that contains one schema and data for many documents, which would allow parsing field names and looking up field mappers only once for an entire batch of documents.
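A minimal sketch of the idea (illustrative Python, not the prototype's actual storage format): documents sharing a schema store the key strings only once, and each document then stores only its values, which compress better than repeated JSON keys.

```python
def split_doc(doc):
    """Split a flat JSON-like document into a schema (sorted field names)
    and a list of values. Documents sharing a schema need to store the
    key strings only once, which helps compression of the stored _source."""
    schema = tuple(sorted(doc))
    return schema, [doc[key] for key in schema]

def join_doc(schema, values):
    """Reassemble a document from its schema and values."""
    return dict(zip(schema, values))
```

Two documents with the same fields produce identical schemas, so only their value lists differ on disk.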
UI: Giving hugs by fixing bugs
The ES UI team dedicated the first week of the release cycle to fixing some of our top bugs. We'll make this a standard practice for each release to keep the bug backlog from growing too large. This week we fixed the following:
- We addressed various accessibility issues in Index Management, Index Lifecycle Policies, and Rollup Jobs. This is great for users who use screen readers and improves our compliance for our Federal customers.
- We also fixed a bug in Index Lifecycle Policies where a user was not able to successfully enable the “Move to warm phase on rollover” setting in edit mode. Five users reported being affected by this problem.
- We also fixed a bug in Watcher UI where “0” wasn’t permitted as a threshold value when creating a threshold watch.
- We acted on feedback from Dan Roscigno that the “Lifecycle phase” filter in Index Management didn’t behave as expected. After a healthy amount of discussion, we decided that we should remove the filter from the UI until we can implement a long-term solution.
- We updated Console autocomplete with support for distance_feature queries.
- We added logic that checks whether Cross-Cluster Replication is supported by the user's license and removes it if it's not. This aligns with the behavior of other apps in Kibana and addresses some complaints from users on the forums.
Pivot in SQL
We introduced PIVOT-ing in SQL.
A popular transformation in BI tools, PIVOT creates a statistics table around the pivoting column (one of the most-voted issues in Kibana is about pivoting: https://github.com/elastic/kibana/issues/5049):
SELECT * FROM
(SELECT browser, request_bytes, country FROM logs)
PIVOT (AVG(request_bytes) FOR country IN ('NL', 'US', 'RO'))
browser     |NL               |US               |RO
------------+-----------------+-----------------+-----------------
Chrome      |48396.28571428572|53216.28571428572|78353.39941364497
IE          |47058.90909090909|34698.31544913364|85082.12264984912
Mozilla     |49767.22342622222|46463.87613649497|43761.97631565941
Other       |44103.90909090290|44323.10345673210|22134.11231234329
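As a rough sketch of what the PIVOT query above computes (illustrative Python, not the SQL engine's implementation): rows are bucketed by (row key, pivot value) and the aggregate is computed per bucket.

```python
from collections import defaultdict

def pivot_avg(rows, row_key, pivot_key, value_key, pivot_values):
    """Toy equivalent of PIVOT (AVG(value_key) FOR pivot_key IN pivot_values):
    group rows by (row_key, pivot_key) and average value_key per bucket."""
    sums = defaultdict(lambda: [0.0, 0])  # (row, pivot) -> [total, count]
    for row in rows:
        if row[pivot_key] in pivot_values:  # values outside the IN list are dropped
            bucket = sums[(row[row_key], row[pivot_key])]
            bucket[0] += row[value_key]
            bucket[1] += 1
    table = {}
    for (rk, pk), (total, count) in sums.items():
        table.setdefault(rk, {})[pk] = total / count
    return table
```

Each distinct value in the IN list becomes a column of the result, keyed here by the pivot value.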
Geo
We opened a PR adding support for the new xy (cartesian) shape fields to SQL. We also started fixing geo_shape edge-case bugs that were uncovered during the geo_shape/shape refactoring, including the handling of west-to-east linestrings and of very long linestrings.
We completed the initial implementation for adding spatial projection support as an X-Pack feature extension to the open source geo_shape field type. The following mapping example demonstrates indexing geospatial geometries projected in UTM Zone 14N using the GeoJSON Coordinate Reference System (CRS) format.
"properties" : {
"location" : {
'type" : "geo_shape",
"crs": {
"type": "name",
"properties": {
"name": "EPSG:32614" // UTM Zone 14N
}
}
}
}
The incoming CRS is handled in the X-Pack Spatial extension plugin, which indexes the incoming UTM geometries using Lucene's new XYShape field without having to reproject to WGS84 (previously a requirement for all geo types). Queries are also appropriately handled and built based on the defined CRS.
Since the Maps application cannot yet visualize geospatial geometries in anything other than WGS84 Web Mercator, a new (prototype) geometry-reprojection ingest processor has been written that enables users to reproject incoming documents to any supported coordinate reference system. Users can thus index their data in its native coordinate reference system and (if it is not already in WGS84 Web Mercator) use the ingest processor to reproject and index it in a separate field for visualization in the Maps app.
Snapshot Lifecycle Management (SLM)
SLM retention has been merged to master and 7.x (#46407). Retention is the next phase of Snapshot Lifecycle Management, allowing a user to specify how many snapshots to keep and for how long.
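For example, a policy with retention might look like the following request (illustrative; policy name, schedule, and repository are made up, and field names follow the SLM API as we understand it):

```
PUT _slm/policy/nightly-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<nightly-snap-{now/d}>",
  "repository": "my_repository",
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}
```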
What about Cloud and SLM?
Astute readers will notice that SLM is approaching the same capabilities as Cloud's snapshotting. 7.4, the initial release of SLM, allows users to schedule when snapshots occur and choose which repository they land in. Cloud does not give users direct access to its snapshots, so users who want more control can push their own snapshots to their own repository. This has always been possible, but required extra tooling (like Curator) to take snapshots on a schedule; with 7.4, users can schedule snapshots natively. Cloud takes snapshots every 30 minutes, and they are usually fairly quick thanks to the incremental nature of snapshots. However, only one snapshot process can run in a cluster at any time, so Cloud-initiated snapshots can collide with SLM (or manual) snapshots. If this happens, the first one always wins and the second ends up in an error state. This isn't ideal, nor is it a new issue (though it is potentially exacerbated by SLM). The root cause of this potential collision is that the two snapshot orchestrators (Cloud and SLM) are not aware of each other.
Retention is the biggest remaining gap between Cloud snapshotting and SLM. Now that retention is stabilizing, this gap is going away, and there is an opportunity for the two snapshot orchestrators to converge into one (SLM). There are ongoing discussions with Cloud to help ensure that SLM meets Cloud's requirements. The UI team is also working on a Kibana UI for the SLM retention feature.
Apache Lucene
Faster parallel search
After many discussions with other Lucene committers, a PR was opened to share information between threads in order to skip non-competitive hits during concurrent search. Today, to avoid contention, each slice (thread) keeps track of its local competitive hits in a priority queue, and we merge the queues when all threads have terminated. This method is effective, but it does not take advantage of the fact that a high score on one thread could help skip non-competitive hits on another. For this reason we are investigating a way to share the minimum competitive score among threads. There are multiple ways to achieve this, and we are now leaning toward a simple solution that tracks the global minimum score as the maximum of the bottom scores of all priority queues. Threads that have not yet reached this level can then use the global minimum score locally to skip non-competitive hits efficiently.
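A minimal sketch of the shared-minimum-score idea (hypothetical Python class, not Lucene's actual implementation): each slice publishes the bottom score of its full top-k priority queue, and the shared minimum competitive score is the maximum of the published bottoms.

```python
import threading

class GlobalMinScore:
    """Sketch of sharing a minimum competitive score across search slices.

    Each slice publishes the bottom (worst) score of its full top-k
    priority queue; the shared minimum competitive score is the maximum
    of all published bottoms. A slice can then skip any hit whose score
    cannot beat this global minimum.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._min_competitive = float("-inf")

    def publish(self, local_bottom):
        """Publish a slice's bottom score; returns the current global minimum."""
        with self._lock:
            if local_bottom > self._min_competitive:
                self._min_competitive = local_bottom
            return self._min_competitive

    def is_competitive(self, score):
        """True if a hit with this score could still enter some top-k queue."""
        return score > self._min_competitive
```

Taking the maximum of the bottoms is safe: a hit below that value cannot displace anything in the queue that published it.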
Faster nearest neighbor search
The old logic used to look at the maximum distance that is required for a hit to be competitive, compute a bounding box from it, and search for hits within this bounding box. Ignacio optimized this logic to instead compute the minimum distance between the point of interest and a cell of the BKD tree, and only look at points within this cell if this minimum distance is competitive. This skips a few more cells than the previous approach.
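The cell-pruning test can be sketched as follows (illustrative Python, not the actual Lucene code): compute the minimum distance from the query point to a cell's bounding box, and visit the cell only if that distance beats the best distance found so far.

```python
def min_dist_to_cell(px, py, min_x, max_x, min_y, max_y):
    """Minimum Euclidean distance from point (px, py) to an axis-aligned
    BKD cell; 0.0 if the point lies inside the cell."""
    dx = max(min_x - px, 0.0, px - max_x)  # horizontal gap, 0 if overlapping
    dy = max(min_y - py, 0.0, py - max_y)  # vertical gap, 0 if overlapping
    return (dx * dx + dy * dy) ** 0.5

def should_visit(px, py, cell, best_dist):
    """Visit a cell only if it could contain a closer neighbor than the best so far."""
    return min_dist_to_cell(px, py, *cell) < best_dist
```

Any cell whose minimum distance already exceeds the current best neighbor distance can be skipped wholesale, along with every point inside it.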
Other
- We made WITHIN and DISJOINT queries faster on shapes.
- It is proposed that we add a CharFilter with the same capabilities as ICUTransformFilter so that dictionary-based tokenization becomes easier, by not having to worry, for instance, that some terms might contain a mix of simplified and traditional Chinese characters.
- We want to move XYRectangle2D to floats instead of doubles since coordinates are internally indexed as floats.
- We are reviving a pull request that adds CONTAINS support to shapes.
Changes in Elasticsearch
Changes in 7.5:
- Handle lower retaining seqno retention lease error #46420
- [ILM] Add date setting to calculate index age #46561
- Geo: fix indexing of west to east linestrings crossing the antimeridian #46601
- SQL: Implement DATE_TRUNC function #46473
- Deprecate _field_names disabling #42854
- Update http-core and http-client dependencies #46549
- Disable local build cache in CI #46505
- Ensure rest api specs are always copied before using test classpath #46514
- Refactor AllocatedPersistentTask#init(), move rollup logic out of ctor (Redux) #46444
- Add retention to Snapshot Lifecycle Management #46407
- Remove trailing comma from nodes lists #46484
- Update mustache dependency to 0.9.6 #46243
- Execute SnapshotsService Error Callback on Generic Thread #46277
- Resolve the incorrect scroll_current when delete or close index #45226
- [ML-DataFrame] improve error message for timeout case in stop #46131
- Deprecate the "index.max_adjacency_matrix_filters" setting #46394
Changes in 7.4:
- Upgrade to Gradle 5.6 #45005
- Add support for tests.jvm.argline in testclusters #46540
- Handle partial failure retrieving segments in SegmentCountStep #46556
- Enforce realm name uniqueness #46253
- Fallback to realm authc if ApiKey fails #46538
- Fix highlighting for script_score query #46507
- HLRC multisearchTemplate forgot params #46492
- Fix SnapshotLifecycleMetadata xcontent serialization #46500
- Fix the JVM we use for bwc nodes #46314
- Ignore replication for noop updates #46458
Changes in 7.3:
- SQL: Use null schema response #46386
- SQL: fix scripting for grouped by datetime functions #46421
- Update field-names-field.asciidoc #46430
Changes in 6.8:
- Fix false positive out of sync warning in synced-flush #46576
- Make reuse of sql test code explicit #45884
- Add more meaningful keystore version mismatch errors #46291
- SQL: Fix issue with common type resolution #46565
- Fix class used to initialize logger in Watcher #46467
- Always rebuild checkpoint tracker for old indices #46340
Changes in Elasticsearch Hadoop Plugin
Changes in 7.0:
- [DOCS] Updates location of version attribute for Apache Hadoop Guide #1349
Changes in Elasticsearch SQL ODBC Driver
Changes in 7.5:
- Add dropdown to select data encapsulation format #180
Changes in 7.3:
- Bump version to 7.3.3 #183
Changes in Rally
Changes in 1.4.0:
- Store track-related meta-data in results #771
- Honor ingest-percentage for bulks #768
- Remove merge times from command line report #767
- Run a task completely even without time-periods #763
Changes in 1.3.0:
- BREAKING: Remove MergeParts internal telemetry device #764
Changes in Rally Tracks
- Adjust to new parameter source API #83