This Week In Elasticsearch and Apache Lucene - 2019-09-13

Highlights

Rally 1.3.0

Rally 1.3.0 was released. The most notable change is that Rally now stores the track and team revision in the metrics store, making it easier to reproduce benchmark results over longer periods of time even when tracks or Elasticsearch configurations change. This release also drops support for Elasticsearch 1.x.

Packaging

The next major release of macOS will require applications distributed outside the Mac App Store to be signed and notarized. As part of our effort towards signing and notarizing the components of Elasticsearch (the JDK, JNA, the ML binaries), we need a signed and notarized JDK. The Oracle OpenJDK that we are using is not signed and notarized, and it doesn't appear that it will be for the next major release of the JDK either. AdoptOpenJDK is a newer distribution of OpenJDK that we were already in the process of adding support for, starting with 7.4.0, and it turns out that AdoptOpenJDK already signs and notarizes its distributions. We therefore discussed and agreed to switch our bundled JDK to AdoptOpenJDK, and opened a PR this week.

Snapshots

We enhanced the documentation to cover support for the S3 One Zone-Infrequent Access storage class in the S3 repository plugin.
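
For example, the storage class can be chosen when registering the repository through the plugin's storage_class setting (a minimal sketch; the repository and bucket names are placeholders):

  PUT _snapshot/my_s3_repository
  {
    "type": "s3",
    "settings": {
      "bucket": "my-snapshot-bucket",
      "storage_class": "onezone_ia"
    }
  }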

We investigated issues with restoring large snapshots on Cloud. These failures were caused by S3 closing connections mid-download, which the S3 SDK treats as a fatal error and does not retry. We opened a PR to add retries for this scenario on top of the S3 SDK.
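
The idea behind the fix, sketched in minimal form (the names and structure here are illustrative, not the actual PR): when the stream breaks, re-open it at the current offset and resume the download instead of failing the whole restore.

  import java.io.IOException;
  import java.io.InputStream;

  interface BlobReader {
      /** Hypothetical ranged-read API: opens the blob starting at the given offset. */
      InputStream open(String blob, long offset) throws IOException;
  }

  final class RetryingDownload {
      /** Drains a blob of known length, resuming at the current offset when the connection drops. */
      static byte[] read(BlobReader reader, String blob, int length, int maxAttempts) throws IOException {
          byte[] buffer = new byte[length];
          int offset = 0;
          for (int attempt = 1; ; attempt++) {
              try (InputStream in = reader.open(blob, offset)) {
                  int n;
                  while (offset < length && (n = in.read(buffer, offset, length - offset)) != -1) {
                      offset += n;
                  }
                  if (offset == length) {
                      return buffer;
                  }
                  throw new IOException("stream ended prematurely at offset " + offset);
              } catch (IOException e) {
                  if (attempt >= maxAttempts) {
                      throw e;
                  }
                  // otherwise loop around and re-open the stream at the current offset
              }
          }
      }
  }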

We continued work on enhancing our integration tests for Cloud-provider-backed repositories and opened various PRs to create a mocked GCS endpoint and to enhance other mock HTTP endpoints, allowing us to test multiple failure scenarios and validate the retry-on-failure behaviour of the cloud providers' SDKs. This work was already used to add tests for the handling of broken connections during S3 blob downloads.
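
In spirit, such a mock endpoint looks something like this sketch (using the JDK's built-in HTTP server; the actual test fixtures are more elaborate): it breaks the connection a few times before finally serving the blob, so the SDK's retry behaviour can be asserted.

  import com.sun.net.httpserver.HttpServer;
  import java.net.InetSocketAddress;
  import java.nio.charset.StandardCharsets;
  import java.util.concurrent.atomic.AtomicInteger;

  public class FlakyBlobEndpoint {
      public static void main(String[] args) throws Exception {
          HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
          AtomicInteger failuresLeft = new AtomicInteger(3);
          byte[] blob = "blob-contents".getBytes(StandardCharsets.UTF_8);
          server.createContext("/bucket/blob", exchange -> {
              exchange.sendResponseHeaders(200, blob.length);
              if (failuresLeft.getAndDecrement() > 0) {
                  // Announce the full length but send only half of the body before
                  // closing, simulating S3 dropping the connection mid-download.
                  exchange.getResponseBody().write(blob, 0, blob.length / 2);
              } else {
                  exchange.getResponseBody().write(blob);
              }
              exchange.close();
          });
          server.start();
          System.out.println("listening on port " + server.getAddress().getPort());
      }
  }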

We continued work on adjusting the snapshot metadata format to speed up snapshots and limit the negative effects of S3's eventual consistency model on the S3 snapshot repository implementation.

G1GC

We investigated real-memory circuit breaker issues and found that nearly all of the issues reported on the Discuss forum come from deployments using G1GC. We tracked the problem down to sub-optimal G1GC defaults in our default jvm.options and opened a PR that adjusts them to run collections more aggressively, so that the real-memory circuit breaker threshold is not exceeded under moderate memory pressure. We also conducted extensive investigations into the performance impact of the proposed changes to the JVM defaults to validate that they would not degrade performance.
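
The adjustment boils down to two G1 settings in jvm.options, applied on JDK 10 and above (the values here are indicative; see the PR for the exact settings): a larger reserve of free heap, and an earlier trigger for concurrent collection cycles.

  ## reserve more free heap and start concurrent GC cycles earlier so that
  ## heap usage stays below the real-memory circuit breaker threshold
  10-:-XX:G1ReservePercent=25
  10-:-XX:InitiatingHeapOccupancyPercent=30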

Better storage of _source

We built a quick prototype that stores the schema and the data of documents in separate stored fields in order to help compression, as illustrated below. It already seems to help on geonames, but we want to run more tests, especially on adversarial cases like CSV-style content with random integers. This idea is complementary to storing top-level JSON fields in different stored fields: as we noticed last time we discussed _source compression, the Elastic Common Schema stores everything under objects, so splitting out top-level fields alone wouldn't help ECS users much. It also relates to a previous discussion about making ingestion faster by enabling users to send a bulk request that contains one schema and data for many documents, which would allow parsing field names and looking up field mappers only once for an entire batch of documents.
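
To illustrate the idea (a made-up example, not the prototype's actual encoding): the schema part repeats across similar documents and therefore compresses very well, leaving only the values for the general-purpose compressor.

  Original _source:  {"name": "Berlin", "location": {"lat": 52.52, "lon": 13.40}}

  Schema field:      {"name": string, "location": {"lat": double, "lon": double}}
  Data field:        ["Berlin", 52.52, 13.40]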

UI: Giving hugs by fixing bugs

The ES UI team dedicated the first week of the release cycle to fixing some of our top bugs, and this week we closed out a number of them. We'll make this a standard practice for each release to keep the bug backlog from growing too large.

Pivot in SQL

We introduced PIVOT-ing in SQL.

PIVOT is a popular transformation in BI tools: it rotates the values of a pivoting column into the columns of a statistics table (one of the most-voted issues in Kibana is about pivoting: https://github.com/elastic/kibana/issues/5049). For example:

  SELECT * FROM (SELECT browser, request_bytes, country FROM logs)
  PIVOT (AVG(request_bytes) FOR country IN ('NL', 'US', 'RO'))

     browser   |       NL        |       US        |       RO
  -------------+-----------------+-----------------+-----------------
  Chrome       |48396.28571428572|53216.28571428572|78353.39941364497
  IE           |47058.90909090909|34698.31544913364|85082.12264984912
  Mozilla      |49767.22342622222|46463.87613649497|43761.97631565941
  Other        |44103.90909090290|44323.10345673210|22134.11231234329

Geo

We opened a PR adding support for the new xy (cartesian) shape fields to SQL. We also started working on fixing geo_shape edge-case bugs that were uncovered during the geo_shape/shape refactoring: we fixed the handling of west-to-east linestrings crossing the antimeridian and are working on fixing the handling of very long linestrings.

We completed the initial implementation of spatial projection support, an X-Pack feature extension to the open source geo_shape field type. The following mapping example demonstrates indexing geospatial geometries projected in UTM Zone 14N, using the GeoJSON Coordinate Reference System (CRS) format:

"properties" : { "location" : { 'type" : "geo_shape", "crs": { "type": "name", "properties": { "name": "EPSG:32614" // UTM Zone 14N } } } }

The incoming CRS is handled in the X-Pack spatial extension plugin, which indexes the incoming UTM geometries using Lucene's new XYShape field without having to reproject to WGS84 (previously a requirement for all geo types). Queries are also handled appropriately and built based on the defined CRS.

Since the Maps application cannot yet visualize geospatial geometries in anything other than WGS84 Web Mercator, a new (prototype) geometry reprojection ingest processor has been written that enables users to reproject incoming documents to any supported coordinate reference system. In this manner users can index their data in its native coordinate reference system and, if it is not already in WGS84 Web Mercator, use the ingest processor to reproject it and index it in a separate field for visualization in the Maps app.
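
A pipeline using it might look something like this (the processor name and its options are hypothetical, since this is still a prototype):

  PUT _ingest/pipeline/reproject-to-wgs84
  {
    "processors": [
      {
        "reproject": {                      // hypothetical processor name
          "field": "location",
          "target_field": "location_wgs84",
          "source_crs": "EPSG:32614",       // UTM Zone 14N
          "target_crs": "EPSG:4326"         // WGS84
        }
      }
    ]
  }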

Snapshot Lifecycle Management (SLM)

SLM retention has been merged to master and 7.x (#46407). Retention is the next phase of Snapshot Lifecycle Management, allowing a user to specify how many snapshots to keep and for how long.
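
Put together with the scheduling side that shipped first, a policy might look something like this (illustrative names and values; the retention block is the new part):

  PUT _slm/policy/nightly-snapshots
  {
    "schedule": "0 30 1 * * ?",
    "name": "<nightly-snap-{now/d}>",
    "repository": "my_repository",
    "retention": {
      "expire_after": "30d",
      "min_count": 5,
      "max_count": 50
    }
  }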

What about Cloud and SLM?

Astute readers will notice that SLM is approaching the same capabilities as Cloud's snapshotting. 7.4, the initial release of SLM, allows users to schedule when snapshots occur and which repository they land in. Cloud does not give users direct access to its snapshots, so users that want more control can push their own snapshots to their own repository. This has always been possible, but required extra tooling (like Curator) to take snapshots on a schedule; with 7.4, users can schedule snapshots natively. Cloud takes a snapshot every 30 minutes, and these are usually fairly quick thanks to the incremental nature of snapshots. However, it should be noted that only one snapshot process can run in a cluster at any time, so there is a potential for Cloud-initiated snapshots to collide with SLM (or manual) snapshots. If this happens, the first one always wins and the second ends up in an error state. This isn't ideal, nor is it a new issue (though SLM potentially exacerbates it). The root cause of this potential collision is that the two snapshot orchestrations (Cloud and SLM) are not aware of each other.

Retention is the biggest remaining gap between Cloud snapshotting and SLM. Now that retention is stabilizing, this gap is going away and there is an opportunity for the two snapshot orchestrations to converge into one (SLM). There have been, and continue to be, discussions with Cloud to help ensure that SLM meets Cloud's requirements. The UI team is also working on a Kibana UI for the SLM retention piece.

Apache Lucene

Faster parallel search

After lots of discussions with other Lucene committers, a PR was opened to share information between threads in order to help skip non-competitive hits during concurrent search. Today, to avoid contention, each slice (thread) keeps track of its local competitive hits in a priority queue, and we merge the queues when all threads have terminated. This method is very effective, but it doesn't take advantage of the fact that a high score on one thread could help skip non-competitive hits on another thread. For this reason we're investigating a way to share the minimum score among threads in order to maintain a global minimum score. There are multiple ways to achieve this, and we're currently leaning towards a simple solution that tracks the minimum score as the maximum bottom score across all priority queues. This global minimum score can then be used locally by threads that haven't reached that level to skip non-competitive hits efficiently.
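
A minimal sketch of the shared-minimum-score idea (a simplification, not the actual PR): each slice publishes the bottom score of its full priority queue to a shared lock-free accumulator, and any slice can read back the maximum of the published bottoms as its minimum competitive score.

  import java.util.concurrent.atomic.LongAccumulator;

  /** Sketch: shares each slice's priority-queue bottom score across threads. */
  final class GlobalMinScore {
      // Keeps the maximum of all published bottom scores. Non-negative floats
      // preserve their ordering when compared via their raw bits, so the
      // accumulator can stay lock-free.
      private final LongAccumulator acc = new LongAccumulator(Long::max, Long.MIN_VALUE);

      /** Called by a slice whenever the bottom of its full priority queue moves up. */
      void publishBottomScore(float score) {
          acc.accumulate(Float.floatToIntBits(score));
      }

      /** The score a hit must beat to be competitive in any slice. */
      float minCompetitiveScore() {
          long bits = acc.get();
          return bits == Long.MIN_VALUE ? Float.NEGATIVE_INFINITY : Float.intBitsToFloat((int) bits);
      }
  }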

Faster nearest neighbor search

The old logic looked at the maximum distance at which a hit could still be competitive, computed a bounding box from it, and searched for hits within that bounding box. Ignacio optimized this logic to instead compute the minimum distance between the point of interest and a cell of the BKD tree, and to only look at the points within a cell if this minimum distance is competitive. This helps skip a few more cells than the previous approach.
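
The pruning test boils down to a point-to-rectangle minimum distance, along the lines of this simplified 2D sketch:

  /** Sketch: squared minimum distance from a query point to a BKD cell's bounding box. */
  static double minDistanceSquared(double qx, double qy,
                                   double minX, double maxX, double minY, double maxY) {
      // Clamp the query point to the cell; the distance to the clamped point
      // is the minimum distance to any point inside the cell.
      double dx = qx < minX ? minX - qx : (qx > maxX ? qx - maxX : 0);
      double dy = qy < minY ? minY - qy : (qy > maxY ? qy - maxY : 0);
      return dx * dx + dy * dy;
  }

  // A cell can be skipped entirely when minDistanceSquared(...) exceeds the
  // squared distance of the current k-th nearest hit.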

Other

Changes in Elasticsearch

Changes in 7.5:

  • Handle lower retaining seqno retention lease error #46420
  • [ILM] Add date setting to calculate index age #46561
  • Geo: fix indexing of west to east linestrings crossing the antimeridian #46601
  • SQL: Implement DATE_TRUNC function #46473
  • Deprecate _field_names disabling #42854
  • Update http-core and http-client dependencies #46549
  • Disable local build cache in CI #46505
  • Ensure rest api specs are always copied before using test classpath #46514
  • Refactor AllocatedPersistentTask#init(), move rollup logic out of ctor (Redux) #46444
  • Add retention to Snapshot Lifecycle Management #46407
  • Remove trailing comma from nodes lists #46484
  • Update mustache dependency to 0.9.6 #46243
  • Execute SnapshotsService Error Callback on Generic Thread #46277
  • Resolve the incorrect scroll_current when delete or close index #45226
  • [ML-DataFrame] improve error message for timeout case in stop #46131
  • Deprecate the "index.max_adjacency_matrix_filters" setting #46394

Changes in 7.4:

  • Upgrade to Gradle 5.6 #45005
  • Add support for tests.jvm.argline in testclusters #46540
  • Handle partial failure retrieving segments in SegmentCountStep #46556
  • Enforce realm name uniqueness #46253
  • Fallback to realm authc if ApiKey fails #46538
  • Fix highlighting for script_score query #46507
  • HLRC multisearchTemplate forgot params #46492
  • Fix SnapshotLifecycleMetadata xcontent serialization #46500
  • Fix the JVM we use for bwc nodes #46314
  • Ignore replication for noop updates #46458

Changes in 7.3:

  • SQL: Use null schema response #46386
  • SQL: fix scripting for grouped by datetime functions #46421
  • Update field-names-field.asciidoc #46430

Changes in 6.8:

  • Fix false positive out of sync warning in synced-flush #46576
  • Make reuse of sql test code explicit #45884
  • Add more meaningful keystore version mismatch errors #46291
  • SQL: Fix issue with common type resolution #46565
  • Fix class used to initialize logger in Watcher #46467
  • Always rebuild checkpoint tracker for old indices #46340

Changes in Elasticsearch Hadoop Plugin

Changes in 8.0:

  • BREAKING: Remove Scala 2.10 Support #1350
  • Upgrade to and fix compatibility with Gradle 5.5 #1333

Changes in 7.3:

  • [DOCS] Add 7.3.2 release notes #1352
  • [DOCS] Bump docs version from 7.3.1 to 7.3.2 #1353

Changes in 7.0:

  • [DOCS] Updates location of version attribute for Apache Hadoop Guide #1349

Changes in Elasticsearch SQL ODBC Driver

Changes in 7.5:

  • Add dropdown to select data encapsulation format #180

Changes in 7.3:

  • Bump version to 7.3.3 #183

Changes in Rally

Changes in 1.4.0:

  • Store track-related meta-data in results #771
  • Honor ingest-percentage for bulks #768
  • Remove merge times from command line report #767
  • Run a task completely even without time-periods #763

Changes in 1.3.0:

  • BREAKING: Remove MergeParts internal telemetry device #764

Changes in Rally Tracks

  • Adjust to new parameter source API #83