This Week in Elasticsearch and Apache Lucene - 2019-07-26
We continued work on cancelling searches focusing on concurrency. We fixed several issues that prevented the channel from being released properly in the REST layer. We also added several tests to ensure that we're not leaking objects while checking whether channels are closed.
We finished implementing the functionality needed to handle results pinning in the query DSL. We are now adding the feature to the high level rest client in order to make it available for Java users.
We merged a PR to use
_source whenever possible when extracting field values in SQL. Retrieving field values from doc values is limited by default in Elasticsearch, so we need to use
_source as much as possible to avoid hitting the limit.
Simplifying peer recovery retention leases
Recovering a new or existing replica is done through peer recovery, the process of syncing the content of the replica with the primary shard copy. Depending on the data that already exists on the replica and the data that's available on the primary, this can be an incremental process. Operation-based (a.k.a. sequence-number based) recoveries allow a replica to catch up to the primary by replaying the history of operations that the replica missed while it was offline. This, however, requires the primary to retain said history in order to enable operation-based recoveries. Retention and replay of the history for operation-based recoveries is currently done by the translog, controlled through time- and size-based retention policies.
Replicas can also have operations that do not exist on the primary. Imagine a scenario where you had a primary with two replicas, and an operation was replicated to one of the replicas but not yet to the other one. With concurrent out-of-order replication, replicas can contain disjoint sets of operations at a given moment in time. If the primary crashes at that point, independently of which replica is promoted as primary, it might not have a superset of the operations on the replica. In order to allow the replica to be efficiently realigned with the primary through operation-based recovery (i.e. copying over the missing operations), Elasticsearch uses the notion of a safe commit. It is a durable (Lucene) snapshot of the data which only contains operations that are available on all in-sync shard copies, i.e., those shard copies that contain all acknowledged writes. As the primary is only chosen among the in-sync shard copies, a safe commit has therefore a subset of the operations on the primary. A replica will use its safe commit as a base state to realign itself with the primary. Safe commits are not explicitly created by the system, but are promoted out of regular Lucene commit points once a replica learns that the global checkpoint has advanced past all operations in the commit point. The global checkpoint is a safety marker, denoting that all operations up to this sequence number are durably stored on all in-sync shard copies. The safe commit also has an associated local checkpoint that denotes that all operations up to this sequence number have been applied to this Lucene snapshot. The snapshot might contain additional operations, however, but those are guaranteed to be below the replica's knowledge of the global checkpoint.
Today, when initiating a peer recovery, the replica uses the local checkpoint of the safe commit as starting point to request missing operations from the primary. The primary checks if, with the given starting sequence number, it can serve the full range of operations from its translog, depending purely on time- and size-based retention policies. With peer-recovery retention leases, we would like to track retention on a per-need basis. This is achieved by creating a retention lease in the system for each shard copy that tracks its starting sequence number that it would use in case where it had to initiate an operation-based peer recovery. Shards would then use these leases to decide on how much history to retain to enable operation-based recoveries of other shard copies. As the starting sequence number is based on the local checkpoint of the safe commit, this information would now need to be periodically exchanged in the system to update the retention leases.
We substantially simplified this system by adapting peer recoveries to recover replicas locally up to their knowledge of the global checkpoint first before performing peer recovery. Peer recoveries then use the global checkpoint as a starting sequence number. With this change, there is no need to broadcast information about the local checkpoint of the safe commit between shard copies. Instead, retention leases are updated based on the replicas knowledge of the global checkpoint, which is already tracked in the system to ensure that each replica has up-to-date information about the global checkpoint.
We opened a number of PRs to optimise memory usage in our transport implementations. This ranges from optimizing Netty frame decoding to removing unused fields as well as other optimizations for both StreamInput and StreamOutput classes.
We adapted the remote cluster connection facilities to asynchronously connect to remote clusters. This was the last missing piece to close out the resilience meta issue to avoid blocking threads waiting for connections.
SAML usability improvements
The Kibana team recently implemented a change to make it possible to explicitly configure the name of the Elasticsearch SAML realm in Kibana.
When a Kibana user attempts to authenticate via SAML, we need to decide which SAML realm to use. Until now Kibana indicated that by providing the URL on which it was listening. This works fine when Kibana is on localhost with the standard port, but would require very precise configuration for more complex setups, such as when Kibana was behind a proxy. Now, users just need to provide Kibana with the name of their ES SAML realm.
We also see a lot of SAML setup problems when there is a mismatch between the Entity ID configured in Elasticsearch and the Identity Provider (such as Okta). SAML uses URIs as identifiers, but people often think of them as URLs and assume that IDs such as
<http://example.com/> will be considered equal. We improved the error message and logging for this scenario to try and make it easier to understand what's going on.
We opened a PR to refactor GeoShapeQueryBuilder to derive from a new AbstractGeometryQueryBuilder class. This enables the upcoming XYShape query builders in the spatial xpack module to reuse the geometry parsing and query building logic currently exclusive to GeoShapeQueryBuilder.
We moved geo shape generation methods into the test framework and switched it to Lucene's more robust GeoTestUtil. This should allow us to generate realistic random geoshapes in other tests.
We fixed a bug in the auto interval date histo aggregation that was causing an incorrect number of buckets to be returned.
CLI tool JVM options
The CLI tools included with Elasticsearch (e.g. plugin or keystore tools) run with default JVM settings. Normally, a user would not run these tools at the same time as Elasticsearch, so those settings have not been a problem. However, with the introduction of the reload secure settings API, the keystore tool is now expected to be run while Elasticsearch is running. This can cause issues in container environments with restricted memory.
We introduced explicit JVM options for our CLI tools. In particular, the heap now starts at 4mb and maxes at 64mb. The
ES_JAVA_OPTS environment variable is respected, so they can be overriden if necessary. This also exposed an issue with how the plugin CLI install command loads the entire plugin into memory in order to calculate a checksum for verification, which we fixed by buffering the plugin contents.
When Elasticsearch is started as a systemd service, the system assumes it is ready immediately after forking. This can cause problems for downstream services which are launched before Elasticsearch is actually ready for requests. Systemd provides a mechanism,
sd_notify, for services to notify the system when they are ready. We added a module to our rpm and deb packages which invokes the sd_notify after Elasticsearch has finished starting up.
Refactoring: Streamable is gone
We have completed the removal of Streamable from the codebase, the culmination of a multi-year effort. All classes implementing Writeable now have a constructor that takes StreamInput. This was also completely backported to 7.4 to ease other backports.
Index templates UI
Alison submitted a PR that adds the details panel to the Index Templates UI. This continues the work she started a few weeks ago that added the list of Index Templates as a tab in the Index Management app in Kibana. Index Templates is a set of Elasticsearch APIs that allow users to automatically apply settings and mappings to new indices when they're created.
Snapshot and Restore UI
Jen submitted a PR that adds the ability for users to delete and execute their Snapshot Lifecycle Management (SLM) policies from the UI. This continues her work from last week that added a list of SLM policies and detail panel to the Snapshot and Restore app in Kibana. SLM policies allow users to automatically create backups of their data on a predefined schedule, but taking snapshots outside of that schedule is useful as well. By executing a policy, users can take a snapshot right away without having to redefine their configuration.
Lucene 8.2 is out
Lucene 8.2 was just announced and Elasticsearch is already on it. This release notably features a stemmer for Estonian, support for indexing cartesian geometries, better handling of duplicates with points, and faster conjunctions and phrase queries in some cases.
Better splitting in KD trees
Our support for BKD trees makes the assumption that all dimensions are independent, which is not always true, especially when used to index range fields or geo shapes. Removing this assumption could make some queries faster, but it always makes indexing significantly slower, so we need to discuss the trade-offs.
Off-heap index for KD trees
As you might know, Lucene recently gained the ability to move the terms index off-heap. We are now exploring doing the same with the index of our KD trees. Benchmarks suggest that moving the index off-heap doesn't cause any performance penalty, so we might load it off-heap all the time and save memory for all indices.
How to organize spatial-related code is proving controversial: should we have everything in a module or the main features in core and more esoteric features in a module?
JapaneseTokenizer allows corrupt user dictionaries, where concatenating segments doesn't give back the original surface form. This sometimes triggers errors at index time or when parsing synonyms.
DisjunctionMaxQuery now better leverages impacts for faster retrieval of the top hits.
When collecting multiple slices in parallel, could we still share some information between collectors so that they can reduce the amount of work they need to do?
Changes in Elasticsearch
Changes in 8.0:
- Fix cat recovery display of bytes fields #40379
- Fix stats in slow logs to be a escaped JSON #44642
- Deprecation messages with the same key but different x-opaque-id are allowed (#44587) #44587
- Remove support for old translog checkpoint formats #44272
Changes in 7.4:
- Add option to filter ILM explain response #44777
- Asynchronously connect to remote clusters #44825
- Add missing ZonedDateTime methods for joda compat layer #44829
- Add Clone Index API #44267
- Fix an NPE when requesting inner hits and _source is disabled. #44836
- SecurityIndexManager handle RuntimeException while reading mapping #44409
- [ML-DataFrame] Remove ID field from data frame indexer stats #44768
- [Geo] Refactor GeoShapeQueryBuilder to derive from AbstractGeometryQueryBuilder #44780
- Convert logging related gradle classes to java #44771
- Geo: deprecate ShapeBuilder in QueryBuilders #44715
- Fix URL documentation in API specs #44487
- SQL: use hasValue() methods from Elasticsearch's InspectionHelper classes #44745
- Switch from using docvalue_fields to extracting values from _source #44062
- Add CloseIndexResponse to HLRC #44349
- Remove stale permissions from untrusted policy #44783
- Notify systemd when Elasticsearch is ready #44673
- BREAKING: [ML] Improve response format of data frame stats endpoint #44350
- Support BucketScript paths of type string and array. #44694
- [ML][Data Frame] adding force delete #44590
- [ML][Data Frame] Add optional defer_validation param to PUT #44455
- Allow pending tasks before state recovery #44685
- Fix types field in JSON Search Slow Logs #44641
- Removal Streamable #44647
- Convert Transport Request/Response to Writeable #44636
- Convert direct implementations of Streamable to Writeable #44605
- [ML][Data Frame] adding dynamic cluster setting for failure retries #44577
- Remove StreamOutput #writeOptionalStreamable and #writeStreamableList #44602
- Convert remaining Action Response/Request to writeable.reader #44528
- Return seq_no and primary_term for noop update #44603
- Add types field to JSON slow logs in 7.x #44592
- Convert several direct uses of Streamable to Writeable #44586
Changes in 7.3:
- Fix issue with Gradle daemons hanging indefinitely on shutdown #44867
- Ignore unknown fields if overriding node metadata #44689
Changes in 7.2:
- Restore setting up temp dir for windows service #44541
Changes in 6.8:
- SQL: fix URI path being lost in case of hosted ES scenario #44776
- Clean up ShardFollowTasks for deleted indices #44702
- Add default CLI JVM options #44545
- Use system context for looking up connected nodes #43991
- Check shard limit after applying index templates #44619
- Do not checksum all bytes at once in plugin install #44649
Changes in Elasticsearch Hadoop Plugin
Changes in 7.4:
- Fix Lucene's snapshot version #1320
Changes in 7.3:
- [DOCS] Update anchors and links for Elasticserach API relocation. #1322
Changes in Rally
Changes in 1.3.0: