Elasticsearch Resiliency Status

Overview
Data Store Recommendations
Work in Progress
Completed

Overview

The team at Elasticsearch is committed to continuously improving both Elasticsearch and Apache Lucene to protect your data. As with any distributed system, Elasticsearch is complex and has many moving parts, each of which can encounter edge cases that require proper handling. Our resiliency project is an ongoing effort to find and fix these edge cases. If you want to follow this project on GitHub, see our issues list under the tag resiliency.

While GitHub is great for sharing our work, an issues list makes it difficult to get an overview of the current state of affairs and of the work that has already been done. This page provides an overview of all the resiliency-related issues that we are aware of, the improvements that have already been made, and current in-progress work. We’ve also listed some historical improvements throughout this page to provide the full context.

For more on how we approach ensuring resiliency in Elasticsearch, see Igor Motov’s talk Improving Elasticsearch Resiliency.

You may also be interested in our blog post Resiliency in Elasticsearch, which details our thought processes when addressing resiliency in both Elasticsearch and the work our developers do upstream in Apache Lucene.

Work in Progress

Known Unknowns (STATUS: ONGOING)

Jepsen Tests

Better request retry mechanism when nodes are disconnected (STATUS: ONGOING)

OOM resiliency (STATUS: ONGOING)

Relocating shards omitted by reporting infrastructure (STATUS: ONGOING)

Documentation of guarantees and handling of failures (STATUS: ONGOING)

Run Jepsen (STATUS: ONGOING)

Completed

Documents indexed during a network partition cannot be uniquely identified (STATUS: DONE, v7.0.0)

Replicas can fall out of sync when a primary shard fails (STATUS: DONE, v7.0.0)

Repeated network partitions can cause cluster state updates to be lost (STATUS: DONE, v7.0.0)

Divergence between primary and replica shard copies when documents deleted (STATUS: DONE, v6.3.0)

Port Jepsen tests dealing with loss of acknowledged writes to our testing framework (STATUS: DONE, v5.0.0)

Loss of documents during network partition (STATUS: DONE, v5.0.0)

Safe primary relocations (STATUS: DONE, v5.0.0)

Do not allow stale shards to automatically be promoted to primary (STATUS: DONE, v5.0.0)

Make index creation resilient to index closing and full cluster crashes (STATUS: DONE, v5.0.0)

Use two phase commit for Cluster State publishing (STATUS: DONE, v5.0.0)

Wait on incoming joins before electing local node as master (STATUS: DONE, v2.0.0)

Mapping changes should be applied synchronously (STATUS: DONE, v2.0.0)

Add per-segment and per-commit ID to help replication (STATUS: DONE, v2.0.0)

Write index metadata on data nodes where shards allocated (STATUS: DONE, v2.0.0)

Better file distribution with multiple data paths (STATUS: DONE, v2.0.0)

Lucene checksums phase 3 (STATUS: DONE, v2.0.0)

Report shard-level statuses on write operations (STATUS: DONE, v2.0.0)

Take filter cache key size into account (STATUS: DONE, v2.0.0)

Ensure shard state ID is incremental (STATUS: DONE, v1.5.1)

Verification of index UUIDs (STATUS: DONE, v1.5.0)

Disable recovery from known buggy versions (STATUS: DONE, v1.5.0)

Upgrade 3.x segments metadata on engine startup (STATUS: DONE, v1.5.0)

Prevent setting minimum_master_nodes to more than the current node count (STATUS: DONE, v1.5.0)

Simplify and harden shard recovery and allocation (STATUS: DONE, v1.5.0)

Prevent use of known-bad Java versions (STATUS: DONE, v1.5.0)

Make recovery be more resilient to partial network partitions (STATUS: DONE, v1.5.0)

Improving Zen Discovery (STATUS: DONE, v1.4.0.Beta1)

Lucene checksums phase 2 (STATUS: DONE, v1.4.0.Beta1)

Don’t allow unsupported codecs (STATUS: DONE, v1.4.0.Beta1)

Use checksums to identify entire segments (STATUS: DONE, v1.4.0.Beta1)

Fix “Split Brain can occur even with minimum_master_nodes” (STATUS: DONE, v1.4.0.Beta1)

Translog Entry Checksum (STATUS: DONE, v1.4.0.Beta1)

Request-Level Memory Circuit Breaker (STATUS: DONE, v1.4.0.Beta1)

Doc Values (STATUS: DONE, v1.4.0.Beta1)

Index corruption when upgrading Lucene 3.x indices (STATUS: DONE, v1.4.0.Beta1)

Improve error handling when deleting files (STATUS: DONE, v1.4.0.Beta1)

Using Lucene Checksums to verify shards during snapshot/restore (STATUS: DONE, v1.3.3)

Rare compression corruption during shard recovery (STATUS: DONE, v1.3.2)

Safer recovery of replica shards (STATUS: DONE, v1.3.0)

Using Lucene Checksums to verify shards during recovery (STATUS: DONE, v1.3.0)

Detect File Corruption (STATUS: DONE, v1.3.0)

Network disconnect events could be lost, causing a zombie node to stay in the cluster state (STATUS: DONE, v1.3.0)

Other fixes to Lucene to address resiliency (STATUS: DONE, v1.3.0)

Backwards Compatibility Testing (STATUS: DONE, v1.3.0)

Full Translog Writes on all Platforms (STATUS: DONE, v1.2.2 and v1.3.0)

Lucene Checksums (STATUS: DONE, v1.2.0)

Detect errors faster by locally failing a shard upon an indexing error (STATUS: DONE, v1.2.0)

Snapshot/Restore API (STATUS: DONE, v1.0.0)

Circuit Breaker: Fielddata (STATUS: DONE, v1.0.0)

Use of Paginated Data Structures to Ease Garbage Collection (STATUS: DONE, v1.0.0 & v1.2.0)

Dedicated Master Nodes Resiliency (STATUS: DONE, v1.0.0)

Multi Data Paths May Falsely Report Corrupt Index (STATUS: DONE, v1.0.0)

Randomized Testing (STATUS: DONE, v1.0.0)

Lucene Loses Data On File Descriptors Failure (STATUS: DONE, v0.90.0)