Canonical, Elastic, and Google team up to prevent data corruption in Linux | Elastic Blog
Engineering

Canonical, Elastic, and Google team up to prevent data corruption in Linux

At Elastic we are constantly innovating and releasing new features. As we release new features, we also work to make sure they are tested, solid, and reliable, and sometimes we do find bugs or other issues.

While testing a new feature, we discovered a Linux kernel bug affecting SSD disks on certain Linux kernels. In this blog article we cover the story of the investigation and how it turned into a great collaboration with two close partners, Google Cloud and Canonical.

The investigation resulted in releases of new Ubuntu kernels addressing the issue.

How it all started

Back in February 2019, we ran a 10-day stability test for cross-cluster replication (CCR) with the standard Elasticsearch benchmarking tool, Rally, using an update-heavy ingest workload. The Elasticsearch version was the latest commit from the then-unreleased 7.0 branch. After a number of days the test failed, with the Elasticsearch logs reporting:

… 
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)

Upon detecting[1] an index corruption, Elasticsearch will attempt to recover from a replica. In many cases the recovery is seamless and, unless the user reviews the logs, the corruption will go unnoticed. Even with the redundancy offered by replicas, data loss is possible under certain circumstances. Therefore, we wanted to understand why the corruption was occurring and explore how we could remedy the issue.

At first, we suspected that a new Elasticsearch feature, CCR/soft deletes, was causing the issue. This suspicion was quickly dismissed, however, when we found that we could reproduce the index corruption on a single cluster with the CCR/soft deletes feature disabled. After additional research, we found that we could also reproduce the issue in versions of Elasticsearch that preceded the beta launch of CCR/soft deletes, suggesting that CCR/soft deletes were not the cause of the index corruption.

We did note, however, that the testing and research that produced the index corruption had been conducted on Google Cloud Platform (GCP) using the Canonical Ubuntu Xenial image and local SSDs with the 4.15-0-*-gcp kernel. We could not reproduce the problem on a bare-metal environment with the same operating system and SSD disks (using either the 4.13 or 4.15 HWE kernels).

Our attention immediately turned to Elastic Cloud to get more data about the problem and its impact. In a parallel work stream we ruled out a number of suspicions by testing different filesystems, disabling RAID, and finally using a newer mainline kernel, none of which made the corruptions go away.

Digging further with Google Cloud Support

At this point, we created a set of simple scripts to make the issue easier to reproduce and engaged our partner Google to start looking deeper into possible environmental causes.

GCP support was able to reliably reproduce the issue using the provided scripts, with an example error message:

org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file 
truncated?): actual footer=-2016340526 vs expected footer=-1071082520 
(resource=BufferedChecksumIndexInput(MMapIndexInput(path="/home/dl/es/data/nodes/0/indices/IF1vmFH6RY-MZuNfx2IO4w/0/index/_iya_1e_Lucene80_0.dvd")))

During the timeframe when the issue occurred, IO throughput went above 160MB/s with 16,000 IOPS. This observation was consistent over several tests. As Elasticsearch (by default) uses memory-mapped files for some of its storage, we suspected that some file accesses were causing a higher rate of major page faults, resulting in an increase in IO operations to disk and ultimately triggering the issue. To reduce the rate of page faults, we tried increasing the memory of the GCP instances from 32GB to 52GB. With the increased memory, the issue no longer occurred, and IO throughput stayed at 50MB/s with 4,000 IOPS.
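
One way to observe this effect is to sample the major page fault counter the kernel exposes per process. The snippet below is a minimal sketch (not part of the original investigation) that reads the majflt field from /proc/PID/stat; the placeholder PID should be replaced with the Elasticsearch process id you want to watch.

```shell
# Sketch: print the major page fault count for a process.
# In /proc/PID/stat, after stripping the "pid (comm) " prefix,
# the 10th remaining field is majflt (major page faults).
pid=${1:-$$}   # placeholder: pass the Elasticsearch PID instead
stat=$(cat "/proc/$pid/stat")
maj_flt=$(echo "${stat##*) }" | awk '{print $10}')
echo "major page faults for pid $pid: $maj_flt"
```

Sampling this value periodically during a benchmark shows whether memory pressure is forcing mmap'd index files to be re-read from disk.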

The first breakthrough came when we observed a difference between GCP and the bare-metal environment: the GCP kernel had a feature called the multi-queue block layer (blk_mq) enabled, which the bare-metal kernel did not[2]. To complicate matters further, after a certain version[3] it is no longer possible to disable blk_mq[4] on the Ubuntu Linux -gcp image via kernel options. GCP Support showed us how to disable blk_mq by exporting GCP’s Ubuntu image and re-creating it without the VIRTIO_SCSI_MULTIQUEUE guest OS feature, which enables multi-queue SCSI[5].

The second breakthrough was managing to reproduce the corruption on bare metal: it became possible only by explicitly enabling blk_mq, even on an older kernel (4.13.0-38-generic). We also verified that NVMe disks do not exhibit this problem.

At this point we knew that the corruptions happen when both of the following conditions are met:

  • SSD drives using the SCSI interface (NVMe disks are unaffected)
  • blk_mq enabled

GCP Support shared two additional workarounds (on top of using only NVMe disks): increasing instance memory or creating custom instance images with multi-queue SCSI disabled.
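
For readers who want to check whether their own hosts match these two conditions, both can be inspected from sysfs. This is a hedged sketch assuming the standard sysfs paths; on kernels that predate the blk_mq module parameters, the files simply won't exist.

```shell
# Sketch: report whether this host matches the two risk conditions
# (SCSI disks + blk_mq enabled). Informational only.
for f in /sys/module/scsi_mod/parameters/use_blk_mq \
         /sys/module/dm_mod/parameters/use_blk_mq; do
  if [ -r "$f" ]; then
    printf '%s: %s\n' "$f" "$(cat "$f")"
  else
    printf '%s: not present on this kernel\n' "$f"
  fi
done
# SCSI disks appear as sd*, NVMe disks as nvme* (NVMe is unaffected)
ls /sys/block 2>/dev/null | grep -E '^(sd|nvme)' || echo "no sd*/nvme* block devices found"
```

A host is at risk only if use_blk_mq reports Y *and* its data disks are sd* devices.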

Canonical joins the effort

While we had some workarounds, we weren’t satisfied yet:

  • The issue wasn’t specific to Ubuntu images on GCP; it also happened on bare metal.
  • We didn’t know which kernel commit introduced the issue.
  • We didn’t know if a fix was already available in a newer kernel.

To address these points, we reached out to our partner Canonical to dig in a bit deeper.

Canonical started a large testing effort using the Elastic reproduction scripts, first confirming that the corruption did not occur on Ubuntu mainline kernels >=5.0 using SSD drives (using either none or mq-deadline multi-queue I/O schedulers).
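
The active I/O scheduler can be verified per block device through sysfs; in each scheduler file, the entry shown in square brackets is the one in use (for example "[mq-deadline] none" on a blk_mq kernel). A minimal sketch:

```shell
# Sketch: list the active I/O scheduler for each block device.
# The entry in [brackets] in each scheduler file is the active one.
for q in /sys/block/*/queue/scheduler; do
  if [ -r "$q" ]; then
    dev=${q#/sys/block/}
    printf '%s: %s\n' "${dev%/queue/scheduler}" "$(cat "$q")"
  fi
done
```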

The next step was to go backwards through kernel versions to find the minimum delta between a kernel that exhibits the corruption and one that doesn’t. Using multiple parallel test environments (since a whole test run can take up to five days), Canonical found that 4.19.8 is the first Ubuntu mainline kernel that includes the corruption fixes[6].

The missing backports for the 4.15.0 kernel and derivatives are described in Canonical’s bug tracker under LP#1848739, and more details can be found in this article and the kernel.org bug.

After Elastic and Canonical confirmed that a patched GCP kernel including all the necessary backports fixed the issue, the patches were merged into the main Ubuntu 4.15.0 kernel, and consequently all derivative kernels (including -gcp) received the fixes.

Conclusion

Elastic is committed to developing new Elastic Stack features that improve each of our three primary solutions. These efforts are supported by some seriously talented engineers and partners who are always vigilant so you don’t have to worry. If and when we do find issues during testing, know that Elastic and its network of close partners will leave no stone unturned to ensure you have the best possible experience.

Through our close collaboration with Google and Canonical, we were able to get to the bottom of the issue, which led to the release of the following fixed HWE Ubuntu kernels:

  • AWS: Starting with linux-aws - 4.15.0-1061.65 released on Feb 21, 2020
  • Azure: Starting with linux-azure - 4.15.0-1066.71 released on Jan 6, 2020
  • GCP: Starting with linux-gcp - 4.15.0-1053.57 released on Feb 5, 2020
  • Generic: Starting with linux - 4.15.0-88.88 released on Feb 17, 2020

Using the above or newer versions will avoid corruptions when SSD disks are used with SCSI blk_mq enabled.
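
To check whether a given host already runs a fixed kernel, you can compare the running version against the first fixed release for your flavour. The helper below is hypothetical (not an official tool) and relies on GNU sort -V for version ordering; the "4.15.0-1053" threshold is the -gcp example from the list above, so substitute the threshold for your own flavour.

```shell
# Hypothetical helper: true if version $1 >= version $2 (via GNU sort -V).
version_ge() {
  [ "$(printf '%s\n' "$1" "$2" | sort -V | tail -n1)" = "$1" ]
}

fixed="4.15.0-1053"                        # first fixed linux-gcp ABI (see list above)
running=$(uname -r | sed 's/-[a-z]*$//')   # e.g. "4.15.0-1060-gcp" -> "4.15.0-1060"
if version_ge "$running" "$fixed"; then
  echo "kernel $running includes the blk_mq corruption fixes"
else
  echo "kernel $running predates the fixes; consider upgrading"
fi
```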

If you don’t want to worry about whether your environment is protected from this data corruption, give our Elastic Cloud a spin: our users are already protected.

Footnotes

[1] Elasticsearch does not verify checksums all the time, as it’s an expensive operation. Certain actions may trigger checksum verification more frequently, such as shard relocation or while taking a snapshot, making underlying silent corruptions appear as if they were caused by these actions.

[2] To check whether blk_mq is in use for SCSI or device mapper disks, use cat /sys/module/{scsi_mod,dm_mod}/parameters/use_blk_mq

[3] After https://patchwork.kernel.org/patch/10198305, blk_mq is forced for SCSI devices and can’t be disabled via kernel options. This patch was backported to the Ubuntu linux-gcp kernel.

[4] To disable blk_mq, the following parameters need to be passed to the kernel, e.g. via grub: GRUB_CMDLINE_LINUX="scsi_mod.use_blk_mq=N dm_mod.use_blk_mq=N". Re-enabling can be done by setting these options to Y, but beware of [3].

[5] Example gcloud commands to disable VIRTIO_SCSI_MULTIQUEUE guestOS feature by recreating the Ubuntu image:

# gcloud compute images export --image-project ubuntu-os-cloud --image-family ubuntu-1604-lts --destination-uri=gs://YOUR_BUCKET/ubuntu_16.04.tar.gz 
# gcloud compute images create ubuntu-1604-test --family ubuntu-1604-lts --source-uri=gs://YOUR_BUCKET/ubuntu_16.04.tar.gz

[6] Backports

    - blk-mq: quiesce queue during switching io sched and updating nr_requests 
    - blk-mq: move hctx lock/unlock into a helper 
    - blk-mq: factor out a few helpers from __blk_mq_try_issue_directly 
    - blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback 
    - dm mpath: fix missing call of path selector type->end_io 
    - blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request 
    - blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy 
    - blk-mq: introduce BLK_STS_DEV_RESOURCE 
    - blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
    - blk-mq: don't queue more if we get a busy return 
    - blk-mq: dequeue request one by one from sw queue if hctx is busy 
    - blk-mq: issue directly if hw queue isn't busy in case of 'none' 
    - blk-mq: fix corruption with direct issue 
    - blk-mq: fail the request in case issue failure 
    - blk-mq: punt failed direct issue to dispatch list