08 August 2016 Engineering

Elasticsearch: Verifying Data Integrity with External Data Stores

By Chris Earle

Elasticsearch is sometimes used alongside other databases. In those scenarios, it's often hard to implement solutions surrounding two-phase commits due to the lack of transaction support across all systems in play. Depending on your use case, it may or may not be necessary to verify the data exists in both data stores, where one serves as the so-called "source of truth" for the other.

As Support Engineers at Elastic, we frequently see requests that ask things like the best way to structure your data for verification as well as simply how to do verification efficiently. This blog post will walk through a few examples that we have seen and helped to create, ranging from simple to advanced, that verify Elasticsearch contains the necessary data from a database like PostgreSQL.

Modeling Data for Verification

The way that your data is stored, both in and out of Elasticsearch, can make a big difference in the difficulty of verification. The first thing that you need to decide is how much needs to be verified:

  • Is the mere existence of a document enough?
  • Or does an internal process require you to verify entire documents?

That decision has ramifications on the amount of effort required to perform verification.

Evolution of Verifying Existence

Fortunately this is not a deep, philosophical question. Instead, it's simply the question: "do all of the documents exist in Elasticsearch?"

Elasticsearch offers a lot of ways to verify existence, as long as you fully understand what is happening. Remember that Elasticsearch searches in near real time, but it can get documents directly in real time. This means that, right after indexing a document, it may not be visible to searches, but it will be available to any direct get request.

One By One

For small scale deployments, the simplest approach is to perform HEAD requests against individual documents, then confirm that the HTTP response code is not 404 (page — or document in this case — not found).

HEAD /my_index/my_type/my_id1

An example response would be just headers for success:

HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8
Content-Length: 0

And for non-existence, it looks practically the same, minus the response code:

HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=UTF-8
Content-Length: 0

This requires that you perform N requests against an index expected to have N documents. As you might expect, that does not scale very well.

Batch Processing

The next thing that you might choose to do is to perform these in batch requests using Multi GET API: _mget.

GET /my_index/my_type/_mget
{
  "ids": [ "my_id1", "my_id2" ]
}

Another equivalent approach to this is to simply search for the IDs and return the expected number.

GET /my_index/_refresh
GET /my_index/my_type/_search
{
  "size": 2,
  "query": {
    "ids": {
      "values": [ "my_id1", "my_id2" ]
    }
  }
}

First, the _refresh endpoint is invoked to ensure that everything indexed is searchable. This removes the "near real time search" aspect of searching. Under normal searching conditions, you should not be doing that, but it makes total sense here! Presumably your verification processing happens all at once, so calling _refresh once at the start of a job is sufficient for the rest of it as long as you're not checking for new data that comes in after the process starts.

That offers a superior request handling, but it does not avoid the background work, which remains unchanged, and it means that each document is going to be returned as well, which means added overhead.

Search, Then Batch

From batch processing, people generally branch off to try to beat the problem by subdividing it. First, they will often perform a search to determine what is missing and only then begin digging for the missing document:

GET /my_index/_refresh
GET /my_index/my_type/_search
{
  "size": 0,
  "query": {
    "ids": {
      "values": [ "my_id1", "my_id2" ]
    }
  }
}

If you check the response's hits.total value, then it should match the number that you expect. In this simple example, it should be 2. If it's not 2, then you need to switch gears and either:

  1. Search for all of the IDs, and actually return them by specifying an appropriate size, then find the missing needle(s) in the haystack.
  2. Fall back to the _mget example above and do the same thing: determine what ID is missing.

There Must Be A Better Way

The long-winded or two-step processing can be a bit messy at scale, and it really implies that you're asking an expensive question, but also that you're perhaps asking the wrong question.

What if you flipped the query or question on its head and, instead of effectively querying for "find everything that exists", you asked "what is missing?" It is hard to query for "what is not here" and, in fact, it's downright impossible because the data does not exist. However, we can use aggregations to answer the question if you can structure data conveniently.

Most verification use cases come from an SQL world, where integer-based keys are extremely common for primary keys and foreign keys. Even if you do not use those numeric IDs as your Elasticsearch _id, you should be indexing those as integer-based values with doc values enabled (on by default in ES 2.x and later). For example:

POST /my_index/my_type/my_id1
{
  "id": 1,
  …
}

The name of the field is irrelevant, but do note that it's not _id, which is a reserved metadata field in Elasticsearch. Once you start indexing that value, then finding missing data becomes very simple with histograms:

GET /my_index/my_type/_search
{
  "size": 0,
  "aggs": {
    "find_missing_ids": {
      "histogram": {
        "field": "id",
        "interval": 1,
        "min_doc_count": 0
      },
      "aggs": {
        "remove_existing_bucket_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "_count"
            },
            "script": {
              "inline": "count == 0",
              "lang": "expression"
            }
          }
        }
      }
    }
  }
}

This does two things:

  1. It performs a histogram across the numeric id field with a step of 1 between each value, meaning that for all ids, it will find every integer-based value between them (e.g., 1 - 5 would be a histogram of 1, 2, 3, 4, 5).
  2. It removes any histogram bucket that actually contains data by using a bucket selector, which only exists in Elasticsearch 2.0 and later.

By removing what exists, we're left with what does not exist which might look something like this:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "find_missing_ids": {
      "buckets": [
        {
          "key": 4,
          "doc_count": 0
        }
      ]
    }
  }
}

From a tiny index with 5 documents, known thanks to hits.total, this shows that you are missing the document whose id is 4.

That's all there is to it: if you can create a histogram from the unique key of the document, then you can create one that only shows missing data. You cannot do this with strings, which means things like "my_id" and UUIDs (e.g., a GUID) cannot use this approach with aggregations! It comes back to structuring your data for this type of check.

Mind The Gap

For those not very familiar with sequential identifiers, it is not uncommon for systems (e.g., SQL databases) to prefetch batches of identifiers. Within those batches, it's possible for some values to get skipped for many reasons, such as a failed transaction or even for simpler reasons like the record being deleted explicitly afterward.

In such cases, you need to be aware that those will appear as missing documents according to the histogram. To avoid them, as the client, you can tell Elasticsearch to explicitly ignore those values by adding a secondary bucket selector that eliminates those ids explicitly, or ignore them on the client side.

Verifying Entire Documents

The verification of entire documents is a different beast from simple existence checking. In this scenario, entire documents need to be verified to ensure that the data contained is the expected data. From the Elasticsearch standpoint, it's actually easier, but the work on your side is more difficult.

Scrolling Through Data

The simplest way to perform this check is to scroll through it. Fortunately, this does not mean piping all of your data into an HTML table and using your mouse wheel to double check everything. No, it means using the _scroll API to iterate across all of your data (_doc, used below, is described in that link). Like every search API, it also works on the same near real time principles noted above.

GET /my_index/my_type/_search?scroll=1m
{
  "sort": [
    "_doc"
  ]
}

The scroll time is the amount of time that you, as client software, need to process that batch before requesting the next batch. Make sure that it's long enough to be able to loop through the subsequent response and read through the linked documentation above!

By scrolling through the data, you can go document-by-document and verify all of the data is correct per your requirements.

Taking Control of Versioning

Elasticsearch supports optimistic concurrency control, which is another way to say versioning.

You can take complete control of versioning by supplying your own version number. If you do this, then you can possibly side-step full document verification by verifying version numbers. To enable version numbers in search responses, you must specify the version flag:

GET /my_index/my_type/_search
{
  "version": true
}

Side-Stepping Verification

It is possible that you can side-step verification altogether, if you can trust that the data was only provided by a trusted user and any ingestion failures were properly handled (e.g., if Elasticsearch was down when document 15123 should have been indexed, then something still needs to add it). In that scenario, then verification can become redundant.

X-Pack Security in 5.0 and Shield in earlier releases can provide the security aspect to give trusted users access and block untrusted users, but it is up to you to properly handle ingestion failure because it completely depends on what is doing the ingestion.

Fin

You have done it! You made it through my long blog post.

I hope that you enjoyed reading about data verification and how structuring your data can lead to dramatically simplified approaches to doing it. Thinking about this kind of approach, you can start to see how other problems can be solved by thinking about the query or data differently, alongside rich features like aggregations or X-Pack Security.

We are constantly looking for creative ways to solve interesting problems, and this is no exception. As always, we encourage our users to discuss these issues on our forum, and open GitHub issues for any issues that you may come across and we are also available on Twitter (@pickypg for me) and IRC to help you out!