Giuseppe Santoro

The antidote for index mapping exceptions: ignore_malformed

Ignore fields not compliant with index mappings and avoid dropping documents during ingestion to Elasticsearch®


In this article, I'll explain how the ignore_malformed setting can make the difference between a 100% drop rate and a 100% success rate, even if that means ignoring some malformed fields.

As a senior software engineer working at Elastic®, I have been on the first line of support for anything related to Beats or Elastic Agent running on Kubernetes and Cloud Native integrations like Nginx ingress controller.

During my experience, I have seen all sorts of issues. Users have very different requirements. But at some point during their experience, most of them encounter a very common problem with Elasticsearch: index mapping exceptions.

How mappings work

Like any other document-based NoSQL database, Elasticsearch doesn’t force you to provide the document schema (called index mapping or simply mapping) upfront. If you provide a mapping, it will use it. Otherwise, it will infer one from the first document or any subsequent documents that contain new fields.

In reality, the situation is not black and white. You can also provide a partial mapping that covers only some of the fields, like the most common fields, and leave Elasticsearch to figure out the mapping of all the other fields during ingestion with Dynamic Mapping.
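To make this concrete, here is a minimal sketch of the hybrid approach (the index name my-partial-index and its fields are just illustrative): we map only one field upfront and let Dynamic Mapping infer the rest at ingestion time.

PUT my-partial-index
{
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" }
    }
  }
}

PUT my-partial-index/_doc/1
{
  "timestamp": "2023-08-01T00:00:00Z",
  "message": "hello world"
}

If you now run GET my-partial-index/_mapping, you will see that timestamp kept the mapping we declared, while message was dynamically mapped as text with a keyword sub-field.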

What happens when data is malformed?

Whether you specified a mapping upfront or Elasticsearch inferred one automatically, Elasticsearch will drop an entire document if even a single field doesn't match the index mapping, and return an error instead. This is not much different from what happens with SQL databases or other NoSQL data stores with inferred schemas. The reason for this behavior is to prevent malformed data and exceptions at query time.

A problem arises if a user doesn't look at the ingestion logs and misses those errors. They might never figure out that something went wrong, or even worse, Elasticsearch might stop ingesting data entirely if all the subsequent documents are malformed.

This may sound catastrophic, but it is entirely possible: I have seen it many times while on call for support and on discuss.elastic.co. It is even more likely to happen with user-generated documents, where you don't have full control over the quality of your data.

Luckily, there is a little-known setting in Elasticsearch that solves exactly the problems above. It has been available since Elasticsearch 2.0. We are talking ancient history here, since the latest version of the stack at the time of writing is Elastic Stack 8.9.0.

Let's now dive into how to use this Elasticsearch feature.

A toy use case

To make it easier to interact with Elasticsearch, I am going to use Kibana® Dev Tools in this tutorial.

The following examples are taken from the official documentation on ignore_malformed. Here, I expand on those examples with a few more details about what happens behind the scenes and how to search for ignored fields. We are going to use the index name my-index, but feel free to change that to whatever you like.

First, we want to create an index mapping with two fields called number_one and number_two. Both fields have type integer, but only one of them has ignore_malformed set to true; the other inherits the default value, ignore_malformed: false.

PUT my-index
{
  "mappings": {
    "properties": {
      "number_one": {
        "type": "integer",
        "ignore_malformed": true
      },
      "number_two": {
        "type": "integer"
      }
    }
  }
}

If the mentioned index didn’t exist before and the previous command ran successfully, you should get the following result:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "my-index"
}

To double-check that the above mapping has been created correctly, we can query the newly created index with the command:

GET my-index/_mapping

You should get the following result:

{
  "my-index": {
    "mappings": {
      "properties": {
        "number_one": {
          "type": "integer",
          "ignore_malformed": true
        },
        "number_two": {
          "type": "integer"
        }
      }
    }
  }
}

Now we can ingest two sample documents, both containing a malformed field:

PUT my-index/_doc/1
{
  "text":       "Some text value",
  "number_one": "foo"
}

PUT my-index/_doc/2
{
  "text":       "Some text value",
  "number_two": "foo"
}

The only difference between the two documents is which field receives the sample string “foo” instead of an integer. The document with id=1 is ingested correctly, while the document with id=2 fails with the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "document_parsing_exception",
        "reason": "[3:17] failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: 'foo'"
      }
    ],
    "type": "document_parsing_exception",
    "reason": "[3:17] failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: 'foo'",
    "caused_by": {
      "type": "number_format_exception",
      "reason": "For input string: \"foo\""
    }
  },
  "status": 400
}

Depending on the client used for ingesting your documents, you might get different errors or warnings, but logically the problem is the same: the entire document is not ingested because part of it doesn't conform with the index mapping. There are too many possible error messages to name, but suffice it to say that malformed data is quite a common problem, and we need a better way to handle it.
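The same per-document logic applies to bulk ingestion, which is how most clients ship data in practice. As a sketch (the ids 4 and 5 are arbitrary), you can reproduce both outcomes in a single Bulk API request:

POST my-index/_bulk
{ "index": { "_id": "4" } }
{ "number_one": "bar" }
{ "index": { "_id": "5" } }
{ "number_two": "bar" }

The response comes back with "errors": true, and each item has to be inspected individually: the item for id=4 succeeds because number_one has ignore_malformed set to true, while the item for id=5 fails with status 400 and the same number_format_exception we saw above. The rest of the batch is unaffected.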

Now that at least one document has been ingested, you can try searching with the following query:

GET my-index/_search
{
  "fields": [
    "*"
  ]
}

Here, the parameter fields is required to show the values of those fields that have been ignored. More on this later.

From the result, you can see that only the first document (with id=1) has been ingested correctly while the second document (with id=2) has been completely dropped.

{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "my-index",
        "_id": "1",
        "_score": null,
        "_ignored": ["number_one"],
        "_source": {
          "text": "Some text value",
          "number_one": "foo"
        },
        "fields": {
          "text": ["Some text value"],
          "text.keyword": ["Some text value"]
        },
        "ignored_field_values": {
          "number_one": ["foo"]
        },
        "sort": ["1"]
      }
    ]
  }
}

From the above JSON response, you will notice a few things:

  • A new field called _ignored, of type array, with the list of all fields that were ignored while ingesting the document
  • A new field called ignored_field_values, with a dictionary of ignored fields and their values
  • The field _source contains the original document, unmodified. This is especially useful if you want to fix the problems with the mapping later.
  • The field text was not present in the original mapping, but it is now included since Elasticsearch automatically inferred its type. In fact, if you query the mapping of the index my-index again via the command:
GET my-index/_mapping

You should get this result:

{
  "my-index": {
    "mappings": {
      "properties": {
        "number_one": {
          "type": "integer",
          "ignore_malformed": true
        },
        "number_two": {
          "type": "integer"
        },
        "text": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

Finally, ingest a valid document with the following command:

PUT my-index/_doc/3
{
  "text":       "Some text value",
  "number_two": 10
}

You can check how many documents have at least one ignored field with the following Exists query:

GET my-index/_search
{
  "query": {
    "exists": {
      "field": "_ignored"
    }
  }
}

You can also see that out of the two documents ingested (with id=1 and id=3) only the document with id=1 contains an ignored field.

{
  "took": 193,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "my-index",
        "_id": "1",
        "_score": 1,
        "_ignored": ["number_one"],
        "_source": {
          "text": "Some text value",
          "number_one": "foo"
        }
      }
    ]
  }
}

Alternatively, you can search for all documents that have a specific field being ignored with this Terms query:

GET my-index/_search
{
  "query": {
    "terms": {
      "_ignored": [ "number_one"]
    }
  }
}

The result in this case is the same as before, since the only document we ingested with an ignored field is the one where that exact field, number_one, was ignored.

Conclusion

Because we are big fans of this flag, we've enabled ignore_malformed by default for all Elastic integrations and in the default index template for logs data streams as of 8.9.0. More information can be found in the official documentation for ignore_malformed.

And since I am personally working on this feature, I can reassure you that it is a game changer.

Before Elastic Stack 8.9.0, you can set ignore_malformed manually on any cluster. Starting from Elastic Stack 8.9.0, you can simply rely on the defaults we set for you.
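If setting the flag field by field feels tedious, there is also an index-level setting, index.mapping.ignore_malformed, that applies the behavior to every field type that supports it. A sketch (my-other-index is just an example name):

PUT my-other-index
{
  "settings": {
    "index.mapping.ignore_malformed": true
  },
  "mappings": {
    "properties": {
      "number_one": { "type": "integer" },
      "number_two": {
        "type": "integer",
        "ignore_malformed": false
      }
    }
  }
}

Here, number_one inherits the index-level default and ignores malformed values, while number_two explicitly opts out and keeps rejecting them.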

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.