29 June 2015 Engineering

The Great Mapping Refactoring

By Clinton Gormley

One of the biggest sources of pain for users of Elasticsearch today is the ambiguity that exists in type and field mappings. This ambiguity can result in exceptions at index time, exceptions at query time, incorrect results, results that change from request to request, and even index corruption and data loss.

In the quest to make Elasticsearch more solid and predictable, we have made a number of changes to make field and type mappings stricter and more reliable. In most cases, we only enforce the new rules when creating new indices in Elasticsearch v2.0, and we have provided a backwards compatibility layer which will keep your old indices functioning as before.

However, in certain cases, such as in the presence of conflicting field mappings as explained below, we are unable to do so. 

You will not be able to upgrade indices with conflicting field mappings to Elasticsearch v2.0. 

If the data in these indices is no longer needed, you can simply delete the indices. Otherwise, you will need to reindex your data with correct mappings.

Changing how mappings work is not a decision that we take lightly. Below I explain the problems that exist today and the solutions that we have implemented.

Conflicting field mappings

In the past we have said that document types are “like tables in a database”, which is a nice simple way to explain their purpose. Unfortunately, it is just not true: fields with the same name in different document types in the same index map to the same Lucene field name internally.

If you have an error field which is mapped as an integer in the apache document type and as a string in the nginx document type, Elasticsearch will end up mixing numeric and string data into the same Lucene field. Trying to search or aggregate on this field will either return wrong results, throw an exception, or even corrupt your index.

To resolve this problem, we first considered prefixing field names with the document type name, to make each field completely independent. The advantage of this approach is that document types would behave as real tables.

Unfortunately, it comes with a number of disadvantages:

  • Fields would always require a document type prefix to disambiguate a field in one type from another, or a wildcard to query the same field in multiple types.
  • Querying the same field name across document types would be slower as each field would have to be queried separately.
  • Most searches would require using multi-field queries instead of the simpler match and term queries, which would break most existing queries.
  • Heap usage, disk usage, and I/O would increase as there would be more sparsity and poorer compression.
  • Aggregations across document types would become much slower and more memory hungry as they wouldn’t be able to take advantage of global ordinals.

The solution

In the end, we opted to enforce the rule that all fields with the same name in the same index must have the same mapping, although certain parameters such as copy_to and enabled are still allowed to be specified per-type. This resolves the issues with data corruption, query time exceptions, and incorrect results. Queries and aggregations remain as fast as they are today, and we can maximise compression and minimise heap and disk usage.
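To make the rule concrete, here is a minimal Python sketch that scans the mappings of one index and reports fields whose definitions differ between types. It is only an illustration: the helper names (flatten, comparable, find_conflicts) are made up for this example, the only per-type exceptions it knows about are the copy_to and enabled parameters mentioned above, and the real checks in Elasticsearch are more detailed.

```python
from collections import defaultdict

# Parameters that, per the rule above, may still differ between types.
PER_TYPE_PARAMS = {"copy_to", "enabled"}


def flatten(properties, prefix=""):
    """Yield (full_path, field_definition) pairs for a "properties" block."""
    for name, definition in properties.items():
        path = prefix + name
        yield path, definition
        # Object fields nest their sub-fields under "properties".
        if "properties" in definition:
            yield from flatten(definition["properties"], path + ".")


def comparable(definition):
    """Drop nested properties and per-type parameters before comparing."""
    return {k: v for k, v in definition.items()
            if k != "properties" and k not in PER_TYPE_PARAMS}


def find_conflicts(mappings):
    """Return {field_path: {type_name: definition}} for fields that differ."""
    seen = defaultdict(dict)
    for type_name, type_mapping in mappings.items():
        for path, definition in flatten(type_mapping.get("properties", {})):
            seen[path][type_name] = comparable(definition)
    conflicts = {}
    for path, by_type in seen.items():
        definitions = list(by_type.values())
        if any(d != definitions[0] for d in definitions[1:]):
            conflicts[path] = by_type
    return conflicts


# Hypothetical mappings for the apache/nginx example used earlier.
mappings = {
    "apache": {"properties": {"error": {"type": "integer"},
                              "created_date": {"type": "date"}}},
    "nginx": {"properties": {"error": {"type": "string"},
                             "created_date": {"type": "date"}}},
}
conflicts = find_conflicts(mappings)
print(sorted(conflicts))  # ['error']
```

Only error is reported, because created_date is mapped as a date in both types.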

The disadvantage of this solution is that users who have treated types as completely separate tables need to revise their approach. This is not as problematic as it sounds. In reality, most field names represent a certain type of data: a created_date is always going to be a date, a number_of_hits field is always going to be numeric. Users who have conflicting field mappings today are either getting wrong results or losing their data to corruption. The real difference is that we’re now enforcing correct behaviour at index time, instead of relying on the user to follow best practice.

While the majority of users have non-conflicting field mappings, there are occasions when conflicts exist, so what techniques are available to deal with these situations? There are a few solutions:

Use separate indices instead of separate types

This is the simplest solution. Indices are completely independent of each other and so behave like real database tables. Cross-index queries work just as well as cross-type queries, and cross-index sorting and aggregations will continue to work as long as the fields being queried have the same data type — the same limitation that applies today.

Rename conflicting fields

When only a few conflicts exist, they can be resolved by changing field names to be more descriptive, either in the application or using Logstash. For instance, two error fields could be renamed to error_code and error_message respectively.
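If the renaming is done in the application, it can be as small as a per-type lookup table applied before indexing. The sketch below assumes the apache and nginx types from the earlier example; the RENAMES table and the rename_fields helper are hypothetical names, not part of any Elasticsearch client.

```python
# Hypothetical rename table: document type -> {old field name: new field name}.
RENAMES = {
    "apache": {"error": "error_code"},
    "nginx": {"error": "error_message"},
}


def rename_fields(doc_type, doc):
    """Return a copy of doc with conflicting top-level fields renamed."""
    renames = RENAMES.get(doc_type, {})
    return {renames.get(name, name): value for name, value in doc.items()}


print(rename_fields("apache", {"error": 404, "host": "web1"}))
print(rename_fields("nginx", {"error": "upstream timed out", "host": "web2"}))
```

Fields that are not listed in the table pass through unchanged.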

Use copy_to or multi-fields

Fields in different document types are allowed to have different copy_to settings. The original error field can have index set to no to essentially disable it across all document types, but the value of the error field can be copied to the integer error_code field in one type:

PUT my_index/_mapping/mapping_one
{
  "properties": {
    "error": {
      "type": "string",
      "index": "no",
      "copy_to": "error_code"
    },
    "error_code": {
      "type": "integer"
    }
  }
}
			

and to the string error_message field in another type:

PUT my_index/_mapping/mapping_two
{
  "properties": {
    "error": {
      "type": "string",
      "index": "no",
      "copy_to": "error_message"
    },
    "error_message": {
      "type": "string"
    }
  }
}
			

A similar solution can be achieved with multi-fields.

Nested fields for each data type

Sometimes you have no control over the documents sent to Elasticsearch, and over the fields they contain. Besides the potential for conflicts, blindly accepting whatever fields your users send you can result in a mapping explosion. Imagine what happens with documents which use a timestamp or an IP address as a field name.

Instead, a separate nested field can be used for each data type, such as str_val, int_val, date_val, etc. In order to use this approach, a document like this:

{
  "message": "some string",
  "count":   1,
  "date":    "2015-06-01"
}
			

would need to be reformatted by the application into:

{
  "data": [
    {"key": "message", "str_val":  "some string" },
    {"key": "count",   "int_val":  1             },
    {"key": "date",    "date_val": "2015-06-01"  }
  ]
}
			

While this solution requires more work on the application side, it solves both the conflict problem and the mapping-explosion problem at the same time.
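The reformatting step on the application side might look roughly like the following Python sketch. The str_val, int_val, and date_val names come from the example above; bool_val and float_val are assumptions added for illustration, and the date check is deliberately narrow (it only recognises the YYYY-MM-DD form).

```python
from datetime import datetime


def is_date(value):
    """Treat strings in YYYY-MM-DD form as dates (a deliberately narrow check)."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def typed_entry(key, value):
    """Wrap one key/value pair in an entry named after the value's data type."""
    # bool is a subclass of int in Python, so test it first.
    if isinstance(value, bool):
        return {"key": key, "bool_val": value}    # assumed field name
    if isinstance(value, int):
        return {"key": key, "int_val": value}
    if isinstance(value, float):
        return {"key": key, "float_val": value}   # assumed field name
    if isinstance(value, str) and is_date(value):
        return {"key": key, "date_val": value}
    return {"key": key, "str_val": str(value)}


def to_nested(doc):
    """Turn a flat document into the key/value layout shown above."""
    return {"data": [typed_entry(key, value) for key, value in doc.items()]}


doc = {"message": "some string", "count": 1, "date": "2015-06-01"}
print(to_nested(doc))
```

Whatever field names the incoming documents use, the mapping only ever contains key plus one field per data type.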

Ambiguous field lookup

Today, it is possible to refer to fields using their “short name”, the full path, or the full path prefixed by the document type. These options lead to ambiguity. Take this mapping for example:

{
  "mappings": {
    "user": {
      "properties": {
        "title": {
          "type": "string"
        }
      }
    },
    "blog": {
      "properties": {
        "title": {
          "type": "string"
        },
        "user": {
          "type": "object",
          "properties": {
            "title": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}
		
  • Does title refer to user.title, blog.title, or blog.user.title?
  • Does user.title refer to user.title or blog.user.title?

The answer is: it depends on which one Elasticsearch finds first. The field that is selected could even change from request to request, depending on how the mappings are serialised on each node.

In 2.0, you will have to use the full path name without the document type prefix to refer to fields:

  • user.title maps only to the user.title field in the blog type.
  • title maps to the title field in both user and blog.
  • *title will match user.title and both title fields.
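The lookup rules in this list can be modelled with a few lines of Python. This is a sketch of the behaviour described here, not the actual Elasticsearch implementation: the full_paths and resolve helpers are hypothetical, and wildcard matching is approximated with fnmatchcase from the standard library.

```python
from fnmatch import fnmatchcase


def full_paths(properties, prefix=""):
    """Return the set of full field paths in a "properties" block."""
    paths = set()
    for name, definition in properties.items():
        path = prefix + name
        sub_properties = definition.get("properties")
        if sub_properties:
            paths |= full_paths(sub_properties, path + ".")
        else:
            paths.add(path)
    return paths


def resolve(pattern, mappings):
    """Match a field name or wildcard against full paths from all types."""
    all_paths = set()
    for type_mapping in mappings.values():
        all_paths |= full_paths(type_mapping.get("properties", {}))
    return sorted(path for path in all_paths if fnmatchcase(path, pattern))


# The user and blog types from the mapping above.
mappings = {
    "user": {"properties": {"title": {"type": "string"}}},
    "blog": {"properties": {
        "title": {"type": "string"},
        "user": {"type": "object",
                 "properties": {"title": {"type": "string"}}},
    }},
}

print(resolve("user.title", mappings))  # ['user.title']
print(resolve("title", mappings))       # ['title']
print(resolve("*title", mappings))      # ['title', 'user.title']
print(resolve("blog.title", mappings))  # []
```

Because paths from all types are merged into one set, title appears once even though both types define it, and a type-prefixed name such as blog.title matches nothing.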

How would we differentiate between the title field in user and the title field in blog?

We don’t have to. Because of the change explained in Conflicting field mappings, the mapping for the title field is the same in both types. In essence, there is only one field called title.

The type prefix (user. or blog.) used to have the side effect of filtering by the specified type. Querying the field blog.title would find only documents of type blog, not documents of type user. This syntax is misleading because it doesn’t work everywhere: aggregations or suggestions could contain results from any type. For this reason, plus the ambiguity demonstrated above, the type prefix is no longer supported.

IMPORTANT: You will need to update any percolators which make use of short names or the type prefix.

Type meta-fields

Every type has meta-fields like _id, _index, _routing, _parent, _timestamp, most of which supported various configuration options like index, store, or path. We have simplified these settings considerably.

  • _id and _type are no longer configurable.
  • _index may be enabled, which will store the index name with the document.
  • _routing may be marked as required only.
  • _size may be enabled only.
  • _timestamp is stored by default.
  • The _boost and _analyzer fields have been removed, and will be ignored on old indices.

It used to be possible to extract the _id, _routing, and _timestamp values from fields in the document body. This functionality has been removed because it required two rounds of document parsing and could result in conflicts. Instead, these values must be set explicitly in the URL or query string.
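In practice this means the application builds the request URL itself. The following sketch puts the ID in the path and passes routing and timestamp as query string parameters; the index_url helper, host, index, type, and values are all made up for illustration, and you should check the index API documentation for your version before relying on the parameter names.

```python
from urllib.parse import quote, urlencode


def index_url(base, index, doc_type, doc_id, routing=None, timestamp=None):
    """Build an index request URL with the meta-field values set explicitly."""
    path = "/".join(quote(str(part), safe="")
                    for part in (index, doc_type, doc_id))
    params = {}
    if routing is not None:
        params["routing"] = routing
    if timestamp is not None:
        params["timestamp"] = timestamp
    query = "?" + urlencode(params) if params else ""
    return f"{base}/{path}{query}"


url = index_url("http://localhost:9200", "my_index", "blog", "42",
                routing="user_7", timestamp="2015-06-29T12:00:00")
print(url)
```

The document body is then sent unchanged, with no need for Elasticsearch to parse it twice to find these values.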

With the exception of the _boost and _analyzer fields, the existing meta-field configuration will be respected on old indices.

Analyzer settings

It used to be possible to specify index and search analyzers at index level, at type level, at field level, and even at document level (with the _analyzer field). Combining tokens from different analysis chains into the same field results in bad relevance scores. With the move to disallowing conflicting field mappings, we have simplified the analysis settings considerably:

  • Each analyzed string field accepts an analyzer setting and a search_analyzer setting (which defaults to the value of the analyzer setting). The index_analyzer setting has been replaced by analyzer.
  • If a field with the same name exists in multiple types, all copies of the field must have the same analyzer settings.
  • The type level default analyzer, index_analyzer, and search_analyzer settings are no longer supported.
  • Default analyzers may be set per-index in the index analysis settings, by naming them default or default_search.
  • The per-document _analyzer field is no longer supported and will be ignored on existing indices.
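The resulting precedence can be summarised in a short Python sketch. It assumes that a field without any analyzer setting falls back to the index-level default and default_search analyzers, and finally to the built-in standard analyzer; that last fallback is not stated above and is included as an assumption. The effective_analyzers helper is hypothetical.

```python
def effective_analyzers(field_mapping, named_analyzers):
    """Return (index_time, search_time) analyzer names for one string field.

    named_analyzers stands for the analyzers defined in the index analysis
    settings, keyed by name.
    """
    # Assumption: "standard" is used when no "default" analyzer is defined.
    fallback = "default" if "default" in named_analyzers else "standard"
    index_time = field_mapping.get("analyzer", fallback)
    if "search_analyzer" in field_mapping:
        search_time = field_mapping["search_analyzer"]
    elif "analyzer" in field_mapping:
        # search_analyzer defaults to the value of the analyzer setting.
        search_time = field_mapping["analyzer"]
    elif "default_search" in named_analyzers:
        search_time = "default_search"
    else:
        search_time = fallback
    return index_time, search_time


# Hypothetical index analysis settings defining both defaults.
named = {"default": {"type": "english"}, "default_search": {"type": "simple"}}

print(effective_analyzers({"type": "string"}, named))
print(effective_analyzers({"type": "string", "analyzer": "french"}, named))
```

There is no type-level step anywhere in this chain, which is what makes it possible to require identical analyzer settings for same-named fields.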

index_name and path

The index_name and path settings have been removed in favour of copy_to, which has been available since Elasticsearch v1.0.0. They will continue to work on existing indices, but will no longer be accepted when creating new indices.

Dynamic mapping updates are synchronous

Today, when indexing a document which contains a previously unseen field, the field is added to the mapping locally, then forwarded to the master, which issues a cluster update to inform all shards of the new mapping. It is possible that the same field could be added to two shards at the same time. It is also possible that these two mappings could be different: one could be a double while the other is a long, or one could be a string while the other is a date.

In these cases, the first mapping to arrive at the master node wins. However, the shard with the “losing” mapping is already using a different data type, and will continue to use it. Later on, perhaps after restarting that node, the shard is moved to a different node and the official mapping from the master is applied. This results in index corruption and data loss.

To prevent this, a shard will now wait for the master to accept the new mapping before allowing indexing to continue. This makes all mapping updates deterministic and safe. Indexing a document containing new fields may be somewhat slower than before, because of the need to wait for acceptance, but the speed of cluster state updates has been greatly improved by two new features:

  • Cluster state diffs: Whenever possible, only changes to the cluster state are published instead of the entire cluster state.
  • Async shard info requests: During the shard allocation process, the master node sends a request to the data nodes to find out which ones have the most up to date copies of unassigned shards. This used to be a blocking call which would delay changes to the cluster state. As of v1.6.0, this request happens asynchronously in the background, allowing pending tasks like mapping updates to be processed more quickly.
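The difference that waiting for the master makes can be shown with a toy model in Python. This is not how Elasticsearch is implemented: the Master and Shard classes are invented for the example, and a simple lock stands in for the cluster state update. The point is only that a shard adopts whatever the master accepted instead of its own guess.

```python
import threading


class Master:
    """Toy master node: the first mapping proposed for a field wins."""

    def __init__(self):
        self._lock = threading.Lock()
        self.mapping = {}

    def propose(self, field, field_type):
        # Return the accepted type, which may differ from the proposal.
        with self._lock:
            return self.mapping.setdefault(field, field_type)


class Shard:
    """Toy shard that waits for the master before using a new field."""

    def __init__(self, master):
        self.master = master
        self.local_mapping = {}

    def index(self, field, value):
        if field not in self.local_mapping:
            guessed = "double" if isinstance(value, float) else "long"
            # Block until the master answers, then adopt its decision.
            self.local_mapping[field] = self.master.propose(field, guessed)
        return self.local_mapping[field]


master = Master()
shard_a, shard_b = Shard(master), Shard(master)
type_a = shard_a.index("price", 10)    # proposes long, which is accepted
type_b = shard_b.index("price", 9.99)  # proposes double, adopts long instead
print(type_a, type_b)  # long long
```

Both shards end up with the same type for price, so there is no “losing” mapping left behind on either of them.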

Deleting mappings

Finally, it is no longer possible to delete a type mapping (along with the documents of that type). Even after removing a mapping, information about the deleted fields continues to exist at the Lucene level, which can cause index corruption if fields of the same name are added later on. You can either leave mappings as they are or reindex your data into a new index.

Preparing for 2.0

Determining whether you have conflicting mappings or not can be tricky to do by hand. We have provided the Elasticsearch Migration Plugin to help you to figure it out, and to inform you about some features that you are currently using which have been deprecated or removed in 2.0.

If you have conflicting mappings, you will either need to reindex your data into a new index with correct mappings, or to delete old indices that you no longer need. You will not be able to upgrade to 2.0 without resolving these conflicts.