Strings are dead, long live strings!

Text vs. keyword

With the release of Elasticsearch 5.0 coming closer, it is time to introduce one of its highlights: the removal of the string type. The background for this change is that we think the string type is confusing: Elasticsearch has two very different ways to search strings. You can either search whole values, which we often refer to as keyword search, or individual tokens, which we usually refer to as full-text search. If you are familiar with Elasticsearch, you know that strings of the former kind should be mapped as a not_analyzed string while the latter should be mapped as an analyzed string.

But the fact that the same field type is used for these two very different use cases causes problems, since some options only make sense for one of them. For instance, position_increment_gap makes little sense for a not_analyzed string, and it is not obvious whether ignore_above applies to the whole value or to individual tokens in the case of an analyzed string (in case you wonder: it applies to the whole value; limits on individual tokens can be applied with the limit token filter).

To avoid these issues, the string field has been split into two new types: text, which should be used for full-text search, and keyword, which should be used for keyword search.

New defaults

At the same time as we did this split, we decided to change the default dynamic mappings for string fields. When getting started with Elasticsearch, a common frustration is that you have to reindex in order to be able to aggregate on whole field values. For instance, imagine you are indexing documents with a city field. Aggregating on this field would give separate counts for new and york instead of a single count for New York, which is usually the expected behaviour. Unfortunately, fixing this problem requires reindexing the field so that the index has the correct structure to answer this question.
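
To make the city example concrete, here is a small simulation in plain Python. It does not call Elasticsearch: the analyze function is a crude stand-in for a real analyzer (it only lowercases and splits on whitespace), and the sample values are made up for illustration.

```python
from collections import Counter

# Made-up sample values for a "city" field, one per document.
cities = ["New York", "New York", "New Delhi"]


def analyze(value):
    """Crude stand-in for an analyzer: lowercase, then split on whitespace."""
    return value.lower().split()


# Aggregating on an analyzed field counts documents per token.
token_counts = Counter(
    token for city in cities for token in set(analyze(city))
)

# Aggregating on whole values counts documents per exact string.
value_counts = Counter(cities)

print(token_counts)
print(value_counts)
```

In the token-based counts, new shows up in three documents and york in two, while New York never appears as a bucket of its own. Only the whole-value counts contain a single New York entry.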

To make things better, Elasticsearch decided to borrow an idea that initially stemmed from Logstash: strings will now be mapped both as text and keyword by default. For instance, if you index the following simple document:

{
  "foo": "bar"
}

Then the following dynamic mappings will be created:

{
  "foo": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}

As a consequence, it will be possible both to perform full-text search on foo and to perform keyword search and aggregations using the foo.keyword field.
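
As a rough sketch of what this looks like in practice, the request bodies below target the two fields. They are written as Python dictionaries and printed as JSON that could be sent to the _search endpoint. The aggregation name foo_values is an arbitrary choice made for this example.

```python
import json

# Full-text search: the query text is analyzed and matched against
# the individual tokens indexed for "foo".
full_text_search = {"query": {"match": {"foo": "bar"}}}

# Keyword search: the whole value must match exactly in "foo.keyword".
keyword_search = {"query": {"term": {"foo.keyword": "bar"}}}

# Aggregation on whole values. "foo_values" is an arbitrary name;
# "size": 0 skips the search hits and returns only the buckets.
keyword_aggregation = {
    "size": 0,
    "aggs": {"foo_values": {"terms": {"field": "foo.keyword"}}},
}

for body in (full_text_search, keyword_search, keyword_aggregation):
    print(json.dumps(body, indent=2))
```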

Disabling this feature is easy: all you need to do is either map string fields explicitly or use a dynamic template that matches all string fields. For instance, the dynamic template below can be used to restore the same dynamic mappings that were used in Elasticsearch 2.x:

{
  "match_mapping_type": "string",
  "mapping": {
    "type": "text"
  }
}
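
The template above is only the inner part of a dynamic template. The sketch below shows one way it might be placed inside an index creation body, built as a Python dictionary and printed as JSON. The type name my_type and the template name strings_as_text are hypothetical placeholders chosen for this example.

```python
import json

index_body = {
    "mappings": {
        "my_type": {  # hypothetical type name
            "dynamic_templates": [
                {
                    "strings_as_text": {  # hypothetical template name
                        "match_mapping_type": "string",
                        "mapping": {"type": "text"},
                    }
                }
            ]
        }
    }
}

print(json.dumps(index_body, indent=2))
```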

How to migrate

In most cases, the migration should be pretty straightforward. Fields that used to be mapped as an analyzed string

{
  "foo": {
    "type": "string",
    "index": "analyzed"
  }
}

Now need to be mapped as a text field:

{
  "foo": {
    "type": "text",
    "index": true
  }
}

And fields that used to be mapped as a not_analyzed string

{
  "foo": {
    "type": "string",
    "index": "not_analyzed"
  }
}

Now need to be mapped as a keyword field:

{
  "foo": {
    "type": "keyword",
    "index": true
  }
}

As you can see, now that string fields have been split into text and keyword, we do not need three states for the index property (analyzed, not_analyzed and no), which only existed because of string fields. We can use a simple boolean to tell Elasticsearch whether searching the field should be possible.
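
If you have many mappings to update, the rules above are simple enough to script. The helper below is a minimal sketch, not an official tool: upgrade_string_mapping is a hypothetical name, it handles a single field definition without recursing into nested fields or properties, and it copies every other option unchanged without checking that the option is still valid for the new type. Mapping a string with index set to no to a non-searchable keyword is an assumption of this sketch.

```python
def upgrade_string_mapping(mapping):
    """Return a 5.0-style copy of a single 2.x field mapping."""
    if mapping.get("type") != "string":
        return dict(mapping)  # not a string field: leave it alone

    upgraded = dict(mapping)
    # In 2.x, string fields were analyzed unless stated otherwise.
    old_index = upgraded.pop("index", "analyzed")

    # analyzed -> text; not_analyzed -> keyword.
    # Assumption of this sketch: "no" also becomes a keyword.
    upgraded["type"] = "text" if old_index == "analyzed" else "keyword"
    # The three-state option collapses into a boolean.
    upgraded["index"] = old_index != "no"
    return upgraded


print(upgrade_string_mapping({"type": "string", "index": "not_analyzed"}))
```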

Backward compatibility

Because major upgrades usually have their own challenges, we did our best not to require you to upgrade all mappings at the same time as you upgrade your cluster to Elasticsearch 5.0. First, the string field will keep working on existing 2.x indices. And when it comes to new indices, Elasticsearch has some logic that automatically converts string mappings to an equivalent text or keyword mapping. This is especially useful if you have index templates that add mappings with string fields: these templates will keep working with Elasticsearch 5.x. That said, you should still look into upgrading them, since we plan on removing this backward compatibility layer when we release Elasticsearch 6.0.