07 January 2014

A Data Exploration Workflow for Mappings

By Njal Karevoll

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

In this article we describe a workflow for exploring new data through mapping refinements. We index an example document, look at its default mapping, and iteratively improve it to get one step closer to our goal.

Introduction

Elasticsearch is for many intents and purposes schema-less, which means that documents can be indexed without explicitly providing a schema. The schema is a mapping that describes the fields in the JSON documents along with their data type and how they should be indexed in the Lucene indexes that lie under the hood.

In this article, I will describe a workflow for working with mappings in order to explore new data sets. I have personally used this workflow many times, and with great success. We'll also look at a new tool (Play) we've developed that greatly speeds up this kind of experimental work.

The Workflow

Without further ado, the workflow is a simple six-step process that may be loosely described like this:

  1. Extract a sufficiently large set of example JSON documents from the data set.
  2. Index the documents.
  3. Run a few sample queries.
  4. Look at the automatically generated mapping.
  5. Refine the generated mapping.
  6. Repeat steps 2 to 5 until I'm satisfied with the result.

The set of documents should be large enough that the query results can be verified. Using too many documents wastes time, since I try to keep the iterations short. On the other hand, using too few documents makes it difficult to assess the quality of my queries and the correctness of my mapping.
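Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set is assumed to be available as newline-delimited JSON, and the function name sample_documents and its parameters are hypothetical, not part of any Elasticsearch tooling. It uses reservoir sampling so that the source can be streamed once without knowing its size.

```python
import json
import random


def sample_documents(lines, k, seed=0):
    """Return up to k parsed JSON documents chosen uniformly at random.

    `lines` is any iterable of strings holding one JSON document each
    (an assumption about how the data set is stored). Reservoir sampling
    keeps memory use proportional to k, not to the size of the data set.
    """
    rng = random.Random(seed)  # fixed seed so iterations are repeatable
    reservoir = []
    seen = 0  # number of non-blank lines read so far
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        seen += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k documents.
            reservoir.append(json.loads(line))
        else:
            # Replace an existing entry with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = json.loads(line)
    return reservoir
```

Keeping the seed fixed means each iteration of the workflow sees the same sample, which makes it easier to compare query results before and after a mapping change.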

From this workflow, it is evident that to get the most power out of Elasticsearch, it is important to understand how this mapping works.

Looking at the Default Dynamic Mapping

We start by indexing the following JSON document to a fresh Elasticsearch index:

$ curl ...:9200/names/name -XPOST -d '{
        "userId": 10,
        "name": {
            "first": "Katherine",
            "last": "Jones"
        },
        "tags": ["trick manager", "restaurant manager"]
    }'
{"ok":true,"_index":"names","_type":"name","_id":"Q3Yj7mofTIW_YZTjW5Hp9w","_version":1}

By looking at the output from the indexing operation, we see that it was successfully indexed, and was assigned an automatically generated id. Let's look at the current mapping to see how it got stored:

$ curl ...:9200/names/name/_mapping?pretty
{
  "name" : {
    "properties" : {
      "name" : {
        "properties" : {
          "first" : {
            "type" : "string"
          },
          "last" : {
            "type" : "string"
          }
        }
      },
      "tags" : {
        "type" : "string"
      },
      "userId" : {
        "type" : "long"
      }
    }
  }
}

The "dynamic mapping" feature of Elasticsearch has detected that our "userId" field is numeric and selected the long datatype for it. tags was recognized as a list of strings, all of which are stored in the same Lucene field. Lastly, name was detected as an object with two string properties of its own.
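To make the detection rules concrete, here is a rough Python approximation of what happened to our example document. It is a simplified sketch, not Elasticsearch's actual implementation: the function infer_mapping is hypothetical, it ignores date detection and null values, it assumes arrays are non-empty and homogeneous, and it assumes floating-point numbers map to double.

```python
def infer_mapping(value):
    """Approximate the mapping that dynamic mapping derives for a JSON value."""
    if isinstance(value, dict):
        # Objects become a "properties" block, one entry per key.
        return {"properties": {k: infer_mapping(v) for k, v in value.items()}}
    if isinstance(value, list):
        # Arrays have no type of their own; they take the type of their elements.
        if not value:
            raise ValueError("cannot infer a type from an empty array")
        return infer_mapping(value[0])
    if isinstance(value, bool):
        # Checked before int, because bool is a subclass of int in Python.
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "double"}  # assumption, see the note above
    if isinstance(value, str):
        return {"type": "string"}
    raise TypeError("unsupported JSON value: %r" % (value,))
```

Applied to the document indexed above, this produces the same properties structure that the _mapping call returned: long for userId, string for tags, and an object with two string properties for name.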

When writing a custom mapping, I usually index a few documents and look at the generated mapping instead of writing a mapping from scratch. The reason is that Elasticsearch is fairly good at giving me sane defaults to base my work on, which saves me a lot of time writing boilerplate mapping and debugging simple typos.

Improving the Mapping

A simple query shows one important problem with this generated mapping: faceting on the tags returns the analyzed terms rather than the original tags.

$ curl ...:9200/names/name/_search?pretty -XPOST -d '{
    "query": {"match_all": {}},
    "size": 0,
    "facets": {
        "tags": {
            "terms": {"field": "tags"}
        }
    }
}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ ]
  },
  "facets" : {
    "tags" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 3,
      "other" : 0,
      "terms" : [ {
        "term" : "trick",
        "count" : 1
      }, {
        "term" : "restaurant",
        "count" : 1
      }, {
        "term" : "manager",
        "count" : 1
      } ]
    }
  }
}

Tags are generally meant to be keywords, but the default mapping analyzes them with the default analyzer, which tokenizes and lowercases the terms. If we were just looking at the terms from the above tags facet, we could easily believe that someone in the index was tagged as restaurant and trick.
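The effect can be reproduced outside Elasticsearch with a small Python sketch. It is an approximation only: analyzed_terms is a crude stand-in for the default analyzer (it merely splits on whitespace and lowercases), and all function names here are hypothetical. The counting mimics a terms facet by counting, for each term, the number of documents that contain it.

```python
from collections import Counter


def analyzed_terms(value):
    # Crude stand-in for the default analyzer: split on whitespace, lowercase.
    return [token.lower() for token in value.split()]


def verbatim_terms(value):
    # Keyword-style handling: the whole value is one term, stored as-is.
    return [value]


def terms_facet(documents, field, to_terms):
    """Count, per term, how many documents contain that term in `field`."""
    counts = Counter()
    for doc in documents:
        terms = set()  # a term counts once per document
        for value in doc.get(field, []):
            terms.update(to_terms(value))
        counts.update(terms)
    return dict(counts)
```

With the example document, terms_facet(docs, "tags", analyzed_terms) yields the three terms trick, restaurant and manager, each with a count of 1, while terms_facet(docs, "tags", verbatim_terms) keeps "trick manager" and "restaurant manager" intact.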

To fix this, we can update the mapping to store the tags field as a keyword by setting its index setting to not_analyzed, which means Elasticsearch will not analyze the field and will store its contents verbatim in the index. Elasticsearch is not able to merge this setting if we update it directly:

$ curl ...:9200/names/name/_mapping -XPOST -d '{
  "name" : {
    "properties" : {
      "tags" : {
        "type" : "string",
        "index": "not_analyzed"
      }
    }
  }
}'
{"error":"MergeMappingException[Merge failed with failures {[mapper [tags] has different index values, mapper [tags] has different tokenize values, mapper [tags] has different index_analyzer]}]","status":400}

However, changing a field by updating its type to multi_field is supported:

$ curl ...:9200/names/name/_mapping -XPOST -d '{
  "name" : {
    "properties" : {
      "tags" : {
        "type" : "multi_field",
        "fields": {
            "tags": {"type": "string"},
            "keyword": {"type": "string", "index": "not_analyzed"}
        }
      }
    }
  }
}'
{"ok":true,"acknowledged":true}
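When several fields need the same treatment, writing these request bodies by hand gets repetitive. The following Python sketch generates the body shown above; the function name multi_field_mapping and its parameters are hypothetical, and the structure it emits simply mirrors the request in this article rather than any official client API.

```python
import json


def multi_field_mapping(doc_type, field, sub_field="keyword"):
    """Build a mapping body that turns `field` into a multi_field.

    The default sub-field keeps the analyzed string, and `sub_field`
    holds a not_analyzed copy of the same value.
    """
    if sub_field == field:
        raise ValueError("sub_field must differ from the field name")
    return {
        doc_type: {
            "properties": {
                field: {
                    "type": "multi_field",
                    "fields": {
                        field: {"type": "string"},
                        sub_field: {"type": "string", "index": "not_analyzed"},
                    },
                }
            }
        }
    }


# Serialize the body so it can be passed to curl with -d.
body = json.dumps(multi_field_mapping("name", "tags"), indent=2)
```

The resulting string can be sent to the same _mapping endpoint as in the curl command above.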

We still have to re-index the data before these new fields are populated, but there's no indexing or querying downtime associated with this method. Thus, it is possible to do this operation on an index that is in use without affecting other users of the index. Note that this might be a nice strategy for staging indexes, but I still recommend against experimenting with production indexes or servers, because there is currently no rollback feature available in Elasticsearch. Here we re-index our example document under its existing id:

$ curl ...:9200/names/name/Q3Yj7mofTIW_YZTjW5Hp9w -XPOST -d '{
        "userId": 10,
        "name": {
            "first": "Katherine",
            "last": "Jones"
        },
        "tags": ["trick manager", "restaurant manager"]
    }'
{"ok":true,"_index":"names","_type":"name","_id":"Q3Yj7mofTIW_YZTjW5Hp9w","_version":1}

By faceting on the "virtual" field tags.keyword, we now get our desired results:

$ curl ...:9200/names/name/_search?pretty -XPOST -d '{
    "query": {"match_all": {}},
    "size": 0,
    "facets": {
        "tags": {
            "terms": {"field": "tags.keyword"}
        }
    }
}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ ]
  },
  "facets" : {
    "tags" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 2,
      "other" : 0,
      "terms" : [ {
        "term" : "trick manager",
        "count" : 1
      }, {
        "term" : "restaurant manager",
        "count" : 1
      } ]
    }
  }
}

Note that Elasticsearch will consider our new field tags.keyword to be empty for every document that has not been re-indexed with the updated mapping. We do not have to re-index all the documents, however, to test whether the new field we introduced works as expected.
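Once the mapping looks right, re-indexing more than a handful of documents one request at a time becomes tedious. A minimal sketch, assuming the newline-delimited format of the bulk API: the helper bulk_index_body below is hypothetical and only builds the request body; sending it to the _bulk endpoint is left out.

```python
import json


def bulk_index_body(index, doc_type, documents):
    """Build a bulk request body that re-indexes the given documents.

    `documents` is an iterable of (doc_id, source) pairs. Each document
    becomes two lines: an action line and the document source.
    """
    lines = []
    for doc_id, source in documents:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    if not lines:
        return ""
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

Re-using the existing ids, as done here, overwrites each document in place so that the new sub-field is populated without creating duplicates.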

Ending Remarks

Spending some time figuring out an efficient workflow for working with new or large data sets is likely to save a lot of time over the course of a project. Test mapping changes in small increments and, if possible, without indexing your complete document collection; this saves time on both debugging and indexing.

Using tools to accomplish the above can drastically decrease the time between these iterations and make it easier to experiment with new features.