23 September 2013

An Introduction to Elasticsearch Mapping

By Njal Karevoll

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

This article will give an introduction to the mapping feature of Elasticsearch. We'll define the key terms and take a closer look at what mapping is, when we specify it, how it is structured and how we can apply it to our data.

What is a Schema?

A schema is a description of one or more fields that describes the document type and how to handle the different fields of a document.

The schema in Elasticsearch is a mapping that describes the fields in the JSON documents along with their data type, as well as how they should be indexed in the Lucene indexes that lie under the hood. Because of this, in Elasticsearch terms, we usually call this schema a “mapping”.

Conceptually, an Elasticsearch server contains zero or more indexes. An index is a container for zero or more types, each of which in turn has zero or more documents. To put it another way: a document has an identifier and belongs to a type, which in turn belongs to an index. The figure below shows the documents A, B, X and Y of the type my_type, inside the index my_index.

Indexes, types and documents

The type called another_type and the index called another are shown in order to emphasize that Elasticsearch is multi-tenant, by which we mean that a single server can store multiple indexes and multiple types.

In the Elasticsearch documentation and related material, we often see the term “mapping type”, which is actually the name of the type inside the index, such as my_type and another_type in the figure above. When we talk about types in Elasticsearch, it is usually this definition of type. It is not to be confused with the type key inside each mapping definition that determines how the data inside the documents are handled by Elasticsearch.

When to Specify a Custom Mapping

Elasticsearch has the ability to be schema-less, which means that documents can be indexed without explicitly providing a schema.

If you do not specify a mapping, Elasticsearch will by default generate one dynamically when detecting new fields in documents during indexing. However, this dynamic mapping generation comes with a few caveats:

  1. Detected types might not be correct.
  2. May lead to unnecessary duplication. (The _source field and _all field especially.)
  3. Uses default analyzers and settings for indexing and searching.

For example, a timestamp is often represented in JSON as a long, but Elasticsearch will be unable to detect the field as a date field, preventing date filters and facets such as the date histogram facet from working properly.

By explicitly specifying the schema, we can avoid these problems.
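As a rough illustration of the timestamp case, the sketch below parses a document whose timestamp is a plain number and pairs it with a mapping that declares the field as a date. The field name `timestamp` and the sample value are made up for this example; `my_type` follows the naming used elsewhere in this article.

```python
import json

# Hypothetical document: the timestamp is epoch milliseconds,
# so a JSON parser sees nothing more than a number.
raw_document = '{"timestamp": 1379894400000}'
document = json.loads(raw_document)

# Nothing in the parsed value says "this is a date"; it is just an integer.
looks_like_a_number = isinstance(document["timestamp"], int)

# An explicit mapping removes the guesswork by declaring the field as a date.
explicit_mapping = {
    "my_type": {
        "properties": {
            "timestamp": {"type": "date"}
        }
    }
}

print(looks_like_a_number)
print(json.dumps(explicit_mapping, indent=2))
```

The mapping dictionary here is only the request body; how it is sent to Elasticsearch is covered in the section on providing a mapping.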

What Does a Mapping Look Like?

The mapping is usually provided to Elasticsearch as JSON, and is a hierarchically structured format where the root is the name of the type the mapping applies to.

The Mapping Root

At the root level of the mapping, right under the type name, Elasticsearch supports a few “special” fields to configure how we should treat metadata that is not part of the document being posted, such as its type, id, size and its fallback _all field. For an extensive list of supported “special” fields, see the list of “fields” on the right-hand side of Mapping Reference.

The root object can also have a few extra attributes, which set the default indexing and search analyzers for the type, the date formats for automatically parsing dates in the type, and dynamic templates - we’ll revisit these in a future article. Apart from the fields mentioned above, there is no difference between the root level and the other mapping levels for a nested JSON document.

Hierarchical Levels

Each level usually defines a properties setting, which maps the keys of the document at that level to their mappings. This structure is hierarchical, which means that every level down to the leaf nodes may include properties settings for its child values. To visualize it, consider the following document and its mapping:

Document:

{
    "name": {
        "first": "John"
    }
}

Mapping:

{
  "my_type" : {
    "properties" : {
      "name" : {
        "properties" : {
          "first" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

While the mapping is slightly more convoluted than the document, the structure of the mapping clearly follows the structure of the document, with the added “properties”-nodes.
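To make the correspondence concrete, here is a minimal Python sketch that walks the mapping above by following a dotted document path through the “properties”-nodes. The helper name `mapping_for` is hypothetical and not part of any Elasticsearch client.

```python
# The mapping from the example above, written as a Python dictionary.
mapping = {
    "my_type": {
        "properties": {
            "name": {
                "properties": {
                    "first": {"type": "string"}
                }
            }
        }
    }
}


def mapping_for(mapping_root, type_name, dotted_path):
    """Follow a dotted document path through the nested "properties" nodes."""
    node = mapping_root[type_name]
    for key in dotted_path.split("."):
        # Each document level corresponds to one "properties" hop in the mapping.
        node = node["properties"][key]
    return node


print(mapping_for(mapping, "my_type", "name.first"))
```

Each step in the document path costs exactly one extra `"properties"` lookup in the mapping, which is all the added nesting amounts to.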

The type Key

In the above example, we see that the document field name.first differs from the rest of the structure in that it defines a type. The type key is used on the leaf levels to tell Elasticsearch how to handle the field at the given level in the document. If the type key is omitted, as in the case of non-leaf types, Elasticsearch assumes it is of the object type.
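The defaulting rule can be sketched in a few lines of Python: list every field in a mapping and report `object` wherever the type key is missing. The function `describe_fields` is a hypothetical helper written for this illustration; it only mirrors the rule described above and does not call Elasticsearch.

```python
def describe_fields(node, prefix=""):
    """Yield (path, type) pairs; a missing "type" key is reported as "object"."""
    for key, child in node.get("properties", {}).items():
        path = prefix + key
        yield path, child.get("type", "object")
        # Descend into child levels, if the field has any.
        yield from describe_fields(child, path + ".")


# The body of the my_type mapping from the earlier example.
type_mapping = {
    "properties": {
        "name": {
            "properties": {
                "first": {"type": "string"}
            }
        }
    }
}

fields = dict(describe_fields(type_mapping))
print(fields)
```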

The string type is one of the built-in core types, and Elasticsearch comes with support for many different types, such as geo_point and ip, which can be used to effectively index and search geographical locations and IPv4 addresses respectively. Using the multi_field type, we can even index a single document field into multiple virtual fields. We’ll elaborate on this in a future article.

How to Provide a Mapping

There are two ways of providing a mapping to Elasticsearch. The most common way is during index creation:

curl -XPOST ...:9200/my_index -d '{
    "settings" : {
        # .. index settings
    },
    "mappings" : {
        "my_type" : {
            # mapping for my_type
        }
    }
}'
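The same request can be assembled from Python using only the standard library. This is a sketch under a few assumptions: the host `http://localhost:9200` is borrowed from the Put Mapping example rather than from the elided URL above, the settings are left empty, the `Content-Type` header is an addition, and `build_create_index_request` is a hypothetical helper name. The request is built but deliberately not sent.

```python
import json
import urllib.request


def build_create_index_request(base_url, index_name, settings, mappings):
    """Build (but do not send) an index creation request carrying a mapping."""
    body = json.dumps({"settings": settings, "mappings": mappings}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=body,
        method="POST",  # mirrors the -XPOST used in the curl example
        headers={"Content-Type": "application/json"},
    )


request = build_create_index_request(
    "http://localhost:9200",  # assumed host, not taken from the curl example
    "my_index",
    settings={},  # index settings omitted in this sketch
    mappings={
        "my_type": {
            "properties": {
                "name": {"properties": {"first": {"type": "string"}}}
            }
        }
    },
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the result to `urllib.request.urlopen(request)` would send it to a running server.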

Another way of providing the mapping is using the Put Mapping API.

$ curl -XPUT 'http://localhost:9200/my_index/my_type/_mapping' -d '
{
    "my_type" : {
        # mapping for my_type
    }
}
'

Note that the type (my_type) is duplicated in the request path and the request body.
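A small Python sketch makes the duplication explicit by deriving both the path and the body from a single type name. The helper `build_put_mapping_request` is hypothetical, the `Content-Type` header is an addition, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_put_mapping_request(base_url, index_name, type_name, type_mapping):
    """Build (but do not send) a Put Mapping request for one type."""
    # The type name appears twice: once in the URL path, once as the body's root key.
    url = f"{base_url}/{index_name}/{type_name}/_mapping"
    body = json.dumps({type_name: type_mapping}).encode("utf-8")
    return urllib.request.Request(
        url=url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_put_mapping_request(
    "http://localhost:9200",
    "my_index",
    "my_type",
    {"properties": {"name": {"properties": {"first": {"type": "string"}}}}},
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Taking the type name as one argument avoids the path and the body drifting apart when the name changes.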

This API enables us to update the mapping for an already existing index, but with some limitations with regard to potential conflicts. New mapping definitions can be added to the existing mapping, and existing types may have their configuration updated, but changing the types is considered a conflict and is not accepted. It is possible to pass ignore_conflicts=true as a parameter to the Put Mapping API, but doing so does not guarantee the expected result, as already indexed documents are not automatically re-indexed with the new mapping.
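Before sending an update, it can help to check on the client side which fields would change type. The sketch below is not Elasticsearch's own conflict detection; it is a hypothetical helper (`find_type_conflicts`) that reads "changing the types" as changing the declared type of an existing field, and treats brand-new fields as additions.

```python
def find_type_conflicts(existing, updated, prefix=""):
    """Return dotted paths whose declared type differs between two mappings."""
    conflicts = []
    old_props = existing.get("properties", {})
    for key, new_child in updated.get("properties", {}).items():
        if key not in old_props:
            continue  # a brand-new field is an addition, not a conflict
        old_child = old_props[key]
        path = prefix + key
        # A missing "type" key is treated as "object", as described earlier.
        if old_child.get("type", "object") != new_child.get("type", "object"):
            conflicts.append(path)
        conflicts.extend(find_type_conflicts(old_child, new_child, path + "."))
    return conflicts


existing = {
    "properties": {
        "name": {"properties": {"first": {"type": "string"}}}
    }
}
updated = {
    "properties": {
        "name": {
            "properties": {
                "first": {"type": "long"},   # changes an existing field's type
                "last": {"type": "string"},  # adds a new field
            }
        }
    }
}

print(find_type_conflicts(existing, updated))
```

An empty result only means this simple comparison found nothing; the server remains the authority on what it accepts.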

Because of this, specifying the mapping during creation of the indexes is recommended over using the Put Mapping API in most cases.

Closing Remarks

I have now introduced you to the schema/mapping in Elasticsearch and demonstrated how the mapping is a hierarchical definition of data types. In a later article, I will go into more detail about a workflow I use when I explore new datasets with Elasticsearch.