Types and Mappings

A type in Elasticsearch represents a class of similar documents. A type consists of a name (such as user or blogpost) and a mapping. The mapping, like a database schema, describes the fields or properties that documents of that type may have, the datatype of each field (such as string, integer, or date), and how those fields should be indexed and stored by Lucene.

Types can be useful abstractions for partitioning similar-but-not-identical data. However, because of how Lucene operates, they come with some restrictions.

How Lucene Sees Documents

A document in Lucene consists of a simple list of field-value pairs. A field must have at least one value, but any field can contain multiple values. Similarly, a single string value may be converted into multiple values by the analysis process. Lucene doesn’t care if the values are strings or numbers or dates; all values are just treated as opaque bytes.

When we index a document in Lucene, the values for each field are added to the inverted index for the associated field. Optionally, the original values may also be stored unchanged so that they can be retrieved later.
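The idea of one inverted index per field can be sketched in a few lines of Python. This is only a toy model for illustration, not how Lucene is implemented; the names `index_document` and `inverted` are made up for this example, and stored values are left out.

```python
from collections import defaultdict

# field name -> term -> set of document IDs (one inverted index per field)
inverted = defaultdict(lambda: defaultdict(set))

def index_document(inverted, doc_id, fields):
    """Add one flat document (field -> list of values) to the per-field indices."""
    for field, values in fields.items():
        for value in values:
            inverted[field][value].add(doc_id)

# A field may hold several values; every value is treated as an opaque term.
index_document(inverted, 1, {"name": ["john", "smith"], "age": ["42"]})
index_document(inverted, 2, {"name": ["jane", "smith"]})

print(sorted(inverted["name"]["smith"]))  # [1, 2]
print(sorted(inverted["age"]["42"]))      # [1]
```

Looking up a term in the `name` index never touches the `age` index, which is why the field name matters as much as the value.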

How Types Are Implemented

Elasticsearch types are implemented on top of this simple foundation. An index may have several types, and documents of any of these types may be stored in the same index.

Because Lucene has no concept of document types, the type name of each document is stored with the document in a metadata field called _type. When we search for documents of a particular type, Elasticsearch simply uses a filter on the _type field to restrict results to documents of that type.
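As a rough sketch of what that filter amounts to, the toy Python model below keeps documents of two types in one list and restricts a search by comparing the `_type` field. The `search` function and the sample documents are hypothetical; real Elasticsearch applies the filter inside Lucene rather than scanning a list.

```python
# Toy "index": every document carries its type in a _type metadata field.
index = [
    {"_type": "user", "name": "john"},
    {"_type": "blogpost", "title": "hello world"},
    {"_type": "user", "name": "jane"},
]

def search(index, doc_type=None):
    """Return all documents, optionally restricted to one type via _type."""
    if doc_type is None:
        return list(index)
    return [doc for doc in index if doc["_type"] == doc_type]

users = search(index, doc_type="user")
print([doc["name"] for doc in users])  # ['john', 'jane']
```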

Lucene also has no concept of mappings. Mappings are the layer that Elasticsearch uses to map complex JSON documents into the simple flat documents that Lucene expects to receive.

For instance, the mapping for the name field in the user type may declare that the field is a string field, and that its value should be analyzed by the whitespace analyzer before being indexed into the inverted index called name:

"name": {
    "type":     "string",
    "analyzer": "whitespace"
}
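To make the mapping step more concrete, here is a minimal Python sketch that turns a nested JSON document into flat field-value pairs and then splits the name field on whitespace, as the whitespace analyzer would. The helpers `flatten` and `whitespace_analyze` and the sample document are assumptions made for illustration; the dotted naming of inner fields follows the way Elasticsearch flattens inner objects.

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into {dotted.field.name: [values]}."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            values = value if isinstance(value, list) else [value]
            flat.setdefault(path, []).extend(values)
    return flat

def whitespace_analyze(text):
    """Split on whitespace only; no lowercasing or other token changes."""
    return text.split()

doc = {"name": "John Smith", "address": {"city": "London"}}
flat = flatten(doc)
# Apply the analyzer configured for the name field before indexing its terms.
flat["name"] = [term for value in flat["name"] for term in whitespace_analyze(value)]

print(flat)  # {'name': ['John', 'Smith'], 'address.city': ['London']}
```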

Avoiding Type Gotchas

This leads to an interesting thought experiment: what happens if you have two different types, each with an identically named field but mapped differently (e.g. one is a string, the other is a number)?

Well, the short answer is that bad things happen and Elasticsearch won’t allow you to define this mapping at all. You’d receive an exception when attempting to configure the mapping.

The longer answer is that each Lucene index contains a single, flat schema for all fields. A particular field is either mapped as a string, or a number, but not both. And because types are a mechanism added by Elasticsearch on top of Lucene (in the form of a metadata _type field), all types in Elasticsearch ultimately share the same mapping.

Take for example this mapping of two types in the data index:

{
   "data": {
      "mappings": {
         "people": {
            "properties": {
               "name": {
                  "type": "string"
               },
               "address": {
                  "type": "string"
               }
            }
         },
         "transactions": {
            "properties": {
               "timestamp": {
                  "type": "date",
                  "format": "strict_date_optional_time"
               },
               "message": {
                  "type": "string"
               }
            }
         }
      }
   }
}

Each type defines two fields ("name"/"address" and "timestamp"/"message" respectively). It may look like they are independent, but under the covers both types share a single flat schema in the index, which would look something like this:

{
   "data": {
      "mappings": {
        "_type": {
          "type": "string",
          "index": "not_analyzed"
        },
        "name": {
          "type": "string"
        },
        "address": {
          "type": "string"
        },
        "timestamp": {
          "type": "long"
        },
        "message": {
          "type": "string"
        }
      }
   }
}

Note: This is not valid mapping syntax; it is used only for demonstration.

The mappings are essentially flattened into a single, global schema for the entire index. And that’s why two types cannot define conflicting fields: Lucene wouldn’t know what to do when the mappings are flattened together.
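The flattening and the resulting conflict check can be sketched in Python as follows. This is a simplified illustration under the assumption that two definitions conflict whenever they are not identical; the function name `flatten_mappings` is hypothetical, and the real Elasticsearch validation is more nuanced.

```python
def flatten_mappings(type_mappings):
    """Merge the per-type field definitions into one index-wide schema.

    Raises ValueError if two types define the same field differently.
    """
    merged = {}
    for type_name, mapping in type_mappings.items():
        for field, definition in mapping["properties"].items():
            existing = merged.get(field)
            if existing is not None and existing != definition:
                raise ValueError(
                    f"field '{field}' in type '{type_name}' conflicts "
                    f"with an existing definition: {existing}"
                )
            merged[field] = definition
    return merged

mappings = {
    "people": {"properties": {
        "name": {"type": "string"},
        "address": {"type": "string"},
    }},
    "transactions": {"properties": {
        "timestamp": {"type": "date", "format": "strict_date_optional_time"},
        "message": {"type": "string"},
    }},
}

print(sorted(flatten_mappings(mappings)))
# ['address', 'message', 'name', 'timestamp']
```

If the transactions type also declared a name field of type long, the same call would raise a ValueError, mirroring the exception Elasticsearch throws when the mapping is configured.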

Type Takeaways

So what’s the takeaway from this discussion? Technically, multiple types may live in the same index as long as their fields do not conflict (either because the fields are mutually exclusive, or because they share identical fields).

Practically though, the important lesson is this: types are useful when you need to discriminate between different segments of a single collection. The overall "shape" of the data is identical (or nearly so) between the different segments.

Types are not as well suited for entirely different types of data. If your two types have mutually exclusive sets of fields, that means half your index is going to contain "empty" values (the fields will be sparse), which will eventually cause performance problems. In these cases, it’s much better to utilize two independent indices.

In summary:

  • Good: kitchen and lawn-care types inside the products index, because the two types are essentially the same schema.
  • Bad: products and logs types inside the data index, because the two types are mutually exclusive. Separate these into their own indices.
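To put a rough number on the sparsity argument, the Python sketch below counts what fraction of (document, field) slots are empty when every document is viewed against the index-wide set of fields. It is a back-of-the-envelope illustration only; the `sparsity` function and the sample documents are invented here and do not reflect how Lucene measures or stores anything.

```python
def sparsity(docs):
    """Fraction of (document, field) slots that are empty across the index."""
    all_fields = set()
    for doc in docs:
        all_fields.update(doc)
    if not docs or not all_fields:
        return 0.0
    filled = sum(len(doc) for doc in docs)
    return 1 - filled / (len(docs) * len(all_fields))

# Good: kitchen and lawn-care products share the same fields.
products_index = [
    {"name": "whisk", "price": 5},
    {"name": "rake", "price": 20},
]

# Bad: products and logs have mutually exclusive fields.
data_index = [
    {"name": "whisk", "price": 5},
    {"name": "rake", "price": 20},
    {"timestamp": 1, "message": "started"},
    {"timestamp": 2, "message": "stopped"},
]

print(sparsity(products_index))  # 0.0
print(sparsity(data_index))      # 0.5
```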