The Root Object

The uppermost level of a mapping is known as the root object. It may contain the following:

  • A properties section, which lists the mapping for each field that a document may contain
  • Various metadata fields, all of which start with an underscore, such as _type, _id, and _source
  • Settings, which control how the dynamic detection of new fields is handled, such as analyzer, dynamic_date_formats, and dynamic_templates
  • Other settings, which can be applied both to the root object and to fields of type object, such as enabled, dynamic, and include_in_all

Properties

We have already discussed the three most important settings for document fields or properties in Core Simple Field Types and Complex Core Field Types:

type
The datatype that the field contains, such as string or date
index
Whether a field should be searchable as full text (analyzed), searchable as an exact value (not_analyzed), or not searchable at all (no)
analyzer
Which analyzer to use for a full-text field, both at index time and at search time

We will discuss other field types such as ip, geo_point, and geo_shape in the appropriate sections later in the book.

Metadata: _source Field

By default, Elasticsearch stores the JSON string representing the document body in the _source field. Like all stored fields, the _source field is compressed before being written to disk.

This is almost always desired functionality because it means the following:

  • The full document is available directly from the search results—no need for a separate round-trip to fetch the document from another data store.
  • Partial update requests will not function without the _source field.
  • When your mapping changes and you need to reindex your data, you can do so directly from Elasticsearch instead of having to retrieve all of your documents from another (usually slower) data store.
  • Individual fields can be extracted from the _source field and returned in get or search requests when you don’t need to see the whole document.
  • It is easier to debug queries, because you can see exactly what each document contains, rather than having to guess their contents from a list of IDs.

That said, storing the _source field does use disk space. If none of the preceding reasons is important to you, you can disable the _source field with the following mapping:

PUT /my_index
{
    "mappings": {
        "my_type": {
            "_source": {
                "enabled":  false
            }
        }
    }
}

In a search request, you can ask for only certain fields by specifying the _source parameter in the request body:

GET /_search
{
    "query":   { "match_all": {}},
    "_source": [ "title", "created" ]
}

Values for these fields will be extracted from the _source field and returned instead of the full _source.
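The effect of the _source parameter can be pictured with a small Python sketch. It only illustrates the filtering behavior for top-level field names and is not how Elasticsearch implements it; the function filter_source and the sample document are hypothetical.

```python
def filter_source(source, fields):
    """Return only the requested top-level fields that exist in the source."""
    return {name: source[name] for name in fields if name in source}


# Hypothetical stored document body (the contents of _source).
stored_source = {
    "title": "Mapping basics",
    "created": "2014-09-01",
    "body": "A long article body",
}

# Mirrors "_source": [ "title", "created" ] in the request above.
filtered = filter_source(stored_source, ["title", "created"])
print(filtered)
```

The body field is still stored in _source; it is simply left out of the response.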

Metadata: _all Field

In Search Lite, we introduced the _all field: a special field that indexes the values from all other fields as one big string. The query_string query clause (and searches performed as ?q=john) defaults to searching in the _all field if no other field is specified.

The _all field is useful during the exploratory phase of a new application, while you are still unsure about the final structure that your documents will have. You can throw any query string at it and you have a good chance of finding the document you’re after:

GET /_search
{
    "query": {
        "match": {
            "_all": "john smith marketing"
        }
    }
}

As your application evolves and your search requirements become more exacting, you will find yourself using the _all field less and less. The _all field is a shotgun approach to search. By querying individual fields, you have more flexibility, power, and fine-grained control over which results are considered to be most relevant.

Note

One of the important factors taken into account by the relevance algorithm is the length of the field: the shorter the field, the more important. A term that appears in a short title field is likely to be more important than the same term that appears somewhere in a long content field. This distinction between field lengths disappears in the _all field.
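As a rough illustration of the note above, the following Python sketch assumes Lucene's classic TF/IDF similarity, in which the field-length norm is the inverse square root of the number of terms in the field. The term counts are made up, and the norms that Lucene actually stores are encoded with reduced precision, so real values differ slightly.

```python
import math


def field_length_norm(num_terms):
    """Field-length norm under the assumed classic similarity: 1 / sqrt(terms)."""
    return 1.0 / math.sqrt(num_terms)


title_norm = field_length_norm(4)      # hypothetical 4-term title field
content_norm = field_length_norm(400)  # hypothetical 400-term content field
all_norm = field_length_norm(404)      # both fields merged into _all

print(title_norm, content_norm, all_norm)
```

A match in the short title field gets a much larger norm than a match in the long content field. Once both are merged into _all, every term shares the same small norm, and that distinction is lost.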

If you decide that you no longer need the _all field, you can disable it with this mapping:

PUT /my_index/_mapping/my_type
{
    "my_type": {
        "_all": { "enabled": false }
    }
}

Inclusion in the _all field can be controlled on a field-by-field basis by using the include_in_all setting, which defaults to true. Setting include_in_all on an object (or on the root object) changes the default for all fields within that object.

You may find that you want to keep the _all field around to use as a catchall full-text field just for specific fields, such as title, overview, summary, and tags. Instead of disabling the _all field completely, disable include_in_all for all fields by default, and enable it only on the fields you choose:

PUT /my_index/_mapping/my_type
{
    "my_type": {
        "include_in_all": false,
        "properties": {
            "title": {
                "type":           "string",
                "include_in_all": true
            },
            ...
        }
    }
}
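The way the include_in_all default and the per-field overrides combine can be sketched in Python. This is a simplified model that only handles top-level fields and joins values with spaces; the function build_all and the sample document are hypothetical and are not part of Elasticsearch.

```python
def build_all(doc, mapping):
    """Concatenate the values of fields whose include_in_all resolves to true."""
    # The setting on the root object provides the default for its fields.
    default = mapping.get("include_in_all", True)
    properties = mapping.get("properties", {})
    values = []
    for name, value in doc.items():
        field_mapping = properties.get(name, {})
        # A per-field setting overrides the inherited default.
        if field_mapping.get("include_in_all", default):
            values.append(str(value))
    return " ".join(values)


# Hypothetical mapping: excluded by default, enabled only on title.
mapping = {
    "include_in_all": False,
    "properties": {
        "title": {"type": "string", "include_in_all": True},
    },
}
doc = {"title": "Quick guide", "body": "Long text"}

all_value = build_all(doc, mapping)
print(all_value)
```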

Remember that the _all field is just an analyzed string field. It uses the default analyzer to analyze its values, regardless of which analyzer has been set on the fields where the values originate. And like any string field, you can configure which analyzer the _all field should use:

PUT /my_index/_mapping/my_type
{
    "my_type": {
        "_all": { "analyzer": "whitespace" }
    }
}

Metadata: Document Identity

There are four metadata fields associated with document identity:

_id
The string ID of the document
_type
The type name of the document
_index
The index where the document lives
_uid
The _type and _id concatenated together as type#id

By default, the _uid field is both stored (retrievable) and indexed (searchable). The _type field is indexed but not stored, and the _id and _index fields are neither indexed nor stored, so they do not exist as separate fields in the index.

In spite of this, you can query the _id field as though it were a real field. Elasticsearch uses the _uid field to derive the _id. Although you can change the index and store settings for these fields, you almost never need to do so.
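The relationship between _uid, _type, and _id can be sketched in Python. The helper names are hypothetical, and the sketch assumes that the type name itself contains no # character, so the first # in the _uid marks the boundary.

```python
def make_uid(doc_type, doc_id):
    """Build a _uid in the type#id form described above."""
    return f"{doc_type}#{doc_id}"


def id_from_uid(uid):
    """Recover the _id by splitting on the first '#'.

    Assumes the type name contains no '#'; the ID may contain one.
    """
    _type, _sep, doc_id = uid.partition("#")
    return doc_id


uid = make_uid("my_type", "123")
print(uid, id_from_uid(uid))
```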

The _id field does have one setting that you may want to use: the path setting tells Elasticsearch that it should extract the value for the _id from a field within the document itself.

PUT /my_index
{
    "mappings": {
        "my_type": {
            "_id": {
                "path": "doc_id" 
            },
            "properties": {
                "doc_id": {
                    "type":   "string",
                    "index":  "not_analyzed"
                }
            }
        }
    }
}

This mapping tells Elasticsearch to extract the document's _id from the doc_id field.

Then, when you index a document:

POST /my_index/my_type
{
    "doc_id": "123"
}

the _id value will be extracted from the doc_id field in the document body:

{
    "_index":   "my_index",
    "_type":    "my_type",
    "_id":      "123", 
    "_version": 1,
    "created":  true
}

The _id in the response matches the doc_id value, which shows that it was extracted correctly.

Warning

While this is very convenient, be aware that it has a slight performance impact on bulk requests (see Why the Funny Format?). The node handling the request can no longer use the optimized bulk format to parse just the metadata line in order to decide which shard should receive the request. Instead, it has to parse the document body as well.
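The extra work described in this warning can be sketched in Python. The sketch assumes that the target shard is chosen from a hash of the _id modulo the number of primary shards; zlib.crc32 is only a stand-in for the real hash function, and the shard count, function names, and sample lines are hypothetical.

```python
import json
import zlib

NUM_SHARDS = 5  # hypothetical number of primary shards


def pick_shard(doc_id, num_shards=NUM_SHARDS):
    """Stand-in routing: hash the ID and take it modulo the shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def route_from_metadata(action_line, num_shards=NUM_SHARDS):
    """Without path: only the short metadata line has to be parsed."""
    meta = json.loads(action_line)["index"]
    return pick_shard(meta["_id"], num_shards)


def route_from_body(body_line, path, num_shards=NUM_SHARDS):
    """With path: the whole document body must be parsed to find the ID."""
    doc = json.loads(body_line)
    return pick_shard(str(doc[path]), num_shards)


# One bulk item: a metadata line followed by the document body line.
action_line = '{"index": {"_index": "my_index", "_type": "my_type", "_id": "123"}}'
body_line = '{"doc_id": "123", "title": "A hypothetical document"}'

print(route_from_metadata(action_line), route_from_body(body_line, "doc_id"))
```

Both functions arrive at the same shard, but the second one has to parse the full document body, which grows more expensive as documents get larger.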