May 13, 2014

Managing Elasticsearch Fields When Searching

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

Controlling the number of fields returned for search requests is an important aspect of maximizing Elasticsearch's performance. In this article we'll look at how we can selectively return only the fields we're interested in for each search hit in order to optimize our usage pattern.

Introduction

When searching, the fields parameter lets us restrict which fields are returned for each hit. This feature is commonly used to reduce the amount of data transferred from Elasticsearch per hit by selecting only the relevant fields. While it may sound simple (and it is), the impact on performance can be tremendous: the savings are proportional to both the number of documents returned and the size of each document.

Consider the case where each document has an average size of 20KB. Returning 10 hits will cause an average of 200KB of data to be returned for each search request. Multiply this by 100 search requests/second and we end up needing 20 MB/s of bandwidth. If we can reduce the size of the returned documents to 800 bytes, which is plausible when returning just the required fields, each response will be roughly 8KB in size, and the total bandwidth needed for 100 requests/s is only 800 KB/s. Keeping the result sets as small as possible is certainly important in order to be able to scale sensibly.
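The arithmetic above can be reproduced with a short Python sketch. The function name is made up for illustration, and the calculation assumes decimal units (1 KB = 1000 bytes) and ignores HTTP and JSON envelope overhead, so treat the numbers as rough estimates only.

```python
def bandwidth_bytes_per_second(bytes_per_hit, hits_per_request, requests_per_second):
    """Approximate response payload per second, ignoring protocol overhead."""
    return bytes_per_hit * hits_per_request * requests_per_second


# Decimal units, matching the rounded figures used in the text.
KB = 1000
MB = 1000 * KB

# 20KB documents, 10 hits per request, 100 requests per second.
full_source = bandwidth_bytes_per_second(20 * KB, 10, 100)

# The same load when each hit is trimmed to roughly 800 bytes.
selected_fields = bandwidth_bytes_per_second(800, 10, 100)

print(full_source / MB)               # 20.0 (MB/s)
print(selected_fields / KB)           # 800.0 (KB/s)
print(full_source / selected_fields)  # 25.0 (times less data)
```

In this scenario, trimming each hit to the required fields cuts the transferred data by a factor of 25.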

By default, the _source field is enabled, and its contents are parsed and returned for each search hit. By limiting the returned fields we can save precious server-side CPU time and disk I/O, since we no longer have to load and parse the whole document source. Simply specifying fields when searching is enough to trigger this behavior. Should we also need the document source in addition to the fields, we can use source filtering, which we'll look at shortly.

One lesser-known feature of fields is the ability to select metadata fields as well. Of particular note is the _ttl field, which returns the number of milliseconds until the document expires, not the document's original lifespan. A very handy feature indeed.
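As a minimal sketch, a search body using fields might be built like this. Python is used here only to assemble and print the JSON; no request is sent. The helper name and the field names title and author are hypothetical, and selecting _ttl assumes that TTL is enabled in the mapping.

```python
import json


def search_with_fields(query, fields):
    """Build a search body that asks for only the named fields per hit."""
    return {"query": query, "fields": list(fields)}


# "title" and "author" are placeholder field names; "_ttl" is the metadata
# field discussed above.
body = search_with_fields({"match_all": {}}, ["title", "author", "_ttl"])

print(json.dumps(body, indent=4))
```

The printed JSON is what would be sent as the body of a search request.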

Nested Fields and Nested Data

Sometimes selecting simple fields is not enough, since we need additional data from nested fields that may be dynamic. In these cases we cannot use fields, because the request would fail as follows:

{
    ...,
    "_shards": {
        "failures": [
            {
                "index": "index-name",
                "shard": 2,
                "status": 400,
                "reason": "ElasticsearchIllegalArgumentException[field [field_with_nested_data] isn't a leaf field]"
            }
        ]
    },...
}

This exception is thrown because fields can only load data from leaf nodes (i.e., nodes that have no children). To load non-leaf nodes, we have to use the _source field in the search request, which enables a feature called source filtering. We'll look at that next.

Source Filtering

Source filtering allows us to control which parts of the original JSON document are returned for each hit. We can include or exclude parts of the original document based on patterns matching the field name path. It's worth keeping in mind that this only saves bandwidth, both between the nodes participating in the search and between the cluster and the client. It does not save CPU or disk I/O, as was the case when using fields. This is because with source filtering we still have to load and parse the entire source document for each returned hit in order to match it against the inclusion and exclusion patterns. Still, it's an important tool in our optimization toolbelt, and one that is very easy to get started with.
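A source-filtered search body could be sketched as follows, again using Python only to assemble the JSON. The helper name and the exclusion pattern are hypothetical; field_with_nested_data is the non-leaf field from the error shown earlier, and the include/exclude keys follow the 1.x request format.

```python
import json


def search_with_source_filter(query, include, exclude):
    """Build a search body that filters _source by field-path patterns."""
    return {
        "query": query,
        "_source": {
            "include": list(include),
            "exclude": list(exclude),
        },
    }


body = search_with_source_filter(
    {"match_all": {}},
    include=["field_with_nested_data.*"],  # keep the whole nested object
    exclude=["*.internal_notes"],          # hypothetical sub-field to drop
)

print(json.dumps(body, indent=4))
```

Because the patterns match on the field path, a single wildcard can cover nested keys that are not known ahead of time.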

Prior to 1.0, this was known as partial fields, which was deprecated and replaced by source filtering.

Unstored Fields

Returning the field data representation of a field using fielddata_fields is a fairly recent addition to Elasticsearch (added in 1.0). It can be used to load fields that are not even stored (i.e., fields that have store: false set in their mapping).

It's important to note that when returning data from the fielddata, the terms for that field are loaded into memory and cached. This costs additional memory the first time around, but the same caches are used when sorting and faceting on the field in question. In other words, if you're already sorting or faceting on the field, returning it through fielddata_fields is very cheap, but think twice before using it to return data from fields that are otherwise rarely used.

The field data representation of a field varies with the type and mapping of the field, which includes the analyzer. For example, boolean values are returned through fielddata_fields as the strings T and F, and most string fields are just a collection of sorted terms.
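A request using fielddata_fields might be assembled as in the sketch below. The helper name and the field names tags and is_published are placeholders, chosen on the assumption that these fields are already used for sorting or faceting, so their fielddata is likely to be cached.

```python
import json


def search_with_fielddata(query, fielddata_fields):
    """Build a search body that returns the fielddata form of each field."""
    return {"query": query, "fielddata_fields": list(fielddata_fields)}


# Hypothetical fields: an analyzed string field and a boolean field.
body = search_with_fielddata({"match_all": {}}, ["tags", "is_published"])

print(json.dumps(body, indent=4))
```

With such a request, a boolean field like is_published would come back as T or F instead of true or false, as described above.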

Script Fields

Script fields let us return the result of evaluating a script for each hit. The script may perform a computation based on one or more existing fields (including metadata fields), access and return parts of the source document, or return basically any other value. Whatever the script returns is used as the value of the field.

Here is an example that shows how we can see both the remaining time and the expiration date of documents in our index, assuming we're using document TTLs:

{
    "fields": [
        "_ttl"
    ],
    "script_fields": {
        "expires_at": {
            "script": "new Date(doc._ttl.getValue())"
        }
    }
}

Using script fields requires dynamic scripts to be enabled, and this has some important security implications you need to be aware of.

Conclusion

Having looked at our options for selecting only the relevant fields to return for search requests, we're now better equipped to optimize the amount of data transferred between our Elasticsearch cluster and our search clients. We've also seen how to return fields and information that Elasticsearch searches don't normally return but that we may still be interested in.