Getting started with runtime fields, Elastic’s implementation of schema on read

Historically, Elasticsearch has relied on a schema on write approach to make searching data fast. We are now adding schema on read capabilities to Elasticsearch so that users have the flexibility to alter a document's schema after ingest and also generate fields that exist only as part of the search query. Together, schema on read and schema on write provide users with the choice to balance performance and flexibility based on their needs.

Our solution for schema on read is runtime fields, which are evaluated only at query time. They are defined in the index mapping or in the query, and once defined they are immediately available for search requests, aggregations, filtering, and sorting. Because runtime fields aren’t indexed, adding a runtime field doesn’t increase the index size. They can, in fact, reduce storage costs and increase the speed of ingestion.

However, there are tradeoffs. Queries against runtime fields can be expensive, so data that you commonly search or filter on should still be mapped to indexed fields. Runtime fields can also decrease search speed, even though your index size is smaller. We recommend using runtime fields in tandem with indexed fields to find the right balance between ingest speed, index size, flexibility, and search performance for your use cases.

It’s easy to add runtime fields

The easiest way to define a runtime field is in the query. For example, if we have the following index:

 PUT my_index
 {
   "mappings": {
     "properties": {
       "address": {
         "type": "ip"
       },
       "port": {
         "type": "long"
       }
     }
   } 
 }

And load a few documents into it:

 POST my_index/_bulk
 {"index":{"_id":"1"}}
 {"address":"1.2.3.4","port":"80"}
 {"index":{"_id":"2"}}
 {"address":"1.2.3.4","port":"8080"}
 {"index":{"_id":"3"}}
 {"address":"2.4.8.16","port":"80"}

We can concatenate the two fields, separated by a static string, as follows:

 GET my_index/_search
 {
   "runtime_mappings": {
     "socket": {
       "type": "keyword",
       "script": {
         "source": "emit(doc['address'].value + ':' + doc['port'].value)"
       }
     }
   },
   "fields": [
     "socket"
   ],
   "query": {
     "match": {
       "socket": "1.2.3.4:8080"
     }
   }
 }

Yielding the following response:

…
     "hits" : [
       {
         "_index" : "my_index",
         "_type" : "_doc",
         "_id" : "2",
         "_score" : 1.0,
         "_source" : {
           "address" : "1.2.3.4",
           "port" : "8080"
         },
         "fields" : {
           "socket" : [
             "1.2.3.4:8080"
           ]
         }
       }
     ]

We defined the field socket in the runtime_mappings section. We used a short Painless script that defines how the value of socket will be calculated per document (using + to indicate concatenation of the value of the address field with the static string ‘:’ and the value of the port field). We then used the field socket in the query. The field socket is an ephemeral runtime field that exists only for this query and is calculated when the query is run. When defining a Painless script to use with runtime fields, you must include emit to return calculated values.
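
To make the per-document evaluation concrete, here is a minimal Python sketch that mimics what the script does for the three sample documents. It is only a conceptual model, not how Elasticsearch is implemented: the `socket` and `search` functions are hypothetical stand-ins for the Painless script and the match query, and the ports are written as numbers because the mapping declares port as a long.

```python
# Conceptual model only: the sample documents from the bulk request above.
docs = {
    "1": {"address": "1.2.3.4", "port": 80},
    "2": {"address": "1.2.3.4", "port": 8080},
    "3": {"address": "2.4.8.16", "port": 80},
}


def socket(doc):
    """Hypothetical stand-in for the Painless script: address + ':' + port."""
    return f"{doc['address']}:{doc['port']}"


def search(docs, wanted):
    """Compute the runtime value for each document and keep the matches."""
    hits = {}
    for doc_id, doc in docs.items():
        value = socket(doc)  # calculated at query time, never stored
        if value == wanted:
            hits[doc_id] = value
    return hits


hits = search(docs, "1.2.3.4:8080")
print(hits)  # {'2': '1.2.3.4:8080'}
```

As in the real response, only document 2 matches, and the computed value is returned alongside the hit without ever being written to the index.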

If we find that socket is a field that we want to use in multiple queries without having to define it per query, we can simply add it to the mapping by making the call:

 PUT my_index/_mapping
 {
   "runtime": {
     "socket": {
       "type": "keyword",
       "script": {
         "source": "emit(doc['address'].value + ':' + doc['port'].value)"
       }
     } 
   } 
 }

And then the query does not have to include the definition of the field, for example:

 GET my_index/_search
 {
   "fields": [
     "socket"
   ],
   "query": {
     "match": {
       "socket": "1.2.3.4:8080"
     }
   }
 }

The statement "fields": ["socket"] is only required if you want to display the value of the socket field. While the field socket is now available to any query, it does not exist in the index and does not increase the index’s size. The socket field is calculated only when a query requires it, and only for the documents that need it.

Consumed like any field

Because runtime fields are exposed through the same API as indexed fields, a query can refer to some indices where the field is a runtime field, and other indices where the field is an indexed field. You have the flexibility to choose which fields to index and which ones to keep as runtime fields. This separation between field generation and field consumption facilitates more organized code that is easier to create and maintain.

You define runtime fields in the index mapping or in the search request. This inherent capability provides flexibility in how you use runtime fields in conjunction with indexed fields. 

Override field values at query time

Often, you discover mistakes in your production data only when it's too late. While it is easy to fix the ingest instructions for documents that you will ingest in the future, it’s much more challenging to fix the data that has already been ingested and indexed. With runtime fields, you can correct errors in your indexed data at query time: a runtime field shadows an indexed field with the same name, so its value overrides the indexed one.

Here’s a simple example to make this more concrete. Let’s say we have an index with a raw_message field and an address field:

PUT my_raw_index
{
  "mappings": {
    "properties": {
      "raw_message": {
        "type": "keyword"
      },
      "address": {
        "type": "ip"
      }
    }
  }
}

And let’s load a document into it:

POST my_raw_index/_doc/1
{
  "raw_message": "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] GET /history/apollo/ HTTP/1.0 200 6245",
  "address": "1.2.3.4"
}

Alas, the document contains a wrong IP address in the address field. The correct IP address exists in the message but somehow the wrong address was parsed out in the document that was sent to be ingested into Elasticsearch and indexed. For a single document, that’s not a problem, but what if we discover after a month that 10% of our documents contain a wrong address? Fixing it for new documents is not a big deal, but reindexing the documents that were already ingested is frequently operationally complex. With runtime fields, it can be fixed immediately, by shadowing the indexed field with a runtime field. Here is how you would do it in a query:

GET my_raw_index/_search
{
  "runtime_mappings": {
    "address": {
      "type": "ip",
      "script": "Matcher m = /\\d+\\.\\d+\\.\\d+\\.\\d+/.matcher(doc[\"raw_message\"].value);if (m.find()) emit(m.group());"
    }
  },
  "fields": [ 
    "address"
  ]
}

You can also make this change in the mapping so that it applies to all queries. Note that regular expressions are now enabled by default in Painless scripts.
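
Before shadowing a field for every query, it can help to check the pattern against a sample message. The following Python sketch does that with the standard re module and then builds a request body that could be sent to PUT my_raw_index/_mapping. It assumes that this simple pattern behaves the same in Python and in Painless for the sample message; the variable names are illustrative only.

```python
import json
import re

raw_message = (
    "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "
    "GET /history/apollo/ HTTP/1.0 200 6245"
)

# Same pattern as in the Painless script, checked locally on the sample message.
IP_PATTERN = re.compile(r"\d+\.\d+\.\d+\.\d+")
match = IP_PATTERN.search(raw_message)
extracted = match.group() if match else None
print(extracted)  # 199.72.81.55

# Mapping-level version of the shadowing runtime field.
# json.dumps takes care of escaping the backslashes for the request body.
painless = (
    r"Matcher m = /\d+\.\d+\.\d+\.\d+/.matcher(doc['raw_message'].value); "
    "if (m.find()) emit(m.group());"
)
mapping_body = {
    "runtime": {
        "address": {"type": "ip", "script": {"source": painless}}
    }
}
print(json.dumps(mapping_body, indent=2))
```

The script only emits a value when the pattern matches, so documents whose raw_message contains no address simply have no value for the shadowing field.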

Balance performance and flexibility

With indexed fields, you make all the preparations during ingest and maintain the sophisticated data structures to provide optimal performance. But querying runtime fields is slower than querying indexed fields. So what if your queries are slow after you start using runtime fields?

We recommend using asynchronous search when retrieving a runtime field. The full result set is returned just like in a synchronous search, provided that the query completes within a given time threshold. However, even if the query doesn't finish in that time, you still get a partial result set, and Elasticsearch continues running the search so that you can poll until the complete result set is returned. This mechanism is particularly useful when managing an index lifecycle because newer results typically return first and are also typically more important to users.
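
The submit-then-poll flow can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than a client implementation: `send(method, path, body)` is a hypothetical transport function that performs the HTTP call and returns the decoded JSON, and the fake transport below exists only so that the example runs without a cluster. The paths and the is_running, is_partial, and id response fields follow the async search API.

```python
import time


def run_async_search(send, index, body, wait="2s", poll_interval=1.0, max_polls=30):
    """Submit an async search, then poll until it finishes or max_polls is reached.

    `send` is a hypothetical callable: send(method, path, body) -> dict.
    """
    path = f"/{index}/_async_search?wait_for_completion_timeout={wait}"
    result = send("POST", path, body)
    polls = 0
    # While the search is still running, the response carries an id to poll with.
    while result.get("is_running") and polls < max_polls:
        time.sleep(poll_interval)
        result = send("GET", f"/_async_search/{result['id']}", None)
        polls += 1
    return result


def fake_send_factory():
    """Fake transport for illustration: the search completes on the third call."""
    calls = []

    def send(method, path, body):
        calls.append((method, path))
        done = len(calls) >= 3
        return {
            "id": "abc123",
            "is_running": not done,
            "is_partial": not done,
            "response": {"hits": {"hits": []}},
        }

    return send, calls


send, calls = fake_send_factory()
query = {"query": {"match": {"socket": "1.2.3.4:8080"}}}
final = run_async_search(send, "my_index", query, poll_interval=0)
print(final["is_running"], len(calls))  # False 3
```

Each intermediate response may already contain partial results, so a caller could display them while polling instead of waiting for the final response.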

To provide optimal performance, we rely on the indexed fields to do the heavy lifting of the query so that the values of runtime fields are only calculated for a subset of the documents.
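
The effect of letting indexed fields narrow the candidate set can be illustrated with a small Python model. It is a conceptual sketch, not Elasticsearch internals: the documents are synthetic, `CountingSocketField` is a hypothetical stand-in for the socket runtime field, and the `indexed_filter` argument plays the role of a clause on the indexed port field.

```python
class CountingSocketField:
    """Stand-in for the socket runtime field that counts its evaluations."""

    def __init__(self):
        self.evaluations = 0

    def value(self, doc):
        self.evaluations += 1
        return f"{doc['address']}:{doc['port']}"


def search(docs, wanted, indexed_filter=None):
    """Return matching docs and how many times the runtime value was computed."""
    field = CountingSocketField()
    # An indexed-field clause narrows the candidates before any script runs.
    candidates = docs if indexed_filter is None else [d for d in docs if indexed_filter(d)]
    hits = [d for d in candidates if field.value(d) == wanted]
    return hits, field.evaluations


# Synthetic data: 1000 documents, one in ten on port 8080.
docs = [
    {"address": f"10.0.0.{i % 4}", "port": 8080 if i % 10 == 0 else 80}
    for i in range(1000)
]

all_hits, all_evals = search(docs, "10.0.0.0:8080")
few_hits, few_evals = search(
    docs, "10.0.0.0:8080", indexed_filter=lambda d: d["port"] == 8080
)

print(len(all_hits), all_evals)  # 50 1000
print(len(few_hits), few_evals)  # 50 100
```

Both searches return the same 50 documents, but the filtered search computes the runtime value for only 100 documents instead of 1000. In a real request, the comparable approach would be to combine a clause on an indexed field with the runtime field clause in the same query.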

Changing a field from runtime to indexed

Runtime fields allow users to flexibly change their mapping and parsing while working on data in a live environment. Because a runtime field is not indexed and therefore does not consume index resources, and because the script that defines it can be changed, users can experiment until they reach the optimal mapping. When a runtime field is found to be useful for the long term, it is possible to precalculate its value at index time by simply defining that field in the template as an indexed field and making sure that the ingested document includes it. The field will be indexed from the next index rollover and provide better performance. The queries that use the field do not need to change at all.

This scenario is particularly useful with dynamic mapping. On the one hand, it is very helpful to allow new documents to generate new fields, because that way the data in them can be immediately used (the structure of entries frequently changes, e.g., due to a change in the software that generates the log). On the other hand, dynamic mapping comes with the risk of burdening the index and even creating a mapping explosion, because you never know if some document might surprise you with 2000 new fields. Runtime fields can provide a solution to this scenario. The new fields can be automatically created as runtime fields so as not to burden the index (since they do not exist in the index), and they are not counted in the index.mapping.total_fields.limit. These automatically created runtime fields are queryable, albeit with lower performance, so users can use them and, if needed, decide to change them to indexed fields in the next rollover.   

We recommend using runtime fields initially to experiment with your data structure. After working with your data, you might decide to index a runtime field for better search performance. You can create a new index and then add the field definition to the index mapping, add the field to _source and make sure the new field is included in the ingested documents. If you're using data streams, you can update your index template so that when indices are created from that template, Elasticsearch knows to index that field. In a future release, we plan to make the process of changing a runtime field to an indexed field as simple as moving the field from the runtime section of the mapping to the properties section. 

The following request creates a simple index mapping with a timestamp field. Including "dynamic": "runtime" instructs Elasticsearch to dynamically create additional fields in this index as runtime fields. If a runtime field includes a Painless script, the value of the field will be calculated based on the Painless script. If a runtime field is created without a script, as shown in the following request, the system will look for a field in _source that has the same name as the runtime field and will use its value as the value of the runtime field.

PUT my_index-1
{
  "mappings": {
    "dynamic": "runtime",
    "properties": {
      "timestamp": {
        "type": "date",
        "format": "yyyy-MM-dd"
      }
    }
  }
}

Let’s index a document to see the advantages of these settings:

POST my_index-1/_doc/1
{
  "timestamp": "2021-01-01",
  "message": "my message",
  "voltage": "12"
}

Now that we have an indexed timestamp field and two runtime fields (message and voltage), we can view the index mapping:

GET my_index-1/_mapping

The runtime section includes message and voltage. These fields are not indexed, but we can still query them exactly as if they were indexed fields.

{
  "my_index-1" : {
    "mappings" : {
      "dynamic" : "runtime",
      "runtime" : {
        "message" : {
          "type" : "keyword"
        },
        "voltage" : {
          "type" : "keyword"
        }
      },
      "properties" : {
        "timestamp" : {
          "type" : "date",
          "format" : "yyyy-MM-dd"
        }
      }
    }
  }
}

We’ll create a simple search request that queries on the message field:

GET my_index-1/_search
{
  "query": {
    "match": {
      "message": "my message"
    }
  }
}

The response includes the following hits:

... 
    "hits" : [
      {
        "_index" : "my_index-1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "timestamp" : "2021-01-01",
          "message" : "my message",
          "voltage" : "12"
        }
      }
    ]
…

Looking at this response, we notice a problem: we didn’t specify that voltage is a number! Because voltage is a runtime field, that’s easy to fix by updating the field definition in the runtime section of the mapping:

PUT my_index-1/_mapping
{
  "runtime":{
    "voltage":{
      "type": "long"
    }
  }
}

The previous request changes the type of voltage to long, which immediately takes effect for documents that were already indexed. To test that behavior, we construct a simple query for all documents with a voltage between 11 and 13:

GET my_index-1/_search
{
  "query": {
    "range": {
      "voltage": {
        "gt": 11,
        "lt": 13
      }
    }
  }
}

Because our voltage was 12, the query returns our document in my_index-1. If we view the mapping again, we’ll see that voltage is now a runtime field of type long, and the new type applies even to documents that were ingested into Elasticsearch before we updated the mapping:

...
{
  "my_index-1" : {
    "mappings" : {
      "dynamic" : "runtime",
      "runtime" : {
        "message" : {
          "type" : "keyword"
        },
        "voltage" : {
          "type" : "long"
        }
      },
      "properties" : {
        "timestamp" : {
          "type" : "date",
          "format" : "yyyy-MM-dd"
        }
      }
    }
  }
}
…

Later, we might decide that voltage is useful in aggregations and we want to index it into the next index that's created in a data stream. We create a new index (my_index-2) that matches the index template for the data stream and define voltage as an integer, knowing which data type we want after experimenting with runtime fields.

Ideally, we would update the index template itself so that changes take effect on the next rollover. You can run queries on the voltage field in any index matching the my_index* pattern, even though the field is a runtime field in one index and an indexed field in another.

PUT my_index-2
{
  "mappings": {
    "dynamic": "runtime",
    "properties": {
      "timestamp": {
        "type": "date",
        "format": "yyyy-MM-dd"
      },
      "voltage": {
        "type": "integer"
      }
    }
  }
}

With runtime fields, we have therefore introduced a new field lifecycle workflow. In this workflow, a field can automatically be generated as a runtime field without impacting resource consumption and without risking mapping explosion, allowing users to immediately start working with the data. The field’s mapping can be refined on real data while it is still a runtime field, and due to the flexibility of runtime fields, the changes take effect on documents that were already ingested into Elasticsearch. When it is clear that the field is useful, the template can be changed so that in the indices that will be created from that point on (after the next rollover), the field will be indexed for optimal performance.
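
The promotion step in this workflow can be sketched as a small Python helper that takes the mappings of the current index and produces the mappings for the next one. This is an illustration under stated assumptions: `promote_to_indexed` is a hypothetical helper, not an Elasticsearch API, and the resulting body is meant for a composable index template request (PUT _index_template/<name>) whose name you would choose yourself.

```python
import copy
import json


def promote_to_indexed(mappings, field, indexed_type):
    """Return new mappings where `field` moves from `runtime` to `properties`.

    Hypothetical helper: the input mappings are left untouched.
    """
    new = copy.deepcopy(mappings)
    runtime = new.get("runtime", {})
    runtime.pop(field, None)
    if not runtime:
        # Drop an empty (or missing) runtime section entirely.
        new.pop("runtime", None)
    new.setdefault("properties", {})[field] = {"type": indexed_type}
    return new


# Mappings of my_index-1 after voltage was changed to a long runtime field.
current = {
    "dynamic": "runtime",
    "runtime": {
        "message": {"type": "keyword"},
        "voltage": {"type": "long"},
    },
    "properties": {
        "timestamp": {"type": "date", "format": "yyyy-MM-dd"},
    },
}

next_mappings = promote_to_indexed(current, "voltage", "integer")

# Body for an index template covering the my_index* pattern.
template_body = {
    "index_patterns": ["my_index*"],
    "template": {"mappings": next_mappings},
}
print(json.dumps(template_body, indent=2))
```

Indices created from the updated template index voltage, while existing indices keep it as a runtime field, so queries across the pattern continue to work unchanged.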

Summary

For the great majority of cases, and in particular, if you know your data and what you wish to do with it, indexed fields are the way to go due to their performance advantage. On the other hand, when there is a need for flexibility in document parsing and schema structure, runtime fields now provide the answer.

Runtime fields and indexed fields are complementary features that form a symbiosis. Runtime fields offer flexibility, but they would never be able to perform well in a high-scale environment without the support of the index. The powerful, rigid structure of the index provides a sheltered environment in which the flexibility of runtime fields can show its true colors, in a way not too different from how algae find shelter in a coral. Everyone benefits from this symbiosis.

Get started today

To get started with runtime fields, spin up a cluster on the Elasticsearch Service or install the latest version of the Elastic Stack. Already have Elasticsearch running? Just upgrade your clusters to 7.11 and give it a try. For a higher-level view of runtime fields and their benefits, please read the Runtime fields: Schema on read for Elastic blog post. In addition, we have also recorded four videos to help you get started using runtime fields.