30 March 2017 Engineering

Little Logstash Lessons: Using Logstash to help create an Elasticsearch mapping template

By Aaron Mildenstein

In Little Logstash Lessons Part 1, you configured Logstash to parse your data and send it to Elasticsearch.  Elasticsearch does a terrific job of guessing what your data types are, and how to handle them.  But what if you know you’re going to have a lot of data, and you want to tune Elasticsearch to store that data as efficiently as possible?  This has a lot of benefits, including reduced storage requirements, but it can also help reduce memory requirements for aggregations, and other large and complex queries.  The way to do this is with a mapping template.

What are mappings?


Simply put, mappings define how fields are interpreted by Elasticsearch.  Elasticsearch is powerful because you can send data to it, and it will make very intelligent guesses as to what each field’s mapping should be.  We refer to this as “schema-less” indexing, which is a terrific option if you’re just getting started with Elasticsearch.  It’s also great in cases where you are continuously sending new data and you do not know what fields will be in it. As mentioned, however, there are great benefits in knowing what data is going to Elasticsearch, and “mapping,” or telling Elasticsearch exactly how that data should be treated.  For example, if I send an IP address (like 8.8.8.8, which is one of Google’s DNS servers), it will appear as a string field in the JSON sent to Elasticsearch:

{ "ip" : "8.8.8.8" }


Even though the IP address is an IP to you and me, Elasticsearch will think of it as a string.  For it to be more, you have to tell Elasticsearch how to treat the value, and the way to do that is with a mapping. If an IP is mapped as type ip in Elasticsearch, you can do a query, and filter by an IP range—you can even use CIDR notation—to include only IPs within a given subnet.
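
As a rough illustration of the CIDR filtering described above, here is a small Python sketch that builds such a query body. It assumes the field is named ip and is mapped as type ip, and it uses a term query with CIDR notation, which Elasticsearch 5.x accepts for ip fields. The helper names are my own, and the local in_subnet check only shows which sample addresses the subnet covers; it is not how Elasticsearch evaluates the query.

```python
import ipaddress
import json


def cidr_query(field, cidr):
    """Build a search body matching documents whose ip-typed field is in cidr."""
    # Validate the CIDR locally first; raises ValueError if it is malformed.
    ipaddress.ip_network(cidr)
    return {"query": {"term": {field: cidr}}}


def in_subnet(address, cidr):
    """Local subnet membership check, for illustration only."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)


# "ip" is the field name from the example document above.
body = cidr_query("ip", "8.8.8.0/24")
print(json.dumps(body, indent=2))
print(in_subnet("8.8.8.8", "8.8.8.0/24"))
print(in_subnet("8.8.4.4", "8.8.8.0/24"))
```

If the field were still mapped as text, a query like this would not behave as a subnet filter, which is the practical reason to map it as ip.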


While there is a lot of documentation about mappings themselves—and some pointers will be provided here—the primary goal of this post is to show you a relatively easy way to get Elasticsearch to create a mapping for you, and then take that and make a mapping template out of it.


NOTE: For the purposes of this exercise, we will be using version 5.3.0 of both Logstash and Elasticsearch.  If you are using an older version of Elasticsearch, particularly a 2.x version, your mapping will probably look very different from this one.

Get Logstash to send sample data to Elasticsearch


Let’s try an example with Apache access log data.

input { stdin {} }
filter {  
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
    remove_field => "message"
  }
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    locale => "en"
    remove_field => ["timestamp"]
  }
  geoip {
    source => "clientip"
  }
  useragent {
    source => "agent"
    target => "useragent"
  }
}
output {
  elasticsearch {
    hosts => [ "127.0.0.1" ]
    index => "my_index"
    document_type => "mytype"
  }
}


With a setup like this, you are sending Apache access log data to a specifically named index.  This index will only be temporary, so give it a name distinct enough that it won’t be confused with any other index.  The best part is, I can do this on my laptop.  I don’t need to run this on a production node, or even a staging one.  I can spin up a local instance of Elasticsearch and Logstash and get the results I need that way.
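
To get a feel for what the grok and date filters in that configuration hand to Elasticsearch, here is a Python sketch that mimics them on a single log line. The regular expression is a simplified stand-in for COMBINEDAPACHELOG, not the real pattern, and the sample line is made up for illustration. Only the group names mirror the fields grok produces.

```python
import re
from datetime import datetime

# Simplified stand-in for the COMBINEDAPACHELOG grok pattern (an assumption,
# not the real pattern). Group names mirror the field names grok emits.
LINE_RE = re.compile(
    r'(?P<clientip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical sample line in combined log format.
SAMPLE = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)


def parse_line(line):
    """Split a combined-format access log line into named string fields."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError("line does not look like a combined access log entry")
    fields = match.groupdict()
    # Rough equivalent of the date filter: parse the timestamp, then drop the
    # raw field. Note that %b depends on the process locale being English.
    fields["@timestamp"] = datetime.strptime(
        fields.pop("timestamp"), "%d/%b/%Y:%H:%M:%S %z"
    )
    return fields


event = parse_line(SAMPLE)
for name, value in sorted(event.items()):
    print(name, repr(value))
```

Notice that every captured value, including response and bytes, comes out as a string. That detail matters later when deciding how to map those fields.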


Let’s send a few lines now:

head -50 /path/to/apache/access_log | /usr/share/logstash/bin/logstash -f above_sample.conf


And the output in the Elasticsearch node will look something like this:

[2017-03-29T13:48:42,794][INFO ][o.e.c.m.MetaDataCreateIndexService] [iXsjSQY] [my_index] creating index, cause [auto(bulk api)], templates [], shards [5]/[1], mappings []

Get and edit the mapping


Now that we have an index, let’s see what our mapping looks like:

curl -XGET 'http://127.0.0.1:9200/my_index/_mapping?pretty' > my_mapping.json


Now if you open my_mapping.json in your favorite text editor (one that won’t reformat the file the way a word processor would), you’ll see something like this:

{
  "my_index" : {
    "mappings" : {
      "mytype" : {
        "properties" : {


...followed by a few hundred other lines of content.  Let’s look at what we have here:


Of course, my_index is the index name we specified, and mappings indicates that the following defines the mappings for the index. mytype is the document_type we specified in the elasticsearch output plugin, and properties is where all of the fields will be defined.


As you can see, almost every field is of type text, and contains a secondary sub-field of type keyword. The keyword type is important because it treats the entire contents of the field (up to the ignore_above number of characters) as a single entity.  This ensures that an operating system name in a user agent field, like “Windows 10”, is identified by Elasticsearch as “Windows 10” rather than being split into “Windows” and “10”.


What about some of the other fields? Take a look at the nested fields in the geoip section:

          "geoip" : {
            "properties" : {
              "city_name" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {


Do you see how there’s another properties block within geoip?  It’s just like the one encapsulating all of the fields.  Whenever you encounter a new level in a mapping definition, you’ll see another properties block.


Within the geoip object, there are a few numeric fields, and an IP field:

              "dma_code" : {
                "type" : "long"
              },
              "ip" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              },
              "latitude" : {
                "type" : "float"
              },
              "location" : {
                "type" : "float"
              },
              "longitude" : {
                "type" : "float"
              },


At this point, I want to make a few important points about Logstash and grok, and their relation to Elasticsearch mapping. No matter how you may cast a field with grok (which supports int and float types), Elasticsearch will not know exactly what you intend unless you map the field explicitly.  Logstash will render your log line as JSON, containing all of the fields you configured it to contain.  Typically, this will contain strings and numbers. Numbers will be either integers (whole numbers) or floating point values (numbers with decimals).  After that, it’s up to Elasticsearch to interpret these field types, either by having them explicitly mapped, or through the previously mentioned schema-less approach.


We have already seen how well Elasticsearch handles strings and text values when using a schema-less approach. It also makes its best guess with numeric values.  Generally, any integer value will turn up as a long. Values with decimals will show up as floats or doubles. As already mentioned, you can save storage space as well as some overhead for calculations and aggregations if you map your values with the smallest numeric type that can hold them. This is one of the primary benefits of creating a custom mapping. For instance, latitude and longitude can be represented by half_floats if you can accept reduced precision (a half_float keeps only about three significant decimal digits), and the dma_code can be represented as a short. Each of these changes reduces overhead in Elasticsearch.
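
If you want to check which numeric type is the smallest that fits, a quick sketch like the one below can help. The integer ranges are those of Elasticsearch's byte, short, integer, and long types. The helper names and the sample values are my own, and the half-float round trip relies on Python's struct module to approximate what 16-bit IEEE 754 storage does to a value.

```python
import struct

# Inclusive ranges of Elasticsearch's integer field types, smallest first.
INTEGER_TYPES = [
    ("byte", -2**7, 2**7 - 1),
    ("short", -2**15, 2**15 - 1),
    ("integer", -2**31, 2**31 - 1),
    ("long", -2**63, 2**63 - 1),
]


def smallest_integer_type(low, high):
    """Return the smallest integer type that holds every value in [low, high]."""
    for name, type_min, type_max in INTEGER_TYPES:
        if type_min <= low and high <= type_max:
            return name
    raise ValueError("range does not fit in a 64-bit integer")


def half_float_round_trip(value):
    """Return value after a round trip through 16-bit IEEE 754 precision."""
    return struct.unpack("e", struct.pack("e", value))[0]


# Three-digit codes fit in a short but not in a byte.
print(smallest_integer_type(0, 999))
# A sample latitude loses its third decimal place as a half float.
print(half_float_round_trip(37.751))
```

The second result is worth a look before you commit to half_float for coordinates: whether that loss is acceptable depends on how precisely you need to place points.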


And what about our IP address example from the beginning?

          "ip" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },


This ip field is being identified as text and keyword, and not as an IP.  To map an IP field as type ip, replace that entire block with this:

          "ip" : { "type" : "ip" },


Pretty simple, right?  At this point, you may wonder why I am instructing you to map only the ip field in the geoip object, rather than the clientip field, or both.  This is because in certain web server configurations, the log data that becomes the clientip field can contain an IPv4 address, an IPv6 address, or a host name.  If you were to map the clientip field as type ip, and a log line with a host name were pushed to Elasticsearch, the value would fail to parse as an IP address, and that document would be rejected rather than indexed.
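
If you are unsure whether your own clientip values are always addresses, you could scan a sample of them before deciding. The sketch below uses Python's ipaddress module as an approximation of that check. The function name and the sample values are hypothetical, and Elasticsearch's own parser may differ in edge cases.

```python
import ipaddress


def is_ip_literal(value):
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return False
    return True


# Hypothetical clientip values: two addresses and one host name.
samples = ["8.8.8.8", "2001:4860:4860::8888", "crawler.example.com"]
for sample in samples:
    print(sample, is_ip_literal(sample))
```

If any value in your sample comes back False, leaving clientip mapped as text and keyword is the safer choice.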


The location field is also a special mapping type. It is an array of longitude and latitude, indicating a single point on a map. Elasticsearch stores these as mapping type geo_point, which enables users to do geographic searches and plot points on maps in Kibana.  Map a geo_point like this:

              "location" : { "type" : "geo_point" },


With all of these changes, the edited mapping for this section should appear as follows:

              "dma_code" : { "type" : "short" },
              "ip" : { "type" : "ip" },
              "latitude" : { "type" : "half_float" },
              "location" : { "type" : "geo_point" },
              "longitude" : { "type" : "half_float" },


The response field, which is outside the geoip block, is also a number, but it ended up stored as a string.  This is because the grok subpattern that created this field does not cast it as an int or float.  That doesn’t matter, though! Mappings can coerce a number stored as a string into a numeric mapping type. Since HTTP response codes will only ever have 3 digits, and the smallest mapping type to account for that is a short, it should look like this:

          "response" : { "type" : "short" },

Make the template


So now we have a big mapping file with a few changes.  How do we change this mapping file into a template?


It's actually quite simple to turn a regular mapping into a mapping template!  This is a sample of what a mapping template might look like, including where to put your edited properties section:

{
  "template" : "my_index*",
  "version" : 50001,
  "settings" : {
    "index.refresh_interval" : "5s"
  },
  "mappings" : {
    "_default_" : {
    YOUR PROPERTIES SECTION GOES HERE
    }
  }
}


An Elasticsearch template at its most basic contains an index pattern to match, identified as template, and your default mappings.  In this example, I’ve included a version, in case you want to track changes to your template, and a default refresh_interval of 5 seconds, which helps performance under higher indexing loads.


To make this a complete template, you will need to replace a few lines from the my_mapping.json file.  Copy that file to my_template.json, and open it in your editor.  Then replace the 4 top lines, which should look like this:

{
  "my_index" : {
    "mappings" : {
      "mytype" : {


With these lines:

{
  "template" : "my_index*",
  "version" : 50001,
  "settings" : {
    "index.refresh_interval" : "5s"
  },
  "mappings" : {
    "_default_" : {


The very next line should be:

        "properties" : {


IMPORTANT:
We removed 4 opening curly braces and replaced them with 3.  That means we have one too many closing curly braces at the end of our file, so scroll down and remove one of them.

      }
    }
  }
}


Should be:

    }
  }
}
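
If counting curly braces by hand feels error-prone, the same conversion can be sketched in Python. This is only one way to do it, not part of the workflow above: the function names are my own, and it assumes the index name, document type, and template values used in this post.

```python
import copy
import json


def mapping_to_template(mapping, index_name, doc_type, pattern, version=50001):
    """Wrap the mapping of one document type in a template body."""
    # Copy so the caller's mapping dict is left untouched.
    type_mapping = copy.deepcopy(mapping[index_name]["mappings"][doc_type])
    return {
        "template": pattern,
        "version": version,
        "settings": {"index.refresh_interval": "5s"},
        "mappings": {"_default_": type_mapping},
    }


def convert_file(source_path, target_path):
    """Read an edited mapping file and write the matching template file."""
    with open(source_path) as source:
        mapping = json.load(source)
    template = mapping_to_template(mapping, "my_index", "mytype", "my_index*")
    with open(target_path, "w") as target:
        json.dump(template, target, indent=2)


# Example call (not run here):
# convert_file("my_mapping.json", "my_template.json")
```

Because the JSON is parsed and re-serialized, the braces always balance, and a syntax slip in your edits shows up as a parse error before you upload anything.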

Save the template and upload it


Save your file, and you’re ready to upload it:

curl -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/_template/my_index_template?pretty' -d @my_template.json


If your mapping is accurate, you will see:

{
  "acknowledged" : true
}
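
The same upload could also be scripted. Here is a minimal sketch using only Python's standard library, assuming Elasticsearch is listening on localhost:9200 as in the curl command. The function names are hypothetical, and building the request is kept separate from sending it so you can inspect it first.

```python
import json
import urllib.request


def build_template_request(template, name, host="http://localhost:9200"):
    """Build (but do not send) a PUT request for the _template endpoint."""
    url = "{}/_template/{}".format(host, name)
    data = json.dumps(template).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def upload_template(template, name, host="http://localhost:9200"):
    """Send the template and return the decoded JSON response."""
    request = build_template_request(template, name, host)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (needs a running Elasticsearch node, so it is not run here):
# with open("my_template.json") as handle:
#     print(upload_template(json.load(handle), "my_index_template"))
```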

Test the template


So, now it’s time to test our template. Delete my_index from Elasticsearch by running the following line in the Kibana Console:

DELETE /my_index


or this line at the command-line:

curl -XDELETE 'http://localhost:9200/my_index?pretty'


Re-run the Logstash sample command you ran earlier to test your template. When you check the mapping of my_index, you will see all of your changes!  
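
With a few hundred lines of mapping, checking by eye gets tedious. One way to confirm the changes is to flatten the new mapping into dotted field paths and compare them with the types you expect. The sketch below assumes you saved the new mapping to a file of your choosing and loaded it with json.load; the helper names and the EXPECTED table are my own, built from the edits made in this post.

```python
def field_types(properties, prefix=""):
    """Flatten a properties block into {dotted.field.path: type}."""
    types = {}
    for name, definition in properties.items():
        path = prefix + name
        if "properties" in definition:
            # Nested object: recurse one level deeper.
            types.update(field_types(definition["properties"], path + "."))
        else:
            types[path] = definition.get("type")
    return types


# Field types expected after the edits described in this post.
EXPECTED = {
    "geoip.dma_code": "short",
    "geoip.ip": "ip",
    "geoip.latitude": "half_float",
    "geoip.location": "geo_point",
    "geoip.longitude": "half_float",
    "response": "short",
}


def mismatches(mapping, index_name, doc_type, expected=EXPECTED):
    """Return {path: (expected, actual)} for every field that differs."""
    properties = mapping[index_name]["mappings"][doc_type]["properties"]
    actual = field_types(properties)
    return {
        path: (want, actual.get(path))
        for path, want in expected.items()
        if actual.get(path) != want
    }
```

An empty result from mismatches(mapping, "my_index", "mytype") means the template was applied as intended.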


Happy Logstashing!