13 June 2014

Logstash on OpenShift

By Alex Brasetvik

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

Get started analyzing your logs on OpenShift. While OpenShift lets you tail the logs of your apps, the Elasticsearch/Logstash/Kibana trinity gives you a very flexible and powerful toolchain to visualize and analyze these logs. This article explains how to make a Logstash cartridge on OpenShift. The cartridge feeds your logs into Elasticsearch where you can use the Kibana visualization engine to follow trends, detect anomalies and inspect incidents in your environment.

Introduction

OpenShift is Red Hat’s PaaS initiative, with both a public offering and an enterprise version that lets you take the increasingly popular Platform-as-a-Service approach into your own data centers and your private cloud.

Logstash is a tool for managing events and logs. Logstash combined with Elasticsearch and Kibana gives you a very powerful toolchain for searching, analyzing and visualizing your logs. This trinity is popularly called the ELK-stack.

While OpenShift easily lets you tail the logs of all your apps, tailing is not nearly as powerful as the ELK-stack. However, by design, application-specific concerns like logging are left to what OpenShift calls cartridges. A cartridge provides a very specific piece of functionality, and you mix these into the gears of the applications you deploy. A gear is a container with a certain purpose, and your application can be composed of multiple gears.

In this article, we’ll make a simple Logstash-cartridge, which you can easily mix into your application to gain insights into your logs.

Assumptions and Goals

Logstash has many outputs, with Elasticsearch, Graphite and S3 among them. Elasticsearch is one of the most widely used. We will configure our Logstashes to output logs to Elasticsearch, but the approach can easily be generalized to other outputs as well. You could even output to an upstream Logstash, to separate log shipping from log processing.

If you need a hosted Elasticsearch cluster, do give Found a try.

We assume some familiarity with OpenShift. If you do not have any experience with OpenShift, check out their getting-started guide.

The goal is to have a cartridge that you can add to your application, and then watch logs appear in an Elasticsearch cluster. Then, you can use Kibana to visualize and analyze your logs: inspect incidents, follow trends and detect anomalies in your environment.

OpenShift Configuration and Logs

Cartridges are typically customized with application-specific configuration (such as database credentials) through environment variables. One of these variables, $OPENSHIFT_LOG_DIR, indicates where a cartridge should log to.

Cartridges can run anything, so there is no standard logging format, or even a standard timestamp format. This is a universal problem for logging, and is something Logstash handles very well. Logstash has lots of different inputs and filters. In the example that follows, we will configure Logstash with file inputs, some filters for processing Apache logs, timestamps and IPs, finally outputting the logs to Elasticsearch. Logstash can do a lot more, and its documentation is good.
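To make the timestamp problem concrete, here is a tiny illustration (not part of the cartridge) of how two applications could describe the very same moment in entirely different formats, which is exactly the normalization work we delegate to Logstash:

```python
from datetime import datetime

# Two hypothetical cartridges logging the same moment in different formats;
# Logstash's filters normalize this into a single @timestamp for us.
moment = datetime(2014, 6, 10, 14, 31, 17)
print(moment.strftime("%d/%b/%Y:%H:%M:%S"))  # Apache-style: 10/Jun/2014:14:31:17
print(moment.isoformat())                    # ISO 8601:     2014-06-10T14:31:17
```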

Log formats are highly application specific, so you will need to configure Logstash’s processing accordingly. To have a real example, we’ll look at a simple Python-application that produces access logs:

# Create an application. Anything that produces logs. With this app we will get a webserver running that produces access logs, which is all we need.

$ rhc app create my-app python-2.6
Application Options
-------------------
Domain:     namespace
Cartridges: python-2.6
Gear Size:  default
Scaling:    no

Creating application 'my-app' ... done

[ ... ]

Your application 'my-app' is now available.

  URL:        http://my-app-namespace.rhcloud.com/

[ ... ]

# Configure environment variables, like cluster hostname and authentication options
$ rhc set-env --app my-app --env "OPENSHIFT_LOGSTASH_ES_HOST=dabadeee123-us-east-1.foundcluster.com"
Setting environment variable(s) ... done
$ rhc set-env --app my-app --env "OPENSHIFT_LOGSTASH_ES_USER=readwrite"
Setting environment variable(s) ... done
$ rhc set-env --app my-app --env "OPENSHIFT_LOGSTASH_ES_PASSWORD=secret"
Setting environment variable(s) ... done

# Finally, add the cartridge.
$ rhc cartridge add -a my-app https://cartreflect-claytondev.rhcloud.com/github/foundit/openshift-logstash-cartridge
The cartridge 'https://cartreflect-claytondev.rhcloud.com/github/foundit/openshift-logstash-cartridge' will be downloaded and installed
Adding https://cartreflect-claytondev.rhcloud.com/github/foundit/openshift-logstash-cartridge to application 'my-app' ... done

found-logstash-1.4.1 (Logstash 1.4.1)
-------------------------------------
  From:  https://cartreflect-claytondev.rhcloud.com/github/foundit/openshift-logstash-cartridge
  Gears: Located with python-2.6

After a while, logs should start to show up in the configured Elasticsearch cluster. If you visit your web application, access logs similar to the following should appear in python.log:

1.2.3.4 - - [10/Jun/2014:10:31:17 -0400] "GET / HTTP/1.1" 200 39617 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36"

These will then be picked up by Logstash and indexed to Elasticsearch like this:

{
    "_type": "logs",
    "_source": {
        "tags": ["app-name", "gear-name", "namespace"],
        "@timestamp": "2014-06-10T14:31:18.907Z",
        "host": "ex-std-node7.prod.rhcloud.com",
        "path": "/var/lib/openshift/530001200012cd3502000122/app-root/logs/python.log",
        "message": "1.2.3.4 - - [10/Jun/2014:10:31:17 -0400] \"GET / HTTP/1.1\" 200 39617 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36\"",
        "@version": "1"
    },
    "_index": "logstash-2014.06.10",
    "_id": "dCfV_YUjSwOlISJWQMafaw"
}

While this is a good start, the most interesting information is clumped up in the string 1.2.3.4 - - [10/Jun/2014:10:31:17 -0400] "GET / HTTP/1.1" 200 39617 "-" "Mozilla/5.0 [...]".

This string follows the Apache Combined-format, and Logstash has a pre-defined pattern for matching it. Generally, if you are using some well-known software, chances are good that you will find Logstash-patterns for it. Have a look in the patterns-directory of Logstash. The Grok Debugger’s Discover-feature can also be useful. If you paste the above log message in there, it will output %{COMBINEDAPACHELOG}. In the next section we will see how to use it.
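To get a feel for what %{COMBINEDAPACHELOG} extracts, here is a rough Python analogue. The regex below is a simplified illustration, not Logstash’s actual grok definition, but it pulls out the same named fields:

```python
import re

# A simplified stand-in for %{COMBINEDAPACHELOG}: named groups mirror the
# fields grok extracts (clientip, verb, response, bytes, and so on).
COMBINED = re.compile(
    r'(?P<clientip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[^"]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('1.2.3.4 - - [10/Jun/2014:10:31:17 -0400] "GET / HTTP/1.1" '
        '200 39617 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3)"')
fields = COMBINED.match(line).groupdict()
print(fields["clientip"], fields["verb"], fields["response"])
# 1.2.3.4 GET 200
```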

Customizing the Cartridge

The base cartridge, available on GitHub (foundit/openshift-logstash-cartridge), has a very simple configuration that produces the above log.

To configure Logstash to properly process the log and extract the data, we need to change the configuration a bit. First, we’ll say that the python.log-file is of type apache, and then make sure we subsequently filter any logs of type apache appropriately.

We want Logstash’s configuration input- and filter-section to look something like the following. See the complete logstash.conf.erb on GitHub. logstash.conf.erb is the template used to generate the Logstash configuration file, interpolating environment variables like URLs and authentication information.

input {

    # Anything logged, except python.log
    file {
        path => "<%= ENV['OPENSHIFT_LOG_DIR'] %>*.log"
        # Add some openshift-metadata for filtering purposes.
        tags => ["<%= ENV['OPENSHIFT_APP_NAME'] %>", "<%= ENV['OPENSHIFT_GEAR_NAME'] %>", "<%= ENV['OPENSHIFT_NAMESPACE'] %>"]
        exclude => "python.log"
    }

    # We know that python.log has Apache access logs.
    file {
        path => "<%= ENV['OPENSHIFT_LOG_DIR'] %>python.log"
        tags => ["<%= ENV['OPENSHIFT_APP_NAME'] %>", "<%= ENV['OPENSHIFT_GEAR_NAME'] %>", "<%= ENV['OPENSHIFT_NAMESPACE'] %>"]
        type => "apache"
    }

}

filter {
    if [type] == "apache" {

        # This will extract the different metadata in the log message, like timestamp, IP, method, etc.
        grok {
            match => ["message", "%{COMBINEDAPACHELOG}"]
        }

        # Convert the timeformat into something sensible.
        date {
            match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
        }

        # Annotate log with approximate geo-information.
        geoip {
            source => "clientip"
        }

    }
}
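As a sketch of what the date filter above accomplishes (an illustration only, not Logstash’s internals), the equivalent conversion in Python turns the Apache timestamp into the UTC @timestamp that Elasticsearch sorts and range-queries on:

```python
from datetime import datetime, timezone

# The Apache timestamp, with its local offset, becomes a UTC timestamp,
# matching the dd/MMM/yyyy:HH:mm:ss Z pattern in the date filter above.
raw = "10/Jun/2014:10:31:17 -0400"
parsed = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
print(parsed.astimezone(timezone.utc).isoformat())
# 2014-06-10T14:31:17+00:00
```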

With that configuration, Logstash will produce documents like the following:

{
    "_type": "apache",
    "_source": {
        "ident": "-",
        "tags": ["app-name", "gearname", "namespace"],
        "type": "apache",
        "@timestamp": "2014-06-10T15:19:31.000Z",
        "request": "/",
        "auth": "-",
        "response": "200",
        "referrer": "\"-\"",
        "bytes": "39617",
        "host": "ex-std-node7.prod.rhcloud.com",
        "verb": "GET",
        "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) [...]",
        "timestamp": "10/Jun/2014:11:19:31 -0400",
        "path": "/var/lib/openshift/539709b8e0b8cd3502000122/app-root/logs/python.log",
        "message": "1.2.3.4 - - [10/Jun/2014:11:19:31 -0400] \"GET / HTTP/1.1\" 200 39617 [...]",
        "@version": "1",
        "clientip": "1.2.3.4",
        "httpversion": "1.1",
        "geoip": {
          "region_name": "16",
          "ip": "1.2.3.4",
          "continent_code": "EU",
          "country_name": "Norway",
          "city_name": "Trondheim",
          "timezone": "Europe/Oslo",
          "longitude": 10.416699999999992,
          "country_code3": "NOR",
          "country_code2": "NO",
          "location": [
            10.416699999999992,
            63.41669999999999
          ],
          "latitude": 63.41669999999999,
          "real_region_name": "Sor-Trondelag"
        }
    },
    "_index": "logstash-2014.06.10",
    "_id": "Z1N65K5-QTOwZ9z3jPgEUw"
}

We can do a lot more with this document. We can slice and dice the access logs based on where users are from and what they visit, easily find problematic requests, and so on. The figure below shows an example of a Kibana dashboard that uses the above data.

[Figure: Using Kibana to analyze access logs]

To customize the cartridge, just fork the repository and edit conf/logstash.conf.erb with your desired Logstash-configuration.

Then, customize the URL when you add your cartridge, i.e. change foundit/openshift-logstash-cartridge to your-organization/repository-name.

$ rhc cartridge add -a my-app https://cartreflect-claytondev.rhcloud.com/github/your-organization/your-repo

By default, it will use the master branch. You can also pass a ?commit=branch-or-commit-id option in the URL, e.g. https://cartreflect-claytondev.rhcloud.com/github/foundit/openshift-logstash-cartridge?commit=python-sample.

For more information on customizing cartridges, and using cartridges that are not publicly available, see OpenShift’s guide on how cartridges are downloaded.

Log-friendly Mappings

As we have seen, Logstash takes care of sending logs to Elasticsearch. We still need to tell Elasticsearch how to treat those logs. We need to configure the mappings for the resulting indexes. (See also: An Introduction to Elasticsearch Mapping and A Data Exploration Workflow for Mappings)

Typically, Logstash will send logs to the index logstash-YYYY.MM.DD, where YYYY.MM.DD is the date of the log. This lets you limit searches to specific date ranges, and archive or delete old logs.
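As a small illustration (not part of the cartridge), the daily index names line up like this, which is what makes curating old logs cheap:

```python
from datetime import date, timedelta

# One index per day means dropping last month's logs is just deleting
# whole indexes, rather than purging documents out of one big index.
def index_for(day):
    return day.strftime("logstash-%Y.%m.%d")

today = date(2014, 6, 10)
last_week = [index_for(today - timedelta(days=n)) for n in range(7)]
print(last_week[0])   # logstash-2014.06.10
print(last_week[-1])  # logstash-2014.06.04
```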

To configure our log mapping, we need to define an index template, which defines the mapping for all indexes whose name matches logstash-*.

The following is an appropriate mapping for the logs produced above: string fields are not analyzed by default (except for the message field), the geoip location is configured to be of type geo_point, clientip as an ip, and so on.

{
    "template": "logstash-*",
    "settings": {
        "index.refresh_interval": "5s"
    },
    "mappings": {
        "_default_": {
            "_all": {
                "enabled": true
            },
            "dynamic_templates": [
                {
                    "string_fields": {
                        "match": "*",
                        "match_mapping_type": "string",
                        "mapping": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            ],
            "properties": {
                "geoip": {
                    "properties": {
                        "location": {
                            "type": "geo_point"
                        }
                    }
                },
                "clientip": {
                    "type": "ip"
                },
                "bytes": {
                    "type": "long"
                },
                "message": {
                    "type": "string",
                    "index": "analyzed",
                    "omit_norms": true
                }
            }
        }
    }
}

To apply the template, send a PUT-request to /_template/logstash with the JSON-body:

$ curl https://user:pass@cluster-id.foundcluster.com:9243/_template/logstash -XPUT -d @index-template.json
{"acknowledged":true}

Then, Elasticsearch will use the settings and mappings specified in the index template whenever it creates a new logstash-index.

Summary

We now have a cartridge that can serve as a basis for custom Logstash-cartridges with a more application-specific configuration. Getting your logs into an Elasticsearch cluster for awesome searching and analytics is now just a matter of adding an OpenShift cartridge.