30 Juni 2015

Elasticsearch Logging Secrets

Von Andrei Stefan

Like every other software out there, Elasticsearch provides some insight into what's happening with its internals while running. This is an informative process and can reveal quickly if something bad happened (in the form of an error message), or it can decisively determine the cause of a performance issue or why the cluster is behaving in a way users might not like.

The Basics

Main Elasticsearch logs are written to ES_HOME/logs/[cluster_name].log file. For this file the default level is INFO, thus being sufficient for a rather moderate amount of information and, at the same time, not create a huge log file.

Logging settings are specified in the logging.yml configuration file. While this is a convenient way of having the log settings in a single place, having to modify this file on every node every time you want to change the settings is painful because it requires nodes restart and unnecessary downtime.

Good news! Elasticsearch allows you to  update the logging settings dynamically.

The same applies to  the second type of logging that it provides: slowlogs. These provide a very useful insight into how fast or slow queries/indexing process/fetches are performing. The slowlog file will show the query itself, on which shards it ran and how much time it took to complete.

For the logs, there is a list of "levels" which indicate how granular the logs will be. In the order of granularity from low to high, we have: WARN, INFO, DEBUG, TRACE. Setting the log level to WARN will display only WARN logging messages. On the other hand, setting the level to INFO will display both WARN and INFO logging messages, so this is a "cumulative" set of log messages. Similarly, for TRACE every log message in Elasticsearch will be printed in the log files. Setting the level to TRACE will create a considerable amount of chatter.

Dynamically changing the root logger level

This is not necessarily a tricky task :-), but for completeness sake and to warm up before the following tips&tricks let's first change the logging level for the root logger (the one from which all other loggers inherit). You can run a query like the following, while the cluster is running of course:

PUT /_cluster/settings
{"transient":{"logger._root":"DEBUG"}}

Very simple and quick. Now you should see some more interesting things in the logs, some messages that weren't available with INFO or WARN levels.

Dynamically changing the logger level per Java package

Many times it is very useful to see logging messages that are related to a certain part or functionality of Elasticsearch. Increasing the logging level globally in all modules in Elasticsearch would result in a potentially very large log file, so Elasticsearch allows  individual "sections" to be configured separately.

These "sections" are the Java packages and the granularity logging level can be set per Java package. Let's take, for example, the package org.elasticsearch.discovery.zen. It contains classes that deal with Zen discovery in Elasticsearch. If you have an issue with nodes joining/leaving the cluster, or with master election, this would be the package to monitor in your logs:

PUT /_cluster/settings
{"transient": {"logger.discovery.zen":"TRACE"}}

You notice in the command that "logger" is again present and, yes, it should always be. But, how did we choose "discovery.zen". You remember I mentioned above that the logging level can be set per Java package. Well, "discovery.zen" is just that, without the already present "org.elasticsearch". What if you are interested only in the master node fault detection messages? The Java package containing classes for master fault detection is org.elasticsearch.discovery.zen.fd. And the Elasticsearch command would be:

PUT /_cluster/settings
{"transient": {"logger.discovery.zen.fd":"TRACE"}}

You got the picture. Some ideas on what to monitor and when:

  • org.elasticsearch.env for logging messages referring to node's data paths
  • org.elasticsearch.indices.recovery and <code>org.elasticsearch.index.gateway for logging messages referring to shards recoveries
  • org.elasticsearch.cluster.action.shard for logging messages related to shard states (failed shards, shards starting etc)
  • org.elasticsearch.snapshots for everything related to snapshots&restore
  • org.elasticsearch.http to monitor for any binding or connectivity/discovery issues between nodes
  • org.elasticsearch.marvel.agent.exporter to monitor the Marvel agent

Slowlog settings at index level

You remember that I mentioned the logging settings can be dynamically changed, without nodes or cluster restart. This is valid for slow logs, as well. Imagine you have several tens of indices in your cluster and you suspect that one of the older indices is performing poorly. You'd want to investigate this index more closely and monitor the query response times. The first approach would be to set the slowlog logging level to DEBUG or TRACE cluster-wide and grep the resulting slowlog file for your index name.

This will work, indeed, but why having to grep for the desired index name? Instead, set the slowlog settings at index level, only for the index you are interested into:

PUT /test_index/_settings
{
  "index": {
    "search.slowlog.level": "trace",
    "search.slowlog.threshold.query.trace": "100ms"
  }
}

The settings above change the slowlog logging level and the threshold for only the  test_index. In this way you can monitor only the queries for that particular index without the overhead of enabling "trace" logging for all indices.

Logging inside Groovy scripts

Another trick that I found myself using several times to debug Groovy scripts in Elasticsearch is logging inside the script itself to double check that the script is doing what I believe it's doing.

For example, you have an index with documents that hold a timestamp and you want to filter your documents considering a certain week of the year. Using a script this goes like this:

{
  "query": {
    "filtered": {
      "filter": {
        "script": {
          "script": "weeks.contains((int)doc['timestamp'].date.toDateTime().getWeekOfWeekyear())",
          "params": {
            "weeks": [
              18,19,20
            ]
          },
          "lang": "groovy"
        }
      }
    }
  }
}

To check if the correct Groovy method for finding out the week of the year is getWeekOfWeekyear(), this logging can be added to the script, to print in the logs the actual value returned by that method:

import  org.elasticsearch.common.logging.*;
ESLogger logger = ESLoggerFactory.getLogger('myscript');
logger.logger.log(org.apache.log4j.Level.INFO, (int)doc['timestamp'].date.toDateTime().getWeekOfWeekyear());

And the final query becomes:

{
  "query": {
    "filtered": {
      "filter": {
        "script": {
          "script": "import  org.elasticsearch.common.logging.*; ESLogger logger=ESLoggerFactory.getLogger('myscript'); logger.logger.log(org.apache.log4j.Level.INFO,(int)doc['timestamp'].date.toDateTime().getWeekOfWeekyear());values.contains((int)doc['timestamp'].date.toDateTime().getWeekOfWeekyear())",
          "params": {
            "values": [
              18,19,20
            ]
          },
          "lang": "groovy"
        }
      }
    }
  }
}


Deprecation Logging in 2.0

Among many other planned goodies in the 2.0 release of Elasticsearch, we also added a new deprecation log to notify users when they use any settings, features, parameters etc which have been deprecated.