Elasticsearch Command Line Debugging With The _cat API

One of the most useful utilities for investigating Elasticsearch from the command line is the _cat API. Whereas the usual Elasticsearch API endpoints are ideal for consuming JSON from within a fully-fledged programming language, the cat API (as its name would imply) is especially suited for command-line tools.

In previous posts we've explored some of the different endpoints for the API. We can build upon that and the existing plethora of command-line utilities to build even more useful patterns that we can combine for simple (and effective) monitoring and debugging use cases.

A Primer for your Cat

A basic familiarity with the cat API is helpful before reading on. In particular:

  • The h query string parameter allows us to ask the API to only return certain fields that we're interested in (i.e., the header)
  • The -s argument tells curl to run silently. Without it, extraneous HTTP transfer data may leak into the pipeline when chaining commands together.
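
Since every example in this post repeats the same curl invocation, a small wrapper can save some typing. The following is only a sketch: the function name es_cat and the ES_HOST variable are hypothetical conveniences introduced here, not part of Elasticsearch or curl.

```shell
# Hypothetical helper: es_cat ENDPOINT [QUERY_STRING]
# ES_HOST is an assumed environment variable; it defaults to localhost:9200.
es_cat() {
  # -s keeps curl's transfer output out of the pipeline.
  # The "?" is only appended when a query string was given.
  curl -s "${ES_HOST:-localhost:9200}/_cat/$1${2:+?$2}"
}
```

With this in place, es_cat nodes 'h=host,heap.percent' should behave like curl -s 'localhost:9200/_cat/nodes?h=host,heap.percent'.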

Diving Into The Heap

An oft-asked question is how to debug OutOfMemoryError messages. You've taken the right measures to configure your heap, but after some time of normal use, heap usage grows again and instability ensues. How can you dig further?

There are lots of good resources to track this sort of resource utilization: Marvel offers commercial monitoring, and there are many open source options as well. However, when you're debugging a red cluster at the last minute, you need immediate options. What tools can you reach for easily?

The cat API offers many endpoints, and piping a curl command which retrieves heap metrics into sort can quickly answer the question, "Which node is experiencing the most memory pressure right now?"

$ curl -s 'localhost:9200/_cat/nodes?h=host,heap.percent' | sort -r -n -k2
es02  71
es00  60
es01  59

We can see that node es02 is using 71% of the JVM heap. Following the pipeline:

  • We ask for the nodes endpoint, querying the hostname and heap percentage in use, then
  • Pipe to sort, ordering numerically (-n) on the second column (-k2, heap percentage in use) in reverse (-r) so that the highest usage comes first.

Coupled with other utilities like head and tail, we can find both over- and under-utilized nodes in very large clusters very quickly.
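
As a rough sketch of the head and tail idea, the two functions below read "host heap.percent" rows on standard input. The names top_heap and bottom_heap are hypothetical, and the default of three rows is an arbitrary choice.

```shell
# Print the N nodes with the highest heap usage (default 3).
top_heap() {
  sort -r -n -k2 | head -n "${1:-3}"
}

# Print the N nodes with the lowest heap usage (default 3).
# The rows stay in descending order, so the least-loaded node is last.
bottom_heap() {
  sort -r -n -k2 | tail -n "${1:-3}"
}
```

For example, curl -s 'localhost:9200/_cat/nodes?h=host,heap.percent' | top_heap 5 would list the five nodes under the most memory pressure.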

This is useful, but we can do more. It would be nice if we could query heap usage at a more granular level in order to determine what, exactly, is using space on our nodes.

It turns out we can:

$ curl -s 'localhost:9200/_cat/nodes?h=name,fm,fcm,sm&v'
name       fm     fcm      sm
es01  781.4mb 675.6mb 734.5mb
es02    1.6gb 681.3mb 892.2mb
es00    1.4gb 620.1mb 899.4mb

Note: These are abbreviated column names for fielddata.memory_size (fm), filter_cache.memory_size (fcm), and segments.memory (sm). Other fields exist as well; run curl -s 'localhost:9200/_cat/nodes?help' | grep -i memory for additional information.

From this we can see that fielddata.memory_size is consuming a fairly large part of our node's memory. Armed with this knowledge, mitigations such as increased use of doc_values can aid in shrinking that aspect of heap usage.
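
Human-readable sizes such as 781.4mb and 1.6gb do not sort correctly with a plain numeric sort, so ranking nodes by fielddata takes one extra step. The sketch below converts the second column to bytes before sorting. The name by_fielddata is hypothetical, and the conversion assumes binary multiples (1kb = 1024 bytes) and header-less input, that is, a request without the v parameter.

```shell
# Read "name fm fcm sm" rows (no header) on stdin and print them sorted by
# fielddata size, largest first.
by_fielddata() {
  awk '
    # Convert a size such as "781.4mb" or "1.6gb" to bytes.
    # Assumption: units are binary multiples (1kb = 1024 bytes).
    function to_bytes(s,    n, unit) {
      n = s + 0                      # awk keeps the leading numeric part
      unit = tolower(s)
      gsub(/[0-9.]/, "", unit)       # what is left is the unit suffix
      if (unit == "kb") return n * 1024
      if (unit == "mb") return n * 1024 * 1024
      if (unit == "gb") return n * 1024 * 1024 * 1024
      if (unit == "tb") return n * 1024 * 1024 * 1024 * 1024
      return n                       # plain bytes or no suffix
    }
    # Prefix each row with its fielddata size in bytes as a sort key.
    { printf "%.0f %s\n", to_bytes($2), $0 }
  ' | sort -r -n -k1,1 | cut -d" " -f2-
}
```

A possible invocation is curl -s 'localhost:9200/_cat/nodes?h=name,fm,fcm,sm' | by_fielddata, which puts the node with the largest fielddata footprint on the first line.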

Gazing Into The Thread Pool

Many Elasticsearch operations take place in thread pools, which are useful when inspecting what your cluster is busy doing. During peak times, the thread pool can be a useful reflection of what operations (searching, indexing, etc.) are keeping machines busy.

The cat API is useful here, too. By default it returns the active, queued, and rejected counts for the most common thread pools, which can often help pinpoint requests that are backing up into queues under heavy load. Consider this generic output:

$ curl -s 'localhost:9200/_cat/thread_pool'
es03  10.xxx.xx.xxx 0 0 0 0 0 0 1 0 0
elk00 10.xx.xxx.xxx 0 0 0 0 0 0 1 0 0
es00  10.xx.xx.xxx  0 0 0 0 0 0 0 0 0

Note: the table headers are omitted here, but the numbers following the node IPs are the active, queue, and rejected counts for the bulk, index, and search thread pools, respectively.

This cluster is serving a single search request, which isn't terribly exciting. However, what if your cluster is having problems and you need to closely watch operations? A watch command can help here:

$ watch -d 'curl -s localhost:9200/_cat/thread_pool | sort -k1,1'

watch executes the command every 2 seconds by default. We sort on the first column (the host name) to keep the row ordering consistent between refreshes, and the -d flag tells watch to highlight values that changed since the previous run. This makes it easy to spot problems such as a deluge of search requests getting queued when users are hitting the cluster hard.
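
On a large cluster most rows will be all zeros, so it can help to print only the nodes that are actually backing up. The filter below is a sketch that assumes the default column layout described in the note above (host, IP, then active, queue, and rejected for bulk, index, and search). The name backed_up is hypothetical.

```shell
# Read default _cat/thread_pool rows on stdin and report every pool that
# has a non-zero queue or rejected count.
backed_up() {
  awk '
    {
      n = split("bulk index search", pool, " ")
      for (i = 1; i <= n; i++) {
        queue    = $(3 * i + 1) + 0    # columns 4, 7, 10
        rejected = $(3 * i + 2) + 0    # columns 5, 8, 11
        if (queue > 0 || rejected > 0)
          printf "%s %s queue=%d rejected=%d\n", $1, pool[i], queue, rejected
      }
    }
  '
}
```

Piping curl -s localhost:9200/_cat/thread_pool into backed_up prints nothing on a healthy cluster, which also makes the function easy to use in a conditional.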

Diffing Indices for Fun and Profit

Another common use case is migrating data from one cluster to another. There are several ways to do this, including snapshots and utilities like Logstash. With large datasets, this can take quite some time, so how can we gain visibility into the process?

The cat API offers simple endpoints for index metrics through _cat/indices which includes information such as index disk usage and document count per index.

Given a scenario in which we're streaming documents from one index to another on a different cluster, we can perform some command-line gymnastics to watch a diff between index document counts. Consider the following command:

$ join <(curl -s localhost:9200/_cat/indices | awk '$3 ~ /foo-index/ { print $3 " " $6; }') <(curl -s otherhost:9200/_cat/indices | awk '$3 ~ /foo-index/ { print $3 " " $6; }') | awk '{ print $1 " " $2 - $3; }'
foo-index -231700

This command makes use of bash process substitution in order to create temporary file descriptors from the output of the commands between the <() parentheses. We use awk to find the index of interest, foo-index, and print the index name and document count for the local index and the remote one we're streaming to. The join command matches the two outputs on their first field (the index name) and merges them into one line, and we pipe the results through another awk in order to calculate the difference between the two document counts.

The results of the example command indicate that there's a disparity in document count that should converge as the stream nears completion and finishes replicating the data.
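
The one-liner above can be hard to read at a glance, so here is a sketch that splits it into two small functions. The names index_counts and count_diff are hypothetical. Like the original command, the sketch assumes the index name is column 3 and the document count is column 6 of the _cat/indices output. A sort is added because join expects its inputs to be sorted on the join field, which matters once the pattern matches more than one index.

```shell
# Print "index doc_count" for the _cat/indices rows on stdin whose index
# name matches the pattern given as $1.
index_counts() {
  awk -v pat="$1" '$3 ~ pat { print $3 " " $6 }' | sort
}

# Join two such listings (passed as file names) on the index name and
# print the first document count minus the second.
count_diff() {
  join "$1" "$2" | awk '{ print $1 " " ($2 - $3) }'
}
```

In bash, the two listings can still be supplied with process substitution: count_diff <(curl -s localhost:9200/_cat/indices | index_counts foo-index) <(curl -s otherhost:9200/_cat/indices | index_counts foo-index).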

By placing this command into either a watch command or bash for loop, we can quickly whip up a small utility to watch the progress of our index import.
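
When a scrolling history is more useful than the in-place refresh that watch provides, a plain shell loop works as well. The function below is a minimal sketch. The name poll and its argument order are made up for this example.

```shell
# Run a command repeatedly, prefixing each run with a timestamp.
# Usage: poll INTERVAL_SECONDS COUNT COMMAND [ARGS...]
poll() {
  interval=$1
  count=$2
  shift 2
  i=0
  while [ "$i" -lt "$count" ]; do
    printf '%s ' "$(date +%T)"   # timestamp, no trailing newline
    "$@"                         # the command to repeat
    i=$((i + 1))
    # Sleep between runs, but not after the last one.
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
  done
}
```

Assuming the diff pipeline has been saved as a script (say, a hypothetical ./index-diff.sh), poll 10 60 ./index-diff.sh would log the remaining document gap every 10 seconds for 10 minutes.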

Going Further

These are just a few examples of how everyday command line utilities can be paired with the cat API to make life easier when administering Elasticsearch. There are plenty of other potential applications to consider - for example:

  • Use the _cat/recovery API to watch the recovery progress of a cluster in a shell loop
  • Consume metrics like thread pools or node statistics in tools like Ganglia or Nagios for simple alerting on Elasticsearch health
  • Use _cat/shards when in a pinch to find exactly which shard is causing your cluster to go into a red or yellow state
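
As one sketch of the last idea, the filter below keeps only the shards that are not in the STARTED state. It assumes the default _cat/shards column order, in which the state is the fourth column (index, shard, prirep, state, and so on). The name problem_shards is hypothetical.

```shell
# Read _cat/shards rows on stdin and print index, shard number,
# primary/replica flag, and state for every shard that is not STARTED.
problem_shards() {
  awk '$4 != "STARTED" { print $1, $2, $3, $4 }'
}
```

Running curl -s localhost:9200/_cat/shards | problem_shards on a red or yellow cluster should narrow the output down to the shards worth investigating.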

Good luck, and may the cat be with you!