Engineering

# Watching the watches: Writing, debugging and testing watches

Editor's Note (September 7, 2018): This post refers to X-Pack. Starting with the 6.3 release, the X-Pack code is now open and fully integrated as features into the Elastic Stack.

TLDR; This blog post will outline how to efficiently write, execute, and debug watches. This will be useful if you are writing your own watches or if you have to debug watches written by your colleagues.

## The correct workflow

A watch is the unit of work for the alerting features in X-Pack. A watch contains information on where to get data from that could trigger the alert, a condition to trigger upon, which actions to take when that condition holds true, and how often the watch should be executed. Let’s say we want to add a watch to our system and see if it triggers how we want it to. Let's go with the most simple watch we can imagine:

PUT _xpack/watcher/watch/my-watch
{
"trigger" : {
"schedule" : {
"interval": "5m"
}
},
"input" : {
"simple" : {
"foo": "bar"
}
},
"actions" : {
"log-me-action" : {
"logging"  : {
}
}
}
}


This is the most barebones watch. We have to define a trigger, an input and a single action. If you do not specify a condition, an always condition will be used, so the actions are always executed.

Now you have two choices. First, you could wait for the scheduled execution run to check if everything works as expected. This, however, would make your testing cycle viciously slow, so you should not do that. Instead you could use the Execute Watch API to execute the watch manually like this:

POST _xpack/watcher/watch/my-watch/_execute


You will get a longish output that shows the execution path of the watch. In this case, it will simply return the configured input. However, a search or a webhook might return the HTTP response from the destination host. You will see information if the condition was met, and lastly you will see for each action, if it was executed and its return value. This entry is exactly the same as the one that is stored in the watch history, when the watch is triggered automatically.

However this workflow still has a weakness: you have to execute two calls. First you store the watch then you execute the watch. In order to overcome this, the Execute Watch API allows you to specify a watch like this

PUT _xpack/watcher/watch/_execute
{
"watch" : {
"trigger" : {
"schedule" : {
"interval": "5m"
}
},
"input" : {
"simple" : {
"foo": "bar"
}
},
"actions" : {
"log-me-action" : {
"logging"  : {
}
}
}
}
}


When you check the Execute Watch API response, you will see that the watch_record.watch_id field of the watch is referred to as inlined because we supplied the watch with the request. This saves you an additional round-trip of creating a watch and executing it. But this is just the first step! Let's do a quick excursion into the watch history, and then we'll come back for more and better testing.

## A journey into the watch history

The watch history is stored using time-based indices that are created automatically by Watcher on a daily basis. Whenever a watch is executed, a corresponding watch history entry is created for the execution. Let's examine a sample watch history for the above watch, when we ran the Execute Watch API.

{
"_id" : "_inlined__7310b96e-b693-4092-b54f-c54cf178fde5-2017-03-06T13:13:35.467Z",
"watch_record" : {
"watch_id" : "_inlined_",
"state" : "executed",
"trigger_event" : {
"type" : "manual",
"triggered_time" : "2017-03-06T13:13:35.467Z",
"manual" : {
"schedule" : {
"scheduled_time" : "2017-03-06T13:13:35.467Z"
}
}
},
"input" : {
"simple" : {
"foo" : "bar"
}
},
"condition" : {
"always" : { }
},
"result" : {
"execution_time" : "2017-03-06T13:13:35.467Z",
"execution_duration" : 2,
"input" : {
"type" : "simple",
"status" : "success",
"foo" : "bar"
}
},
"condition" : {
"type" : "always",
"status" : "success",
"met" : true
},
"actions" : [
{
"id" : "log-me-action",
"type" : "logging",
"status" : "success",
"logging" : {
}
}
]
},
"messages" : [ ]
}
}


The state field indicates whether the watch was executed successfully. Usually you will see the executed state. What else could you see? The state could indicate that a watch was acknowledged, throttled, the condition was not met or that the dedicated threadpool and its queue were filled up when the watch tried to be executed.

The trigger_event field contains information on what triggered this watch, which is usually a schedule-based trigger (unless you use the execute watch API yourself).

The input structure only contains the information defined in our input. This is to ensure that you have a way to exactly know what was requested for this execution of the watch, as this information would disappear when you update your watch..

The condition structure in this concrete example is just expanded to an always condition. Again the same applies as with the input field: this ensures you know the condition of the watch at the time of the execution, even if you change it later on.

Now on the interesting part: the result field. For each execution, regardless of input, condition, or action, the data gathered is logged. Let's take a look at the simple input result first

"input" : {
"type" : "simple",
"status" : "success",
"foo" : "bar"
}
},


This output does not yield any further information and thus is a bit boring, but so is the simple input. However, the output of the http input is vastly different

"input" : {
"http" : {
"request": {
"url" : "http://localhost:9200/_cluster/health"
}
}
}


With this input, the output is much more verbose

"result" : {
...
"input" : {
"type" : "http",
"status" : "success",
"cluster_name" : "elasticsearch",
"active_shards" : 3,
"active_primary_shards" : 3,
"unassigned_shards" : 3,
"delayed_unassigned_shards" : 0,
"timed_out" : false,
"relocating_shards" : 0,
"content-length" : [
"389"
],
"content-type" : [
"application/json; charset=UTF-8"
]
},
"initializing_shards" : 0,
"number_of_data_nodes" : 1,
"number_of_in_flight_fetch" : 0,
"active_shards_percent_as_number" : 50.0,
"_status_code" : 200,
"status" : "yellow",
"number_of_nodes" : 1
},
"http" : {
"request" : {
"host" : "localhost",
"port" : 9200,
"scheme" : "http",
"method" : "get",
"path" : "/_cluster/health"
},
"status_code" : 200
}
}
}


As you can see, the payload field, which you can access via ctx.payload in the condition and the actions, has been enriched with the JSON response from a cluster health request, along with a few more fields like _status_code and the HTTP response headers in _headers. Also there is a generic status field for each input, that indicates if the input execution was successful.

Next is the condition in the result. Again, our omitted condition is somewhat boring, as this only marks the condition being met due to always returning true. However let’s take a look at this scripted condition, which is only met if the cluster health API reports more than three nodes

"condition" : {
"script" : {
"inline" : "return ctx.payload.number_of_nodes > 3",
"lang" : "painless"
}
}


As we run a single node cluster, this condition will not match and this is what the output looks like.

"condition" : {
"type" : "script",
"status" : "success",
"met" : false
}


The output status is marked as success because the script itself was executed successfully. A nice feature of the painless scripting language are its exact stack traces with spot-on error messages. As we do not want to to lose this information, in case you made a mistake the exception field in the watch history will be enriched with the whole stack trace. For example, specifying a non-existing field as in return ctx.payload.numbr_of_nodes > 3 will produce an additional script_stack field, that contains the exact location where the script failed.

"script_stack" : [
"                  ^---- HERE"
],


If you are interested in further painless debugging, you should take a look at the painless documentation.

The next part of the result field contains information for each action being executed. In our simple example, we only see the text that is also logged into the Elasticsearch log file. However, when your output becomes a little more complex, there is a pile of useful information in there. Let's use the index action as an example:

"actions" : {
"index-me-action" : {
"index"  : {
"index" : "foo",
"doc_type" : "bar"
}
}
}


This action yields the following output

"actions" : [
{
"id" : "index-me-action",
"type" : "index",
"status" : "success",
"index" : {
"response" : {
"created" : true,
"result" : "created",
"id" : "AVqj7nBXM9pxFWYxpNpM",
"version" : 1,
"type" : "bar",
"index" : "foo"
}
}
}
]


Last but not least, the result field also contains information such as when the watch was executed in the execution_time as well as the execution duration in the execution_duration field in milliseconds.

## Using alternative inputs to test time-based queries

Now we know how to decipher the watch history and how to run a watch, but how can we actually debug watches? Also, we can still improve our workflow to create a watch. Let's imagine we are writing a watch to monitor disk space used from Metricbeat data. In order to detect machines that are running low on disk space, we need to search for documents containing disk space information which have been created in the last five minutes and contain a disk usage of more than 90%. Then we want to aggregate the results per host. Note, that this is a very simple example, you actually may want to aggregate per host and per mount-point to get more exact alerts and/or better filter capabilities.

The query might look similar to the following one

GET metricbeat-2017.03.06/_search
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"@timestamp": {
"from": "now-5m"
}
}
},
{
"term": {
"metricset.name": "filesystem"
}
},
{
"range": {
"system.filesystem.used.pct": {
"from": 0.9
}
}
}
]
}
},
"aggs": {
"by_host": {
"terms": {
"field": "beat.hostname",
"size": "100"
}
}
}
}


The response might be something like this:

{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 60,
"max_score": 0,
"hits": []
},
"aggregations": {
"by_host": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "orca",
"doc_count": 60
},
{
"key": "dolphin",
"doc_count": 120
}
]
}
}
}


In this example two hosts have a fair share of documents above the threshold. So, we could go easy and add the watch around the written query now. However, in order to properly test our watch, we would need to test the case of system being low on disk space as well as the case where everything is fine. Therefore we need a way to emulate the above input. Watcher supports exactly this by using an alternative_input in the execute watch API. Again we are using the Execute Watch API without storing the watch - as we are still not sure, whether this is going to be our final watch.

POST _xpack/watcher/watch/_execute
{
"alternative_input" : {
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 60,
"max_score": 0,
"hits": []
},
"aggregations": {
"by_host": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "orca",
"doc_count": 60
},
{
"key": "dolphin",
"doc_count": 120
}
]
}
}
},
"watch" : {
"trigger" : {
"schedule" : {
"interval": "5m"
}
},
"input" : {
"search" : {
"request": {
"indices" : [ "<metricbeat-{now/d}>"],
"body" : {
"query": {
"bool": {
"filter": [
{
"range": {
"@timestamp": {
"from": "{{ctx.trigger.scheduled_time}}||-5m"
}
}
},
{
"term": {
"metricset.name": "filesystem"
}
},
{
"range": {
"system.filesystem.used.pct": {
"from": 0.5
}
}
}
]
}
},
"size": 0,
"aggs": {
"by_host": {
"terms": {
"field": "beat.hostname",
"size": "100"
}
}
}
}
}
}
},
"condition" : {
"script" : "return ctx.payload.aggregations.by_host.buckets.size() > 0"
},
"actions" : {
"logme-me-action" : {
"transform" : {
"script" : {
"inline" : "return [ 'hosts' : ctx.payload.aggregations.by_host.buckets.stream().map(b -> b.key).collect(Collectors.toList())]"
}
},
"logging"  : {
"text" : "Over the limit: {{#join delimiter=', '}}ctx.payload.hosts{{/join delimiter=', '}}"
}
}
}
}
}


Great, we were now able to test a watch with arbitrary input data without actually executing the query. There are two things, that you might want to improve. First, this mechanism still requires you to execute a query and then paste the response of that query into the alternative_input field. We can do better than this by taking over more control of the query. The timestamp for the range query we used, contained the scheduled time of the watch execution as a filter. So, if we can control the scheduled time from the outside, there is no need to control the whole input, we could just go back in time by changing the scheduled time. By removing the alternative_input field and adding trigger_data to the execute watch API request you can achieve exactly that

"trigger_data" : {
"scheduled_time" : "2017-03-06T12:34:56.055Z"
}


This way, the specified scheduled time will be used. Note, that we are still using now/d in the index name (which in addition could also roll over to the next day), so this is not yet completely fool proof, but a good start. In case you are wondering what the difference between scheduled_time and triggered_time is. The first is the time, when the watch was supposed to be run (i.e. in five minutes), whereas the second one is the time, when the watch was triggered to be executed (i.e. in five minutes fifteen seconds).

Reading the currently set up watch, there is one more issue to solve. Do we just want to log data in our action configuration? I do not think so. But if we add an email action, do you really want to send out an email to the person on-call right now? Maybe not, so we need to find a way to check if an action would be executed, without executing the action. Let's go back to our index action from the earlier example and simulate it

## Simulating actions

Remember our nice little index action? This one

"actions" : {
"index-me-action" : {
"index"  : {
"index" : "foo",
"doc_type" : "bar"
}
}
}


Running this with the execute watch API creates a new document for every execution. This might skew metrics or run times, so what if we could simulate this? It turns out we can solve this problem by specifying the action mode as part of the execute watch API:

"action_modes" : {
"index-me-action" : "simulate"
}


If you check the result for the action in the response now, you will see that the action got a status called simulated. This means, the request has not been sent but we know the action was triggered. Please see the documention for further action modes, which might be needed to circumvent throttling during testing.

## Using the croneval tool

There is one last helper for the cron interval schedule, that might save you a ton of headaches. There is a small CLI tool in the x-pack/bin/ directory called croneval. This allows you to supply a cron expression and see when it matches next. Take this example

./bin/x-pack/croneval -c 4 '0 5 9 ? * 6#1'
Valid!
Now is [Mon, 6 Mar 2017 15:30:02]
Here are the next 4 times this cron expression will trigger:
1.    Fri, 7 Apr 2017 11:05:00
2.    Fri, 5 May 2017 11:05:00
3.    Fri, 2 Jun 2017 11:05:00
4.    Fri, 7 Jul 2017 11:05:00


## Final words

I hope you have enjoyed our journey into alerting. We still have not covered everything (think of conditions per action, transforms per action and in general, the possibility to ignore the condition when executing a watch via the execute watch API, acking & throttling, which also is taken into account when simulating actions). We might leave this for a more advanced post in the future. Happy alerting!

Last but not least, with the release of the Elastic Stack 5.4, there is also a new Watcher UI, that you definitely should check out!