Watching Marvel Data

If you use Marvel to monitor your Elasticsearch deployment, you can set up watches to take action when something out of the ordinary occurs. For example, you could set up watches to alert on:

  • Cluster health
  • Memory usage
  • CPU usage
  • Open file descriptors

These watches query the index where your cluster’s Marvel data is stored. If you don’t have Marvel installed, the queries won’t return any results, the conditions evaluate to false, and no actions are performed.

Watching Cluster Health

This watch checks the cluster health once a minute and takes action if the cluster state has been red for the last 60 seconds:

  • The watch schedule is set to execute the watch every minute.
  • The watch input gets the most recent cluster status from the .marvel-es-1-* indices.
  • The watch condition checks the cluster status to see if it’s been red for the last 60 seconds.
  • The watch action is to send an email. (You could also call a webhook or store the event.)
PUT _watcher/watch/cluster_red_alert
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": ".marvel-es-1-*",
        "types": "cluster_state",
        "body": {
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "timestamp": {
                      "gte": "now-2m",
                      "lte": "now"
                    }
                  }
                },
                {
                  "terms": {
                    "cluster_state.status": ["green", "yellow", "red"]
                  }
                }
              ]
            }
          },
          "_source": [
            "cluster_state.status"
          ],
          "sort": [
            {
              "timestamp": {
                "order": "desc"
              }
            }
          ],
          "size": 1,
          "aggs": {
            "minutes": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "5s"
              },
              "aggs": {
                "status": {
                  "terms": {
                    "field": "cluster_state.status",
                    "size": 3
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "throttle_period": "30m", 
  "condition": {
    "script": {
      "inline": "if (ctx.payload.hits.total < 1) return false; def rows = ctx.payload.hits.hits; if (rows[0]._source.cluster_state.status != 'red') return false; if (ctx.payload.aggregations.minutes.buckets.size() < 12) return false; def last60Seconds = ctx.payload.aggregations.minutes.buckets[-12..-1]; return last60Seconds.every { it.status.buckets.every { s -> s.key == 'red' }}"
    }
  },
  "actions": {
    "send_email": { 
      "email": {
        "to": "<username>@<domainname>", 
        "subject": "Watcher Notification - Cluster has been RED for the last 60 seconds",
        "body": "Your cluster has been red for the last 60 seconds."
      }
    }
  }
}

The throttle period prevents notifications from being sent more than once every 30 minutes. You can change the throttle period to receive notifications more or less frequently.

To send email notifications, you must configure at least one email account in elasticsearch.yml. See Configuring Email Services for more information.

In the to field of the email action, specify the email address you want to notify.

This example uses an inline script, which requires you to enable dynamic scripting in Elasticsearch. While this is convenient when you’re experimenting with Watcher, in a production environment we recommend disabling dynamic scripting and using file scripts.
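
To make the condition easier to follow, here is a minimal Python sketch of the same check, run against a hand-built payload. The function names and the sample data are illustrative assumptions and are not part of Watcher; the watch itself runs the Groovy script shown in the example.

```python
def red_for_last_60_seconds(payload):
    """Mirror the watch condition: the latest status is red and the
    last twelve 5-second buckets contain only red statuses."""
    if payload["hits"]["total"] < 1:
        return False
    latest = payload["hits"]["hits"][0]["_source"]["cluster_state"]["status"]
    if latest != "red":
        return False
    buckets = payload["aggregations"]["minutes"]["buckets"]
    if len(buckets) < 12:
        # Fewer than 12 buckets means less than 60 seconds of data.
        return False
    return all(
        all(status["key"] == "red" for status in bucket["status"]["buckets"])
        for bucket in buckets[-12:]
    )


def make_payload(statuses):
    """Build a hypothetical search response with one 5-second bucket
    per status, oldest first."""
    hits = []
    if statuses:
        hits = [{"_source": {"cluster_state": {"status": statuses[-1]}}}]
    return {
        "hits": {"total": len(statuses), "hits": hits},
        "aggregations": {"minutes": {"buckets": [
            {"status": {"buckets": [{"key": s, "doc_count": 1}]}}
            for s in statuses
        ]}},
    }
```

Twelve buckets of five seconds each cover the 60-second window, which is why the condition returns false when fewer than twelve buckets are present.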

Watching Memory Usage

This watch runs every minute and takes action if a node in the cluster has averaged 75% or greater heap usage for the past 60 seconds.

  • The watch schedule is set to execute the watch every minute.
  • The watch input gets the average jvm.mem.heap_used_percent for each node from the .marvel-es-1-* indices.
  • The watch condition checks to see if any node’s average heap usage is 75% or greater.
  • The watch action is to send an email. (You could also call a webhook or store the event.)
PUT _watcher/watch/mem_watch
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": [
          ".marvel-es-1-*"
        ],
        "types" : [
          "node_stats"
        ],
        "body": {
          "size" : 0,
          "query": {
            "bool": {
              "filter": {
                "range": {
                  "timestamp": {
                    "gte": "now-2m",
                    "lte": "now"
                  }
                }
              }
            }
          },
          "aggs": {
            "minutes": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "minute"
              },
              "aggs": {
                "nodes": {
                  "terms": {
                    "field": "source_node.name",
                    "size": 10,
                    "order": {
                      "memory": "desc"
                    }
                  },
                  "aggs": {
                    "memory": {
                      "avg": {
                        "field": "node_stats.jvm.mem.heap_used_percent"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "throttle_period": "30m", 
  "condition": {
    "script":  "if (ctx.payload.aggregations.minutes.buckets.size() == 0) return false; def latest = ctx.payload.aggregations.minutes.buckets[-1]; def node = latest.nodes.buckets[0]; return node && node.memory && node.memory.value >= 75;"
  },
  "actions": {
    "send_email": {
      "transform": {
        "script": "def latest = ctx.payload.aggregations.minutes.buckets[-1]; return latest.nodes.buckets.findAll { return it.memory && it.memory.value >= 75 };"
      },
      "email": { 
        "to": "<username>@<domainname>", 
        "subject": "Watcher Notification - HIGH MEMORY USAGE",
        "body": "Nodes with HIGH MEMORY Usage (above 75%):\n\n{{#ctx.payload._value}}\"{{key}}\" - Memory Usage is at {{memory.value}}%\n{{/ctx.payload._value}}"
      }
    }
  }
}

The throttle period prevents notifications from being sent more than once every 30 minutes. You can change the throttle period to receive notifications more or less frequently.

To send email notifications, you must configure at least one email account in elasticsearch.yml. See Configuring Email Services for more information.

In the to field of the email action, specify the email address you want to notify.

This example uses an inline script, which requires you to enable dynamic scripting in Elasticsearch. While this is convenient when you’re experimenting with Watcher, in a production environment we recommend disabling dynamic scripting and using file scripts.
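
The condition only inspects the first node bucket because the terms aggregation is ordered by average memory, descending; the transform then keeps every node at or above the threshold. The following Python sketch mirrors both steps. The function names and the sample payload are assumptions made for illustration, not Watcher APIs.

```python
THRESHOLD = 75  # heap_used_percent threshold used by the watch

# Hypothetical aggregation result: two one-minute buckets, oldest first.
sample = {"aggregations": {"minutes": {"buckets": [
    {"nodes": {"buckets": [
        {"key": "node-a", "memory": {"value": 60.0}},
    ]}},
    {"nodes": {"buckets": [
        {"key": "node-b", "memory": {"value": 91.5}},
        {"key": "node-a", "memory": {"value": 78.0}},
        {"key": "node-c", "memory": {"value": 40.0}},
    ]}},
]}}}


def condition(payload):
    """True if the busiest node in the latest bucket is at or above the threshold."""
    buckets = payload["aggregations"]["minutes"]["buckets"]
    if not buckets:
        return False
    nodes = buckets[-1]["nodes"]["buckets"]
    if not nodes:
        return False
    value = (nodes[0].get("memory") or {}).get("value")
    return value is not None and value >= THRESHOLD


def transform(payload):
    """Return every node bucket in the latest bucket at or above the threshold."""
    latest = payload["aggregations"]["minutes"]["buckets"][-1]
    return [
        node for node in latest["nodes"]["buckets"]
        if (node.get("memory") or {}).get("value") is not None
        and node["memory"]["value"] >= THRESHOLD
    ]
```

With the sample above, the condition is met and the transform keeps node-b and node-a, in that order.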

Watching CPU Usage

This watch runs every minute and takes action if a node in the cluster has averaged 75% or greater CPU usage for the past 60 seconds.

  • The watch schedule is set to execute the watch every minute.
  • The watch input gets the average CPU percentage for each node from the .marvel-es-1-* indices.
  • The watch condition checks to see if any node’s average CPU usage is 75% or greater.
  • The watch action is to send an email. (You could also call a webhook or store the event.)
PUT _watcher/watch/cpu_usage
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": [
          ".marvel-es-1-*"
        ],
        "types" : [
          "node_stats"
        ],
        "body": {
          "size" : 0,
          "query": {
            "bool": {
              "filter": {
                "range": {
                  "timestamp": {
                    "gte": "now-2m",
                    "lte": "now"
                  }
                }
              }
            }
          },
          "aggs": {
            "minutes": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "minute"
              },
              "aggs": {
                "nodes": {
                  "terms": {
                    "field": "source_node.name",
                    "size": 10,
                    "order": {
                      "cpu": "desc"
                    }
                  },
                  "aggs": {
                    "cpu": {
                      "avg": {
                        "field": "node_stats.process.cpu.percent"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "throttle_period": "30m", 
  "condition": {
    "script":  "if (ctx.payload.aggregations.minutes.buckets.size() == 0) return false; def latest = ctx.payload.aggregations.minutes.buckets[-1]; def node = latest.nodes.buckets[0]; return node && node.cpu && node.cpu.value >= 75;"
  },
  "actions": {
    "send_email": { 
      "transform": {
        "script": "def latest = ctx.payload.aggregations.minutes.buckets[-1]; return latest.nodes.buckets.findAll { return it.cpu && it.cpu.value >= 75 };"
      },
      "email": {
        "to": "user@example.com", 
        "subject": "Watcher Notification - HIGH CPU USAGE",
        "body": "Nodes with HIGH CPU Usage (above 75%):\n\n{{#ctx.payload._value}}\"{{key}}\" - CPU Usage is at {{cpu.value}}%\n{{/ctx.payload._value}}"
      }
    }
  }
}

The throttle period prevents notifications from being sent more than once every 30 minutes. You can change the throttle period to receive notifications more or less frequently.

To send email notifications, you must configure at least one email account in elasticsearch.yml. See Configuring Email Services for more information.

In the to field of the email action, specify the email address you want to notify.

This example uses an inline script, which requires you to enable dynamic scripting in Elasticsearch. While this is convenient when you’re experimenting with Watcher, in a production environment we recommend disabling dynamic scripting and using file scripts.
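
The email body is a template whose {{#ctx.payload._value}} section repeats once for each node returned by the transform. The Python sketch below approximates that expansion with plain string formatting. It is not a Mustache engine, render_body is a hypothetical helper, and the way Watcher formats numbers in the rendered email may differ.

```python
def render_body(nodes):
    """Approximate the email body for a list of node buckets, where each
    bucket has a "key" and a "cpu" entry holding the average "value"."""
    lines = ["Nodes with HIGH CPU Usage (above 75%):", ""]
    for node in nodes:
        # One line per node, matching the repeated template section.
        lines.append('"{}" - CPU Usage is at {}%'.format(node["key"], node["cpu"]["value"]))
    return "\n".join(lines) + "\n"


# Hypothetical transform output: the nodes at or above 75% CPU.
high_cpu_nodes = [
    {"key": "node-b", "cpu": {"value": 91.5}},
    {"key": "node-a", "cpu": {"value": 78.0}},
]
body = render_body(high_cpu_nodes)
```

Because the transform filters the list before the action runs, the email only names the nodes that crossed the threshold.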

Watching Open File Descriptors

This watch runs once a minute and takes action if there are nodes that are using 80% or more of the available file descriptors.

  • The watch schedule is set to execute the watch every minute.
  • The watch input gets the average number of open file descriptors on each node from the .marvel-es-1-* indices. The input search returns the top ten nodes with the highest average number of open file descriptors.
  • The watch condition checks whether any node’s average number of open file descriptors is 80% or more of the available file descriptors (system_fd in the watch metadata).
  • The watch action is to send an email. (You could also call a webhook or store the event.)
PUT _watcher/watch/open_file_descriptors
{
  "metadata": {
    "system_fd": 65535,
    "threshold": 0.8
  },
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": [
          ".marvel-es-1-*"
        ],
        "types": "node_stats",
        "body": {
          "size" : 0,
          "query": {
            "bool": {
              "filter": {
                "range": {
                  "timestamp": {
                    "gte": "now-1m",
                    "lte": "now"
                  }
                }
              }
            }
          },
          "aggs": {
            "minutes": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "5s"
              },
              "aggs": {
                "nodes": {
                  "terms": {
                    "field": "source_node.name",
                    "size": 10,
                    "order": {
                      "fd": "desc"
                    }
                  },
                  "aggs": {
                    "fd": {
                      "avg": {
                        "field": "node_stats.process.open_file_descriptors"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "throttle_period": "30m", 
  "condition": {
    "script": "if (ctx.payload.aggregations.minutes.buckets.size() == 0) return false; def latest = ctx.payload.aggregations.minutes.buckets[-1]; def node = latest.nodes.buckets[0]; return node && node.fd && node.fd.value >= (ctx.metadata.system_fd * ctx.metadata.threshold);"
  },
  "actions": {
    "send_email": { 
      "transform": {
        "script": "def latest = ctx.payload.aggregations.minutes.buckets[-1]; return latest.nodes.buckets.findAll({ return it.fd && it.fd.value >= (ctx.metadata.system_fd * ctx.metadata.threshold) }).collect({ it.fd.percent = Math.round((it.fd.value/ctx.metadata.system_fd)*100); it });"
      },
      "email": {
        "to": "<username>@<domainname>", 
        "subject": "Watcher Notification - NODES WITH 80% FILE DESCRIPTORS USED",
        "body": "Nodes using 80% or more of the available file descriptors:\n\n{{#ctx.payload._value}}\"{{key}}\" - Open file descriptors: {{fd.value}} ({{fd.percent}}%)\n{{/ctx.payload._value}}"
      }
    }
  }
}

The throttle period prevents notifications from being sent more than once every 30 minutes. You can change the throttle period to receive notifications more or less frequently.

To send email notifications, you must configure at least one email account in elasticsearch.yml. See Configuring Email Services for more information.

In the to field of the email action, specify the email address you want to notify.

This example uses an inline script, which requires you to enable dynamic scripting in Elasticsearch. While this is convenient when you’re experimenting with Watcher, in a production environment we recommend disabling dynamic scripting and using file scripts.
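
The threshold arithmetic in this watch can be checked with a short Python sketch. The constants come from the watch metadata; the function names and the sample node values are assumptions made for illustration. Note that the watch applies a single system_fd value to every node, so it assumes all nodes share the same file descriptor limit.

```python
import math

SYSTEM_FD = 65535  # metadata.system_fd in the watch
THRESHOLD = 0.8    # metadata.threshold in the watch


def fd_limit(system_fd=SYSTEM_FD, threshold=THRESHOLD):
    """Number of open file descriptors at which the watch fires."""
    return system_fd * threshold


def percent_used(fd_value, system_fd=SYSTEM_FD):
    """Percentage of descriptors in use, rounded half up in the same way
    as Math.round in the transform script (for positive values)."""
    return math.floor(fd_value / system_fd * 100 + 0.5)


def nodes_over_limit(nodes):
    """Keep nodes at or above the limit and attach the percent figure."""
    limit = fd_limit()
    result = []
    for node in nodes:
        value = (node.get("fd") or {}).get("value")
        if value is not None and value >= limit:
            result.append({
                "key": node["key"],
                "fd": {"value": value, "percent": percent_used(value)},
            })
    return result


# Hypothetical node buckets from the latest time bucket.
flagged = nodes_over_limit([
    {"key": "node-a", "fd": {"value": 60000.0}},
    {"key": "node-b", "fd": {"value": 1000.0}},
])
```

With these settings the limit works out to roughly 52,428 open file descriptors, so only node-a in the sample is flagged.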