How to enable Kubernetes alerting with Elastic Observability

In the Kubernetes world, different personas demand different kinds of insights. Developers are interested in granular metrics and debugging information. SREs want to see everything at once so they are notified quickly when a problem occurs and can spot the root cause. In this post, we’ll focus on alerting and provide an overview of how alerts in Elastic Observability can help users quickly identify Kubernetes problems.

Why do we need alerts?

Logs, metrics, and traces are only the foundation of a complete monitoring solution for Kubernetes clusters. Their main goal is to provide debugging information and historical evidence about the infrastructure.

While out-of-the-box dashboards, infrastructure topology views, and log exploration through Kibana are already quite handy for ad hoc analysis, adding notifications and active monitoring of the infrastructure lets users deal with problems as early as possible, and even act proactively to prevent their Kubernetes environments from running into more serious issues.

How can this be achieved? 

By building alerts on top of their infrastructure data, users can leverage that data and correlate it with specific notifications, creating a wide range of possibilities to dynamically monitor and observe their Kubernetes clusters.

In this blog post, we will explore how users can leverage Elasticsearch’s search capabilities to define alerting rules and get notified when a specific condition occurs.

SLIs, alerts, and SLOs: Why are they important for SREs?

For site reliability engineers (SREs), incident response time is tightly coupled with the success of everyday work. Monitoring, alerting, and actions help them discover, resolve, or even prevent issues in their systems.

  • An SLA (Service Level Agreement) is an agreement you create with your users to specify the level of service they can expect.
  • An SLO (Service Level Objective) is an agreement within an SLA about a specific metric like uptime or response time.
  • An SLI (Service Level Indicator) measures compliance with an SLO.
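For example, an SLA might promise customers 99.9% availability per month; the corresponding SLO could state that 99.9% of API requests complete successfully within 500 ms over a rolling 30-day window; and the SLI is the measured percentage of requests that actually meet that target.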

SREs’ day-to-day tasks and projects are driven by SLOs. By ensuring that SLOs are defended in the short term and can be maintained in the medium to long term, we lay the foundation for a stable, working infrastructure.

That said, identifying the high-level categories of SLOs is crucial for organizing an SRE’s work. Within each SLO category, SREs then need the corresponding SLIs that cover the most important aspects of the system under observation. Deciding which SLIs are needed therefore requires additional knowledge of the underlying system infrastructure.

One widely used approach to categorize SLIs and SLOs is the Four Golden Signals method. The categories defined are Latency, Traffic, Errors, and Saturation.

A more specific approach is the RED method, developed by Tom Wilkie, a former Google SRE, based on the Four Golden Signals. The RED method drops the saturation category because it is mainly used for more advanced cases, and because people remember things better when they come in threes.

Focusing on Kubernetes infrastructure operators, we will consider the following groups of infrastructure SLIs/SLOs:

  • Group 1: Latency of the control plane (e.g., the apiserver)
  • Group 2: Resource utilization of the nodes/pods (how much CPU, memory, etc. is consumed)
  • Group 3: Errors (errors in logs or events, or error counts from components, the network, etc.)

Creating alerts for a Kubernetes cluster

Now that we have a clear outline of our goal to define alerts based on SLIs/SLOs, we can dive into defining the alerts themselves. Alerts can be built using Kibana.

See the Elastic documentation for how to create a Kubernetes rule in Kibana.

In this blog, we will define more advanced alerts based on Elasticsearch queries using Watcher. In addition to the examples in this blog, you can read more about Watcher and how to use it properly in the Elastic documentation.

Latency alerts

For this kind of alert, we want to define the basic SLOs for the Kubernetes control plane, ensuring that its core components can serve end users without issues. For instance, high latency in queries against the Kubernetes API server is enough of a signal that action needs to be taken.
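To make this concrete, the sketch below shows what such a latency SLI could look like as a plain Elasticsearch query: it computes the average apiserver request duration over the last 10 minutes. The duration fields (kubernetes.apiserver.request.duration.us.sum and kubernetes.apiserver.request.duration.us.count) and the data stream pattern are assumptions based on the Kubernetes integration’s apiserver data set, so verify them against the exported fields of your version before building an alert on top of them.

```console
# A sketch, not a drop-in alert: average apiserver request latency (in microseconds)
# per 10-minute bucket, computed as total duration divided by request count.
curl -X GET "https://elastic:changeme@localhost:9200/metrics-kubernetes.apiserver-*/_search?pretty" -k -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-10m", "lte": "now" } }
  },
  "aggs": {
    "over_time": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "10m" },
      "aggs": {
        "duration_sum": { "sum": { "field": "kubernetes.apiserver.request.duration.us.sum" } },
        "request_count": { "sum": { "field": "kubernetes.apiserver.request.duration.us.count" } },
        "avg_latency_us": {
          "bucket_script": {
            "buckets_path": { "s": "duration_sum", "c": "request_count" },
            "script": "params.c > 0 ? params.s / params.c : 0"
          }
        }
      }
    }
  }
}
'
```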

Resource saturation

The next group of alerts covers resource utilization. A node’s CPU utilization, or changes in a node’s conditions, is critical information for ensuring that the cluster can smoothly serve the workloads provisioned to run the applications that end users interact with.
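Alongside utilization, node conditions are worth watching. As a starting point, the hedged sketch below counts recent state_node documents where a node does not report Ready; the kubernetes.node.status.ready field is an assumption based on the integration’s state_node data set, so check that it exists in your version.

```console
# Sketch: count state_node documents from the last 5 minutes where the node is not Ready.
curl -X GET "https://elastic:changeme@localhost:9200/metrics-kubernetes.state_node-*/_count?pretty" -k -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "range": { "@timestamp": { "gte": "now-5m", "lte": "now" } } },
        { "term": { "kubernetes.node.status.ready": "false" } }
      ]
    }
  }
}
'
```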

Error detection

Last but not least, we will define alerts based on specific errors, like the network error rate or Pod failures such as the OOMKilled situation. These are very useful indicators for SRE teams, either to detect issues at the infrastructure level or to notify developer teams about problematic workloads. One example that we will examine later is an application running as a Pod that keeps getting restarted because it hits its memory limit. In that case, the owners of this application need to be notified so they can act accordingly.

From Kubernetes data to Elasticsearch queries

With a solid plan for the alerts we want to implement, it’s time to explore the data we have collected from the Kubernetes cluster and stored in Elasticsearch. For this, we will consult the list of available data fields ingested by the Elastic Agent Kubernetes integration (the full list of fields can be found here). Using these fields, we can create various alerts like the ones below (a quick way to check that these fields exist in your own data follows the list):

  • Node CPU utilization
  • Node Memory utilization
  • Network bandwidth utilization
  • Pod restarts
  • Pod CPU/memory utilization
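Before building on any of these fields, it is worth confirming that they are actually present in your data streams. A minimal sketch using the field capabilities API, with the same placeholder credentials used throughout this post, looks like this:

```console
# List the capabilities of all kubernetes.node.* fields across the Kubernetes metrics data streams.
curl -X GET "https://elastic:changeme@localhost:9200/metrics-kubernetes*/_field_caps?fields=kubernetes.node.*&pretty" -k
```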

CPU utilization alert

Our first example uses the CPU utilization fields to calculate each node’s CPU utilization and create an alert. For this alert, we leverage the following metrics:

  • kubernetes.node.cpu.usage.nanocores
  • kubernetes.node.cpu.capacity.cores

The calculation (nodeUsage / 1000000000) / nodeCap, grouped by node name, gives us the CPU utilization of each of our cluster’s nodes: usage is reported in nanocores, so we divide by 1,000,000,000 to convert it to cores before dividing by the node’s CPU capacity.
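For example, if a node reports nodeUsage = 2,500,000,000 nanocores on a node with nodeCap = 4 cores, the result is (2,500,000,000 / 1,000,000,000) / 4 = 0.625, meaning the node is running at 62.5% of its CPU capacity.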

The Watcher definition that implements this query can be created with the following API call to Elasticsearch:

```console
curl -X PUT "https://elastic:changeme@localhost:9200/_watcher/watch/Node-CPU-Usage?pretty" -k -H 'Content-Type: application/json' -d'
{
  "trigger": {
    "schedule": {
      "interval": "10m"
    }
  },
  "input": {
    "search": {
      "request": {
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-10m",
                      "lte": "now",
                      "format": "strict_date_optional_time"
                    }
                  }
                },
                {
                  "bool": {
                    "must": [
                      {
                        "query_string": {
                          "query": "data_stream.dataset: kubernetes.node OR data_stream.dataset: kubernetes.state_node",
                          "analyze_wildcard": true
                        }
                      }
                    ],
                    "filter": [],
                    "should": [],
                    "must_not": []
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": []
            }
          },
          "aggs": {
            "nodes": {
              "terms": {
                "field": "kubernetes.node.name",
                "size": "10000",
                "order": {
                  "_key": "asc"
                }
              },
              "aggs": {
                "nodeUsage": {
                  "max": {
                    "field": "kubernetes.node.cpu.usage.nanocores"
                  }
                },
                "nodeCap": {
                  "max": {
                    "field": "kubernetes.node.cpu.capacity.cores"
                  }
                },
                "nodeCPUUsagePCT": {
                  "bucket_script": {
                    "buckets_path": {
                      "nodeUsage": "nodeUsage",
                      "nodeCap": "nodeCap"
                    },
                    "script": {
                      "source": "( params.nodeUsage / 1000000000 ) / params.nodeCap",
                      "lang": "painless",
                      "params": {
                        "_interval": 10000
                      }
                    },
                    "gap_policy": "skip"
                  }
                }
              }
            }
          }
        },
        "indices": [
          "metrics-kubernetes*"
        ]
      }
    }
  },
  "condition": {
    "array_compare": {
      "ctx.payload.aggregations.nodes.buckets": {
        "path": "nodeCPUUsagePCT.value",
        "gte": {
          "value": 80
        }
      }
    }
  },
  "actions": {
    "log_hits": {
      "foreach": "ctx.payload.aggregations.nodes.buckets",
      "max_iterations": 500,
      "logging": {
        "text": "Kubernetes node found with high CPU usage: {{ctx.payload.key}} -> {{ctx.payload.nodeCPUUsagePCT.value}}"
      }
    }
  },
  "metadata": {
    "xpack": {
      "type": "json"
    },
    "name": "Node CPU Usage"
  }
}
```
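Before waiting for the 10-minute schedule to fire, you can run the watch on demand with the execute watch API and inspect the search payload, the condition outcome, and the actions that would be taken:

```console
# Run the watch immediately and return the full execution result.
curl -X POST "https://elastic:changeme@localhost:9200/_watcher/watch/Node-CPU-Usage/_execute?pretty" -k
```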

OOMKilled Pods detection and alerting

Another Watcher that we will explore is one that detects Pods restarted due to an OOMKilled error. This error is quite common in Kubernetes workloads, and it is useful to detect it early on and inform the team that owns the workload, so they can either investigate issues that could cause memory leaks or simply consider increasing the resources allocated to the workload.

This information can be retrieved from a query like the following:

kubernetes.container.status.last_terminated_reason: OOMKilled
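To check whether such events have occurred recently before wiring up the alert, you can run the same condition as a quick count query (the data stream pattern below assumes the default naming of the kubernetes.state_container data set):

```console
# Count state_container documents from the last 24 hours whose last termination reason was OOMKilled.
curl -X GET "https://elastic:changeme@localhost:9200/metrics-kubernetes.state_container-*/_count?pretty" -k -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "range": { "@timestamp": { "gte": "now-24h", "lte": "now" } } },
        { "query_string": { "query": "kubernetes.container.status.last_terminated_reason: OOMKilled" } }
      ]
    }
  }
}
'
```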

Here is how we can create the respective Watcher with an API call:

```console
curl -X PUT "https://elastic:changeme@localhost:9200/_watcher/watch/Pod-Terminated-OOMKilled?pretty" -k -H 'Content-Type: application/json' -d'
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "*"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-1m",
                      "lte": "now",
                      "format": "strict_date_optional_time"
                    }
                  }
                },
                {
                  "bool": {
                    "must": [
                      {
                        "query_string": {
                          "query": "data_stream.dataset: kubernetes.state_container",
                          "analyze_wildcard": true
                        }
                      },
                      {
                        "exists": {
                          "field": "kubernetes.container.status.last_terminated_reason"
                        }
                      },
                      {
                        "query_string": {
                          "query": "kubernetes.container.status.last_terminated_reason: OOMKilled",
                          "analyze_wildcard": true
                        }
                      }
                    ],
                    "filter": [],
                    "should": [],
                    "must_not": []
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": []
            }
          },
          "aggs": {
            "pods": {
              "terms": {
                "field": "kubernetes.pod.name",
                "order": {
                  "_key": "asc"
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "array_compare": {
      "ctx.payload.aggregations.pods.buckets": {
        "path": "doc_count",
        "gte": {
          "value": 1,
          "quantifier": "some"
        }
      }
    }
  },
  "actions": {
    "ping_slack": {
      "foreach": "ctx.payload.aggregations.pods.buckets",
      "max_iterations": 500,
      "webhook": {
        "method": "POST",
        "url": "https://hooks.slack.com/services/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",
        "body": "{\"channel\": \"#k8s-alerts\", \"username\": \"k8s-cluster-alerting\", \"text\": \"Pod {{ctx.payload.key}} was terminated with status OOMKilled.\"}"
      }
    }
  },
  "metadata": {
    "xpack": {
      "type": "json"
    },
    "name": "Pod Terminated OOMKilled"
  }
}
```
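Once the watch is in place, the standard Watcher APIs can be used to inspect and manage it, for example:

```console
# Inspect the watch definition and its current status.
curl -X GET "https://elastic:changeme@localhost:9200/_watcher/watch/Pod-Terminated-OOMKilled?pretty" -k

# Temporarily disable the watch (re-enable it later with _activate).
curl -X POST "https://elastic:changeme@localhost:9200/_watcher/watch/Pod-Terminated-OOMKilled/_deactivate?pretty" -k

# Remove the watch entirely.
curl -X DELETE "https://elastic:changeme@localhost:9200/_watcher/watch/Pod-Terminated-OOMKilled?pretty" -k
```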

From Kubernetes data to alerts summary

So far, we have seen how to start from plain Kubernetes fields, use them in Elasticsearch queries, and build Watchers and alerts on top of them.

You can explore more data combinations and build further queries and alerts following the examples provided here. A full list of alerts is available, as well as a basic scripted way of installing them.

Many of these examples come with simple actions defined that only log messages into the Elasticsearch logs. However, you can use more advanced and useful outputs, like Slack’s webhooks:

```json
"actions": {
    "ping_slack": {
      "foreach": "ctx.payload.aggregations.pods.buckets",
      "max_iterations": 500,
      "webhook": {
        "method": "POST",
        "url": "https://hooks.slack.com/services/T04SW3JHXasdfasdfasdfasdfasdf",
        "body": "{\"channel\": \"#k8s-alerts\", \"username\": \"k8s-cluster-alerting\", \"text\": \"Pod {{ctx.payload.key}} was terminated with status OOMKilled.\"}"
      }
    }
  }
```
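If you prefer not to embed the webhook URL in the watch itself, Watcher also ships with a native Slack action that reads the webhook from a preconfigured notification account (a secure_url stored in the Elasticsearch keystore under xpack.notification.slack.account.<account-name>.secure_url). A sketch of the equivalent action, assuming an account named monitoring has been configured, could look like this:

```json
"actions": {
    "ping_slack": {
      "foreach": "ctx.payload.aggregations.pods.buckets",
      "max_iterations": 500,
      "slack": {
        "account": "monitoring",
        "message": {
          "to": [ "#k8s-alerts" ],
          "text": "Pod {{ctx.payload.key}} was terminated with status OOMKilled."
        }
      }
    }
  }
```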

The result is a Slack message posted to the configured channel with the Pod name and the termination reason.

Next steps

In our next steps, we would like to make these alerts part of our Kubernetes integration, which means the predefined alerts would be installed when users install or enable the Kubernetes integration. At the same time, we plan to implement some of these as Kibana’s native SLIs, giving users the option to quickly define SLOs on top of the SLIs through a nice user interface. If you’re interested in learning more, follow the public GitHub issues and feel free to provide your feedback.

For those who are eager to start using Kubernetes alerting today, here is what you need to do:

  1. Make sure that you have an Elastic cluster up and running. The fastest way to deploy your cluster is to spin up a free trial of Elasticsearch Service.
  2. Install the latest Elastic Agent on your Kubernetes cluster by following the respective documentation.
  3. Install our provided alerts that can be found at https://github.com/elastic/integrations/tree/main/packages/kubernetes/docs or at https://github.com/elastic/k8s-integration-infra/tree/main/scripts/alerting

Of course, if you have any questions, remember that we are always happy to help on the Discuss forums.