
Alerting and anomaly detection for uptime and reliability

Being able to easily monitor the health of all your sites and services from multiple global locations is a powerful tool for site reliability. However, no one wants to sit and stare at a status dashboard all day. Naturally, teams want to be alerted when there is an issue. We can do that with alerting in Kibana. And when coupled with Elastic machine learning, alerts can be generated automatically from the anomalies it detects. That’s the power of Elastic Observability.

In this blog, we’re going to build on the recent service monitoring and availability made simple post to configure alerting and anomaly detection for sites, services, and APIs. If you haven’t read that blog yet, start there first to learn all about Elastic Uptime and to get your test environment deployed for free in Elastic Cloud.

The speed of Elastic 

To quote one of my favorite movies: Life moves pretty fast. If you don’t stop and look around once in a while, you could miss it. Nowhere is that more true than at Elastic. Since the first part of this blog was published, Elastic 7.9 was released. Among the many fantastic updates in that release were a few enhancements to Elastic Uptime, including the addition of alerting on anomaly detection, which we will discuss below. Another change you will see since the first post is the addition of availability percentages on a monitor’s detail page.

Availability percentages on a monitor detail page

Availability information is now provided as an overall metric as well as on a per-location breakdown. The world map can still be accessed with the geo marker button in the upper right corner.


Status alert

The first type of alert we are going to set up is a Monitor Status alert. This will automatically create alerts based on errors and outages.

From the overview page, select the Alerts dropdown, then select Create alert > Monitor status alert. A flyout where you can create an alert will appear on the right side of the screen.

Create an alert from the Alerts dropdown

Give your alert the name World Wide Status and give it the tag global so we can group it with other alerts we may create later. We want to check the status every 5 minutes, looking for any monitor that shows Down more than 3 times in 10 minutes. When the condition is met, we want the alert triggered. And we only want to be notified every 30 minutes so we’re not constantly flooded with alert notifications.

The Create Alert flyout
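The condition configured in the flyout can be sketched in a few lines of Python. This is only an illustration of the logic (more than 3 Down checks in 10 minutes, notify at most every 30 minutes), not how Kibana evaluates alerts internally. The function and constant names are made up for this example.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Values taken from the alert configured above; the names are hypothetical.
DOWN_THRESHOLD = 3                     # alert when the Down count exceeds this
WINDOW = timedelta(minutes=10)         # look-back window for Down checks
NOTIFY_EVERY = timedelta(minutes=30)   # minimum gap between notifications


def condition_met(down_times: List[datetime], now: datetime) -> bool:
    """Return True if more than DOWN_THRESHOLD Down checks fall inside WINDOW."""
    recent = [t for t in down_times if now - WINDOW <= t <= now]
    return len(recent) > DOWN_THRESHOLD


def should_notify(
    down_times: List[datetime], now: datetime, last_notified: Optional[datetime]
) -> bool:
    """Apply the condition, then suppress repeats inside the throttle period."""
    if not condition_met(down_times, now):
        return False
    return last_notified is None or now - last_notified >= NOTIFY_EVERY
```

Keeping the condition and the notification throttle separate mirrors the flyout: the alert can stay active while notifications are only sent every 30 minutes.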

This alert covers all the monitors we have running; however, we can easily create alerts for one or more specific monitors and have different conditions for each.

The final step in creating an alert is selecting one or more actions to take when the alert is triggered.

Selecting an action type

A lot of us use Slack as our primary channel of communication at work, so let’s create a Slack action. If you don’t have your desired connector created, you will see an option to “Create a connector.”

Setting up a Slack connector

The provided alert message is pretty good, but let’s customize it a little bit and add some Slack formatting. To make that easier, we can select the context dropdown to see a list of available variables we can use to enrich the content of our alert. 
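To get a feel for what a formatted message looks like, here is a rough sketch of the JSON body a Slack incoming webhook accepts, with Slack's `*bold*` and backtick formatting applied. The fields `monitor_name`, `status`, and `locations` are placeholders for whichever context variables you pick from the dropdown. They are not the actual variable names Kibana exposes.

```python
import json
from typing import List


def build_slack_payload(monitor_name: str, status: str, locations: List[str]) -> str:
    """Return the JSON body for a Slack incoming-webhook post.

    All three arguments are hypothetical stand-ins for alert context values.
    """
    # Wrap each location in backticks so Slack renders it as inline code.
    location_list = ", ".join(f"`{loc}`" for loc in locations)
    # Asterisks render as bold in Slack message text.
    text = f":rotating_light: *{monitor_name}* is *{status}* from {location_list}"
    return json.dumps({"text": text})
```

In the connector itself you would type the same formatting characters straight into the message box and let Kibana substitute the variables.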

After hitting Save, you’ll see the confirmation!

Confirmation message

Let's break something!

To test out this new alert, we can wait for a legitimate site issue ... or we can break something! To be safe, you should only break something that won’t affect others. For this example, I’m going to disable my internal TCP port listener script.

pkill -f echo_port_listener.py

And after a few seconds, we see my internal TCP monitor is reporting down as expected. Try it out for yourself!

A dashboard showing that the internal TCP monitor is down

Let’s click the dropdown on the Overview page to get more information about the monitor.

The Overview page, showing more information about the monitor

You can see that the monitor running on my us-central-c host is Down, but the australia-southwest1 and europe-west2 internal TCP listeners are still Up. You can also see that the issue was Connection Refused.
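If you want to reproduce the Connection Refused result by hand, a TCP check is easy to sketch with Python's standard library. This is a simplified stand-in for what a TCP monitor does, not Heartbeat's implementation. The function name and the status strings are invented for this example.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 2.0) -> str:
    """Try to open a TCP connection and report the result as a status string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "up"
    except ConnectionRefusedError:
        # Nothing is listening on the port, e.g. after the listener was killed.
        return "down: connection refused"
    except OSError as exc:
        # Timeouts, unreachable hosts, DNS failures, and similar errors.
        return f"down: {exc}"
```

Calling `tcp_check` against the listener's host and port before and after the `pkill` command should flip the result from up to down.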

Clicking on the monitor, we can get additional information. On the world map, the US location is red while London and Australia are still gray (Up). Pings over time shows the counts of Up and Down pings, and in the History section we can expand the Down status and again see the connection refused error.

Pings over time and History in Elastic Uptime

And after a few minutes, I’m notified in Slack that my connection is down.

A Slack notification showing that the connection is down

Certificate alerts

You may have noticed in the Alerts dropdown that there is a second alert type, a TLS alert.

I discussed the importance of monitoring your site TLS certificates in the previous post, so let’s set up an alert to notify us when any of the certs on sites we’re monitoring are close to expiring.

The alerting flyout panel is similar to the one we saw in the previous alert. In this case, the middle section provides TLS-specific settings. Since we don’t need to check the expiration date of certs as frequently as a status alert, let’s check every 6 hours and notify every 12 hours. The thresholds used in the alert for expiration within range and older than range are configured in the settings page, as they are also used for visual alerts. You can further modify the alert settings however you’d like.

The default message looks good, so hit Save. The next time the alert runs, it will pick up one of the sites we’re monitoring that has a certificate expiring soon and will send an alert in Slack. This way we can avoid potentially costly outages to our site (and damage to our reputation).

A Slack notification showing that a certificate is expiring
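For the curious, the expiration check behind an alert like this can be approximated with Python's `ssl` module. This is a sketch of the idea only, not the Uptime implementation. The 30-day threshold is an assumed value for illustration, so use whatever your settings page shows.

```python
import socket
import ssl
from datetime import datetime, timezone

EXPIRY_THRESHOLD_DAYS = 30  # assumed value; match it to your own settings


def fetch_not_after(host: str, port: int = 443) -> str:
    """Return the notAfter field of the certificate presented by host:port."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and the notAfter timestamp."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def expiring_soon(
    not_after: str, now: datetime, threshold_days: int = EXPIRY_THRESHOLD_DAYS
) -> bool:
    """True when the certificate expires in fewer than threshold_days days."""
    return days_until_expiry(not_after, now) < threshold_days
```

A call such as `expiring_soon(fetch_not_after("example.com"), datetime.now(timezone.utc))` ties the pieces together, with `example.com` standing in for one of your own sites.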

Anomaly detection

Knowing when your website or endpoint is down is extremely important. If your customer can’t load your webpage, they can’t purchase products or interact with the services you provide. But knowing when your site is slowing down can be just as important. Today, with everyone expecting websites to instantly load, having a slow webpage response time can just as easily drive a visitor to a competitor's website. 

With machine learning integrated right into Elastic Uptime, you can enable anomaly detection with one click right from the Monitor duration panel of each monitor. Each monitor location is modelled individually, and when a monitor's duration is unusual for that time of day, an anomaly is recorded and highlighted right on the Monitor duration chart.

To enable anomaly detection, simply click on a monitor to go to the details page and in the Monitor duration panel select Enable anomaly detection. That’s it!

Enabling anomaly detection

As mentioned above, one of the new features in the Elastic 7.9 release was the ability to create an alert when a monitor duration anomaly is detected. As part of automatically creating an anomaly detection job, an Alert Creation flyout will appear. This will allow you to configure which severity level to create an alert on. The flyout is the same as the alert we created above with the exception of the anomaly section. We will leave the default level of “Critical” to trigger an alert. We will use the same Slack connector, which is pre-populated with an anomaly-specific message.

Uptime duration anomaly

When an anomaly is detected, it is highlighted right on the Monitor duration chart along with the duration times. The colors represent the severity of the anomaly (red means critical, yellow means minor). In addition, since we configured a Slack alert, when a critical anomaly is detected we will be alerted about it!

Monitor duration chart
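Severity levels correspond to bands of the anomaly score (0 to 100) that Elastic machine learning assigns. Here is a small sketch of how a score could map to a level and to an alert decision. The band boundaries follow the commonly documented ranges, but treat them as an assumption. The function names are invented for this example.

```python
# (lower bound, label) pairs, highest band first. Boundaries are assumed from
# the commonly documented severity ranges, not read from a running job.
SEVERITY_BANDS = [(75, "critical"), (50, "major"), (25, "minor"), (0, "warning")]


def severity(score: float) -> str:
    """Map an anomaly score to its severity label."""
    for floor, label in SEVERITY_BANDS:
        if score >= floor:
            return label
    raise ValueError("score must be non-negative")


def should_alert(score: float, alert_level: str = "critical") -> bool:
    """True when the score reaches the configured severity level or higher."""
    floors = {label: floor for floor, label in SEVERITY_BANDS}
    return score >= floors[alert_level]
```

With the default level of Critical, a score of 60 would be highlighted on the chart as major but would not trigger the Slack alert.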

The anomaly detection job on the back end is the same as any other custom job you may create. The integration simply allows for ease of creation and viewing the anomalies along with the source data. You can view more information about the job and its history in the ML UI, which is easily accessed from the Anomaly detection dropdown in the Monitor duration screen. Here we can also create an alert (if we decided against it originally) or deactivate the job altogether.

Learn more about Elastic’s anomaly detection in our blog post on identifying anomalies, influencers, and root causes.

Wrap up time

Over these two blog posts we’ve configured Heartbeat to monitor multiple services from multiple global locations. We’ve walked through using Elastic Uptime to easily identify when there are disruptions to those services. We enabled anomaly detection on our monitors with one click, allowing us to easily see when a service is performing unusually slowly. And, so we can go about our other work without having to keep an eye on all this, we enabled alerting right from within Uptime so we can be notified when action is needed.

If you’ve just been reading along, I encourage you to try Uptime out for yourself. The easiest way to get started is to sign up for a free trial of Elasticsearch Service and install Heartbeat on your laptop — you’ll be observing in no time. 

Now that you have Uptime covered, you might be interested in identifying and monitoring key metrics for your hosts and systems.