What’s the first thing that comes to your mind when you think about failing services? The adrenaline flowing through your body while you try to stay calm and find the root cause of the issue. The anger caused by not being able to find even a single hint that could help you. The disbelief when you finally find every domino that fell and led to exactly this problem. The relief when you fix it and relax all the muscles you have been tensing for the past twenty minutes.
Introduction to Icinga
Icinga is a tool that actively monitors your hosts and services. The results of these checks can be stored in a database and are used to decide who will get notified about which problem. Instead of creating static configuration, you add rules that create your checks dynamically based on the characteristics of your hosts. By using the RESTful API, changes can be made automatically and remotely. The support for zones enables you to distribute your monitoring environment and scale it as your infrastructure grows.
The Icinga API allows modifications of the running configuration and also publishes lots of information about almost everything. Icingabeat is a Beat that fetches data from the Icinga API and forwards it to Logstash or Elasticsearch. It supports two modes that run independently and collect various information.
Collecting Icinga events is one of the features of Icingabeat. Events are results of your service checks, notifications that have been sent, triggered downtimes or created comments. There are plenty of other event types. Basically, an event is something that happened inside Icinga. The Icinga API transports these events through an HTTP stream. Every user with sufficient permissions can read them. Users can also be restricted to seeing only some of these events. Types and filters let you configure exactly which events you want to store in Elasticsearch.
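To make the idea of types and filters on an HTTP stream more concrete, here is a minimal Python sketch. It assumes the Icinga 2 event stream endpoint (/v1/events) accepts a JSON body with a queue name, a list of types and an optional filter, and that it answers with one JSON object per line. The function names and the queue name are hypothetical, and this is not how Icingabeat itself is implemented.

```python
import json


def build_event_request(queue, types, event_filter=None):
    """Build the JSON body for a POST to /v1/events (assumed request layout)."""
    body = {"queue": queue, "types": list(types)}
    if event_filter:
        body["filter"] = event_filter
    return body


def iter_events(lines):
    """Yield one decoded event per non-empty line of the stream.

    `lines` can be any iterable of str or bytes, for example an open
    HTTP response object that is iterated line by line.
    """
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if not line:
            # Skip blank lines between events.
            continue
        yield json.loads(line)


if __name__ == "__main__":
    request = build_event_request(
        queue="icingabeat-example",  # hypothetical queue name
        types=["CheckResult", "StateChange", "Notification"],
        event_filter='match("webserver-*", event.host)',
    )
    print(json.dumps(request, indent=2))

    # Two hand-written sample lines standing in for a real stream.
    sample = [
        b'{"type": "CheckResult", "host": "webserver-1"}\n',
        b"\n",
        b'{"type": "Notification", "host": "webserver-2"}\n',
    ]
    for event in iter_events(sample):
        print(event["type"], event["host"])
```

In a real setup the request body would be sent with HTTP basic authentication and an "Accept: application/json" header, and the response would be read line by line for as long as the connection stays open.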
The status poller is the second feature of Icingabeat. It runs independently of the event stream and periodically polls status information about the Icinga daemon itself. This information gives you insight into how your monitoring environment is performing.
Correlating Logs with Monitoring Data
When you are notified about a problem, there’s a high chance that you won’t know the exact cause of the issue just by reading the notification. The typical procedure for finding the root cause of a failing service includes searching for answers in the logs. Monitoring data and logging data are two powerful sources full of information about current activities in your environment.
Storing your monitoring data in the same place as your logging data, in Elasticsearch, allows you to perform searches on both sources and view the results in one dashboard. You can search for a specific host and see in a timeline which errors were logged and which services were affected at the same time. You can find out who was notified about the issue and when they started responding to it.
Log events collected with Filebeat are stored in a slightly different structure than Icingabeat events. Therefore, it makes sense to create separate visualizations for each purpose. One table could show all authentication logs and another could list the failing services detected by Icinga. Afterwards you can combine the visualizations in one dashboard and search for hosts.
You would query the Filebeat data like this:
type:log AND source:"/var/log/auth.log"
And the Icingabeat data like this:
type:icingabeat.event.checkresult AND NOT check_result.state:0
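If you want to run both searches outside of Kibana, the two query strings can be combined into one Elasticsearch search request. The following Python sketch only builds the request body using the query_string query. The function name, the result size and the sort on @timestamp are assumptions made for illustration.

```python
import json

# The two query strings from the text above.
FILEBEAT_AUTH_QUERY = 'type:log AND source:"/var/log/auth.log"'
ICINGA_PROBLEM_QUERY = "type:icingabeat.event.checkresult AND NOT check_result.state:0"


def build_search_body(*queries, size=50):
    """Combine several Lucene query strings with OR into one search body."""
    combined = " OR ".join(f"({query})" for query in queries)
    return {
        "size": size,
        # Assumes both Beats write the usual @timestamp field.
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {"query_string": {"query": combined}},
    }


if __name__ == "__main__":
    body = build_search_body(FILEBEAT_AUTH_QUERY, ICINGA_PROBLEM_QUERY)
    print(json.dumps(body, indent=2))
```

The resulting body could be sent to the _search endpoint of the indices that hold your Filebeat and Icingabeat data. Which index names those are depends on your own setup.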
Being notified about problems in your infrastructure is mandatory and helps you to keep the business up and running. Receiving false positives too often, however, will make you less attentive to those notifications. Furthermore, too many false positives will make your monitoring useless for you and everyone who uses it. Finding out which services cause the most noise can help you identify what is wrong with your monitoring. Statistics about which host or service sent the most notifications to which person show you which changes you need to make to reduce the noise.
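As a rough illustration of such noise statistics, the following Python sketch counts notifications per host, service and recipient. It assumes that notification events carry the fields type, host, service and users, and that the type value is icingabeat.event.notification. Check these names against the documents in your own index before relying on them.

```python
from collections import Counter


def count_notifications(events):
    """Count notifications per (host, service, user) combination."""
    counts = Counter()
    for event in events:
        # Assumed type value, following the naming of the check result events.
        if event.get("type") != "icingabeat.event.notification":
            continue
        host = event.get("host", "unknown")
        # Host notifications carry no service name.
        service = event.get("service") or "(host)"
        for user in event.get("users", []):
            counts[(host, service, user)] += 1
    return counts


if __name__ == "__main__":
    # Hand-written sample events, not real data.
    sample = [
        {"type": "icingabeat.event.notification", "host": "web-1", "service": "http", "users": ["alice"]},
        {"type": "icingabeat.event.notification", "host": "web-1", "service": "http", "users": ["alice", "bob"]},
        {"type": "icingabeat.event.checkresult", "host": "web-1", "service": "http"},
    ]
    for (host, service, user), count in count_notifications(sample).most_common():
        print(f"{host} / {service} -> {user}: {count}")
```

In Kibana, a terms aggregation on the same fields would give you an equivalent table without any code.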
Monitor the Monitor
Have you ever wondered how your monitoring system actually performs? How many checks does it run per minute? How many agents were added during the last month? To decide if you need more capacity in your monitoring environment, you need to know how your current setup is doing. Icingabeat collects all information about your running Icinga instances from their API. It helps you to plan your capacities, but also to spot misbehaviours. If, for example, the average execution time is suddenly rising, you probably want to review that monitoring plugin you just wrote the other day.
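A sudden rise in the average execution time can also be spotted programmatically. The sketch below is one simple way to do it, assuming you have already extracted a chronological list of average execution times (in seconds) from the status poller data. The function name, the window size and the threshold factor are arbitrary choices for illustration.

```python
from statistics import mean


def execution_time_rising(samples, recent=3, factor=1.5):
    """Return True if the latest samples are clearly above the earlier baseline.

    samples: average execution times in chronological order.
    recent:  number of newest samples to compare against the rest.
    factor:  how many times the baseline the recent mean must exceed.
    """
    if len(samples) <= recent:
        # Not enough history to form a baseline.
        return False
    baseline = mean(samples[:-recent])
    latest = mean(samples[-recent:])
    return baseline > 0 and latest > factor * baseline


if __name__ == "__main__":
    # Made-up values, not real measurements.
    history = [0.20, 0.20, 0.21, 0.20, 0.50, 0.60, 0.70]
    print(execution_time_rising(history))
```

A fixed factor is crude, but it is often enough to tell you that a recently deployed plugin deserves a second look.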
On Ubuntu, for example, install Icingabeat as follows:
root@localhost:~# wget https://github.com/Icinga/icingabeat/releases/download/v1.0.0/icingabeat-1.0.0-amd64.deb
root@localhost:~# dpkg -i icingabeat-1.0.0-amd64.deb
On Linux systems, the configuration is located in /etc/icingabeat/icingabeat.yml. The connection parameters define the location of your Icinga installation and the API credentials. Make sure the API user you have created in your Icinga configuration has sufficient permissions to read the data you’d like to collect. You can decide which events you want to receive by setting types and configuring filters. The interval of the status poller is configurable as well. To disable the event stream completely, comment out the whole section or leave the types empty. To disable status polling, set an interval of 0s. See below for an example configuration:
icingabeat:
  host: "imaster-1.example.com"
  port: 5665
  user: "icinga"
  password: "supersecret"
  skip_ssl_verify: false
  eventstream:
    types:
      - CheckResult
      - StateChange
      - Notification
    filter: 'match("webserver-*", event.host)'
    retry_interval: 10s
  statuspoller:
    interval: 10s
When Icingabeat is up and running, you can import some predefined dashboards. They are meant to give you some inspiration before you start exploring the data yourself. For your convenience, Icingabeat includes a script that you can use for the import.
root@localhost:~# wget https://github.com/Icinga/icingabeat/releases/download/v1.0.0/icingabeat-dashboards-1.0.0.zip
root@localhost:~# unzip icingabeat-dashboards-1.0.0.zip -d /tmp
root@localhost:~# /usr/share/icingabeat/scripts/import_dashboards -dir /tmp/icingabeat-dashboards-1.0.0 -es http://127.0.0.1:9200
This will give you a set of saved searches, visualizations and dashboards in Kibana, the same ones that were used in this blog post.
Monitoring data is essential when searching for failures in a technical environment, and so is logging data. Combining them in one interface may not solve all the issues operators have in their daily business, but it will lower the level of stress we have when searching for bugs and misbehaviours. By watching the performance of the monitoring setup, operators can prevent bottlenecks in advance.
Blerim Sheqa works at NETWAYS, a company dedicated to open source software. He used to work as a Systems Engineer and help customers with their monitoring, logging and configuration management. Now, as a Product Manager, he takes care of integrations and technical partners in the Icinga project.