We are an operations team at 1&1 Internet SE working on executable and measurable BPMN (Business Process Model and Notation) processes that fulfill orders for all hosting products. An order process starts when a customer confirms an order with a click on the “Confirmation” button in the 1&1 shop. As part of the process, the customer details are checked, the contract is created, all technical features are provisioned, and the customer communication is taken care of.
In the past, log file analysis was time consuming and only possible for single customer orders. With the importance of log files in mind, we searched for a solution to analyze our log files faster, more accurately, and without grep commands. Using Elasticsearch, Logstash, and Kibana began as an experiment, but it has quickly turned into a tool that shapes and optimizes our everyday work.
What drives operations
So what do we operate? We run 10 BPMN processes with approximately 50,000 process instances a week. The logs generated by these processes amount to about 50 GB a day and contain technical as well as business information. The Elastic Stack gives us the ability to analyze these logs in real time and provides transparency on customer orders.
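Before the logs can be analyzed in Elasticsearch, each raw line has to be turned into a structured document. The sketch below shows the idea in Python; the log layout, field names, and regular expression are illustrative assumptions (the post does not show the real format), roughly what a Logstash grok filter would produce before indexing.

```python
import json
import re

# Hypothetical log line layout -- the real 1&1 format is not shown in the
# post, so this pattern is an illustrative assumption.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>\w+) "
    r"process=(?P<process>\S+) instance=(?P<instance>\S+) "
    r'msg="(?P<message>[^"]*)"'
)

def parse_log_line(line: str) -> dict:
    """Turn one raw log line into a structured document, similar to what a
    Logstash grok filter emits before the event is indexed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unparseable log line: {line!r}")
    return match.groupdict()

line = ('2015-11-02T10:15:30Z ERROR process=hosting-order '
        'instance=4711 msg="contract creation failed"')
print(json.dumps(parse_log_line(line), indent=2))
```

Once events carry discrete fields such as `process` and `message`, Kibana can aggregate and filter on them instead of us grepping flat files.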
As an operations team, we are driven by errors and incidents that affect customer orders. Our primary goal is customer satisfaction, which we try to maintain by observing process SLAs such as “in time provisioning” of customer orders. To make meaningful statements about the health of our processes, we needed an overview as well as a categorization of the errors in them. The questions we needed to answer were:
- What errors occur in our processes?
- Is an error process specific or is it a common problem?
- What is the total number of errors and how did it evolve?
- Are these errors known, and is anybody working on them?
- Are these errors known, and is there a known workaround?
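Most of these questions boil down to counting and grouping error events. The sketch below answers two of them in plain Python over a handful of toy documents; in Kibana the same result comes from terms aggregations on the indexed fields. The field names and sample data are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy error documents standing in for indexed log events; the field names
# (process, error) are illustrative assumptions, not the real mapping.
errors = [
    {"process": "hosting-order", "error": "contract creation failed"},
    {"process": "hosting-order", "error": "dns timeout"},
    {"process": "domain-order",  "error": "dns timeout"},
    {"process": "mail-order",    "error": "dns timeout"},
]

# Total number of errors and counts per process -- in Kibana this is a
# simple terms aggregation on the process field.
per_process = Counter(doc["process"] for doc in errors)

# In which processes does each error occur?  An error seen in several
# processes is a common problem rather than process specific.
processes_by_error = defaultdict(set)
for doc in errors:
    processes_by_error[doc["error"]].add(doc["process"])

common_errors = {e for e, ps in processes_by_error.items() if len(ps) > 1}
print(len(errors), dict(per_process), common_errors)
```

Tracking these counts over time (the Kibana date histogram) then answers how the total number of errors evolves.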
Extending the Elastic Stack
Not all of this information can be taken from our process logs. To answer our questions completely, we needed to enrich our application logs with information from other sources such as our issue tracker, databases, etc.
The data sources in detail are:
- Error pattern configuration (knowledge database, manually added)
- Process information (productive process information from different databases)
To condition this information, we use the ETL tool Pentaho Kettle.
Image 1: Process for conditioning and loading data to Elasticsearch
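The core of this conditioning step is matching each error event against the manually maintained error pattern configuration and attaching ticket and workaround information before the event is loaded into Elasticsearch. A minimal Python sketch of that lookup follows; the catalogue entries, field names, and ticket IDs are invented placeholders, not our real knowledge base.

```python
# Invented pattern catalogue standing in for the manually maintained
# error pattern configuration (knowledge database).
ERROR_PATTERNS = [
    {"pattern": "dns timeout", "ticket": "OPS-123",
     "workaround": "retry provisioning"},
    {"pattern": "contract creation failed", "ticket": "OPS-456",
     "workaround": None},
]

def enrich(event: dict) -> dict:
    """Attach ticket/workaround info when the message matches a known
    error pattern; flag the event as known or unknown either way."""
    for entry in ERROR_PATTERNS:
        if entry["pattern"] in event.get("message", ""):
            event["ticket"] = entry["ticket"]
            event["workaround"] = entry["workaround"]
            event["known"] = True
            return event
    event["known"] = False
    return event

print(enrich({"process": "domain-order",
              "message": "dns timeout while provisioning"}))
```

After enrichment, the dashboard can immediately show whether an error is known, who is working on it, and whether a workaround exists.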
Dashboard as a tool
Our Kibana dashboard helps us work with our processes every day.
Image 2: Kibana dashboard - Overview
(1) Number of errors by process (pie and table)
(2) Number of errors by issue tool ticket (pie and table)
(3) Number of errors by description text (pie and table)
(4) Full list of errors
Image 3: Kibana dashboard - Linking
To make operations more efficient, the dashboard integrates other tools, for example by linking to related bugs in our issue tracker.
In detail the dashboard provides:
(1) Link to issue tracker
(2) Link to central support tool showing detailed contract and customer information
(3) Link to Kibana dashboard showing all log information concerning that order (for drilldown)
(4) Link to information tool showing detailed order and customer information
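These links are simply URLs built from fields of the enriched events, such as the ticket ID or the contract ID. A small sketch of that wiring is shown below; the base URLs and query parameters are hypothetical, since the post does not name the internal tools.

```python
from urllib.parse import quote, urlencode

# Hypothetical internal addresses -- illustrative assumptions only.
ISSUE_TRACKER = "https://issues.example.internal/browse/{ticket}"
SUPPORT_TOOL = "https://support.example.internal/contract?{query}"

def issue_link(ticket: str) -> str:
    """Deep link from a dashboard row to the related issue tracker ticket."""
    return ISSUE_TRACKER.format(ticket=quote(ticket))

def support_link(contract_id: str) -> str:
    """Deep link to the support tool, filtered to one contract."""
    return SUPPORT_TOOL.format(query=urlencode({"contractId": contract_id}))

print(issue_link("OPS-123"))
print(support_link("4711"))
```

In Kibana such links can be rendered per document, so an operator jumps from an error straight to the ticket or the affected contract.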
We have now been using the Elastic Stack for two months, and the outcome is significant. Here is what is better now:
- Faster error detection
- More than 90 percent of our errors are addressed (up from about 60 percent)
- Manual reporting is obsolete because the dashboard is visible to everyone
- One central tool/dashboard that connects our operations tooling
- Visibility of errors across all processes (not just process-specific views)
- The Elastic Stack is also used by non-technical people, who design their own dashboards
In the future…
…we will concentrate on using the Elastic Stack for real-time monitoring. For example, a visualization of order content will help us cluster our orders to estimate the impact of campaigns (e.g., a new product release). For this we will use Apache Spark for data processing and computation. Spark will calculate different KPIs (Key Performance Indicators) for us, e.g., the duration of different process periods.
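To make the duration KPI concrete, here is a minimal sketch of what such a computation looks like for a single process instance: the elapsed time between two milestone events. This is plain Python standing in for the planned Spark job, and the event shape, step names, and timestamps are assumptions for illustration.

```python
from datetime import datetime

# Invented milestone events for one process instance.
events = [
    {"instance": "4711", "step": "order_received",
     "ts": "2015-11-02T10:00:00"},
    {"instance": "4711", "step": "provisioning_done",
     "ts": "2015-11-02T10:42:00"},
]

def period_duration(events, start_step, end_step):
    """Seconds elapsed between two milestones of a process instance --
    the kind of per-period KPI the Spark job would compute at scale."""
    ts = {e["step"]: datetime.fromisoformat(e["ts"]) for e in events}
    return (ts[end_step] - ts[start_step]).total_seconds()

print(period_duration(events, "order_received", "provisioning_done"))
```

At scale, the same grouping-by-instance and subtraction would run as a distributed Spark computation over all indexed events.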
Benjamin Speckmann is Business Process Manager at 1&1 Internet SE. Previously he worked as a consultant at NTT DATA Deutschland GmbH. Since the beginning of his career, Benjamin has worked with executable BPMN (Business Process Model and Notation) processes. Monitoring, data analysis, and reporting of KPIs have always been part of his work. He holds an M.S. in Computer Science from Eastern Michigan University as well as the University of Applied Sciences in Karlsruhe.
Christian Hatz is Advanced System Analyst at 1&1 Internet SE.
Previously, Christian worked as a requirements engineer delivering business concepts for offshore-developed business process monitoring solutions.
He is experienced in JBoss EAP and BPM-dominated infrastructure. In 2015, Christian implemented the Elastic Stack for efficient monitoring and analysis of technical and business data in order management.