Engineering

How to perform incident management with ServiceNow and Elasticsearch

Welcome back! In the last blog we set up bidirectional communication between ServiceNow and Elasticsearch. We spent most of our time in ServiceNow, but from here on, we will be working in Elasticsearch and Kibana. By the end of this post, you'll have these two powerful applications working together to make incident management a breeze. Or at least a lot easier than you may be used to!

As with all Elasticsearch projects, we will create indices with mappings that suit our needs. For this project, we need indices that can hold the following data:

  • ServiceNow incident updates: This stores all information coming from ServiceNow to Elasticsearch. This is the index that ServiceNow pushes updates to.
  • Application uptime summary for ease of use: This is going to store how many total hours each application has been online. Consider this an intermediate data state for ease of use.
  • Application incident summary: This is going to store how many incidents each application has had, each application’s uptime, and the current MTBF (mean time between failure) for each application.

The last two indices are helper indices so that we don’t have to have a whole bunch of complicated logic running every time we refresh the Canvas workpad we'll create in part 3. They are going to be updated continuously for us through the use of transforms.

Create the three indices

To create the indices, you'll use the below guidance. Note that if you don't use the same names as used below, you might need to make adjustments in your ServiceNow setup.

servicenow-incident-updates

Following best practices, we are going to set up an index template, index alias, and then an index lifecycle management (ILM) policy. We will also create an index template so that the same mapping is applied to any future index created by our ILM policy. Our ILM policy is going to create a new index once 50GB of data is stored within the index and then delete it after 1 year. An index alias is going to be used so we can easily point towards the new index when it’s created without updating our ServiceNow business rule. 

# Create the ILM policy 
PUT _ilm/policy/servicenow-incident-updates-policy 
{ 
  "policy": { 
    "phases": { 
      "hot": {                       
        "actions": { 
          "rollover": { 
            "max_size": "50GB" 
          } 
        } 
      }, 
      "delete": { 
        "min_age": "360d",            
        "actions": { 
          "delete": {}               
        } 
      } 
    } 
  } 
}
# Create the index template 
PUT _template/servicenow-incident-updates-template 
{ 
  "index_patterns": [ 
    "servicenow-incident-updates*" 
  ], 
  "settings": { 
    "number_of_shards": 1, 
    "index.lifecycle.name": "servicenow-incident-updates-policy",       
    "index.lifecycle.rollover_alias": "servicenow-incident-updates"     
  }, 
  "mappings": { 
    "properties": { 
      "@timestamp": { 
        "type": "date", 
        "format": "yyyy-MM-dd HH:mm:ss" 
      }, 
      "assignedTo": { 
        "type": "keyword" 
      }, 
      "description": { 
        "type": "text", 
        "fields": { 
          "keyword": { 
            "type": "keyword", 
            "ignore_above": 256 
          } 
        } 
      }, 
      "incidentID": { 
        "type": "keyword" 
      }, 
      "state": { 
        "type": "keyword" 
      }, 
      "app_name": { 
        "type": "keyword" 
      }, 
      "updatedDate": { 
        "type": "date", 
        "format": "yyyy-MM-dd HH:mm:ss" 
      }, 
      "workNotes": { 
        "type": "text" 
      } 
    } 
  } 
}
# Bootstrap the initial index and create the alias 
PUT servicenow-incident-updates-000001 
{ 
  "aliases": { 
    "servicenow-incident-updates": { 
      "is_write_index": true 
    } 
  } 
}

app_uptime_summary & app_incident_summary

As both of these indices are entity-centric, they do not need to have an ILM policy associated with them. This is because we will only ever have one document per application that we are monitoring. To create the indices, you issue the following commands:

PUT app_uptime_summary 
{ 
  "mappings": { 
    "properties": { 
      "hours_online": { 
        "type": "float" 
      }, 
      "app_name": { 
        "type": "keyword" 
      }, 
      "up_count": { 
        "type": "long" 
      }, 
      "last_updated": { 
        "type": "date" 
      } 
    } 
  } 
}
PUT app_incident_summary 
{ 
  "mappings": { 
    "properties" : { 
        "hours_online" : { 
          "type" : "float" 
        }, 
        "incident_count" : { 
          "type" : "integer" 
        }, 
        "app_name" : { 
           "type" : "keyword" 
        }, 
        "mtbf" : { 
          "type" : "float" 
        } 
      } 
  } 
}

Set up the two transforms

Transforms are an incredibly useful and recent addition to the Elastic Stack. They provide the capability to convert existing indices into an entity-centric summary, which is great for analytics and new insights. An often overlooked benefit of transforms is their performance benefits. For example, instead of trying to calculate the MTBF for each application by query and aggregation (which would get quite complicated), we can have a continuous transform calculating it for us on a cadence of our choice. For example, every minute! Without transforms they would be calculated once for every refresh that each person does on the Canvas workpad. Meaning, if we have 50 people using the workpad with a refresh interval of 30 seconds, we run the expensive query 100 times per minute (which seems a bit excessive). While this wouldn’t be an issue for Elasticsearch in most cases, I want to take advantage of this awesome new feature which makes life much easier.

We are going to create two transforms: 

  • calculate_uptime_hours_online_transform: Calculates the number of hours each application has been online and responsive. It does this by utilizing the uptime data from Heartbeat. It will store these results in the app_uptime_summary index. 
  • app_incident_summary_transform: combines the ServiceNow data with the uptime data coming from the previously mentioned transform (yes... sounds a bit like a join to me). This transform will take the uptime data and work out how many incidents each application has had, bring forward how many hours it has been online and finally calculate the MTBF based on those two metrics. The resulting index will be called app_incident_summary.

calculate_uptime_hours_online_transform

PUT _transform/calculate_uptime_hours_online_transform 
{ 
  "source": { 
    "index": [ 
      "heartbeat*" 
    ], 
    "query": { 
      "bool": { 
        "must": [ 
          { 
            "match_phrase": { 
              "monitor.status": "up" 
            } 
          } 
        ] 
      } 
    } 
  }, 
  "dest": { 
    "index": "app_uptime_summary" 
  }, 
  "sync": { 
    "time": { 
      "field": "@timestamp", 
      "delay": "60s" 
    } 
  }, 
  "pivot": { 
    "group_by": { 
      "app_name": { 
        "terms": { 
          "field": "monitor.name" 
        } 
      } 
    }, 
    "aggregations": { 
      "@timestamp": { 
        "max": { 
          "field": "@timestamp" 
        } 
      }, 
      "up_count": { 
        "value_count": { 
          "field": "monitor.status" 
        } 
      }, 
      "hours_online": { 
        "bucket_script": { 
          "buckets_path": { 
            "up_count": "up_count" 
          }, 
          "script": "(params.up_count * 60.0) / 3600.0" 
        } 
      } 
    } 
  }, 
  "description": "Calculate the hours online for each thing monitored by uptime" 
}

app_incident_summary_transform 

PUT _transform/app_incident_summary_transform 
{ 
  "source": { 
    "index": [ 
      "app_uptime_summary", 
      "servicenow*" 
    ] 
  }, 
  "pivot": { 
    "group_by": { 
      "app_name": { 
        "terms": { 
          "field": "app_name" 
        } 
      } 
    }, 
    "aggregations": { 
      "incident_count": { 
        "cardinality": { 
          "field": "incidentID" 
        } 
      }, 
      "hours_online": { 
        "max": { 
          "field": "hours_online", 
          "missing": 0 
        } 
      }, 
      "mtbf": { 
        "bucket_script": { 
          "buckets_path": { 
            "hours_online": "hours_online", 
            "incident_count": "incident_count" 
          }, 
          "script": "(float)params.hours_online / (float)params.incident_count" 
        } 
      } 
    } 
  }, 
  "description": "Calculates the MTBF for apps by using the output from the calculate_uptime_hours_online transform", 
  "dest": { 
    "index": "app_incident_summary" 
  }, 
  "sync": { 
    "time": { 
      "field": "@timestamp", 
      "delay": "1m" 
    } 
  } 
}

Let’s now ensure that both transforms are running:

POST _transform/calculate_uptime_hours_online_transform/_start 
POST _transform/app_incident_summary_transform/_start

Uptime alerts to create tickets in ServiceNow

Apart from making a pretty Canvas workpad, the final step to close the loop is to actually create a ticket in ServiceNow if one doesn't already exist for this outage. To do this, we are going to use Watcher to create an alert. This alert has the steps outlined below. For context, it runs every minute. You can see all the uptime fields in the Heartbeat documentation.

1. Check to see what applications have been down in the past 5 minutes

This is the simple bit. We are going to get all Heartbeat events which have been down (term filter) in the past 5 minutes (range filter) grouped by the monitor.name (terms aggregation). This monitor.name field is following Elastic Common Schema (ECS) and will be synonymous to the value in our application name field. This is all achieved through the down_check input in the watcher below.

2. Get the top twenty tickets for each application and get the latest update for each ticket

This trends the line of complexity. We are going to search across our ingested ServiceNow data which is automatically ingested due to our ServiceNow business rule. The existing_ticket_check input uses multiple aggregations. The first is to group all of the applications together through the “apps” terms aggregation. For each app we then group the ServiceNow ticket incident IDs together by using a terms aggregation called incidents. Finally, for each incident found for each application, we are going to get the latest state by using the top_hits aggregation ordered by the @timestamp field. 

3. Merge the two feeds together and see if any tickets need creating

To achieve this, we use script payload transform. In short, it checks to see what is down by iterating over the down_check output and then checks to see if that particular application has an open ticket. If there is not a ticket currently in progress, new or on hold, it adds the application to a list which is returned and passed onto the action phase. 

This payload transform does quite a few checks in the middle of this to catch the edge cases I have outlined below, such as creating a ticket if the app hasn’t had any incident history before. The output of this transform is an array of application names.

4. If new, create the ticket in ServiceNow

We use the webhook action to create the ticket in ServiceNow by using its REST API. To do this, it uses the foreach parameter to iterate over the application names from the above array and then runs the webook action for each of them. It will only do this if there is one or more applications that need a ticket. Please ensure that you set the correct credentials and endpoint for ServiceNow.

PUT _watcher/watch/e146d580-3de7-4a4c-a519-9598e47cbad1 
{ 
  "trigger": { 
    "schedule": { 
      "interval": "1m" 
    } 
  }, 
  "input": { 
    "chain": { 
      "inputs": [ 
        { 
          "down_check": { 
            "search": { 
              "request": { 
                "body": { 
                  "query": { 
                    "bool": { 
                      "filter": [ 
                        { 
                          "range": { 
                            "@timestamp": { 
                              "gte": "now-5m/m" 
                            } 
                          } 
                        }, 
                        { 
                          "term": { 
                            "monitor.status": "down" 
                          } 
                        } 
                      ] 
                    } 
                  }, 
                  "size": 0, 
                  "aggs": { 
                    "apps": { 
                      "terms": { 
                        "field": "monitor.name", 
                        "size": 100 
                      } 
                    } 
                  } 
                }, 
                "indices": [ 
                  "heartbeat-*" 
                ], 
                "rest_total_hits_as_int": true, 
                "search_type": "query_then_fetch" 
              } 
            } 
          } 
        }, 
        { 
          "existing_ticket_check": { 
            "search": { 
              "request": { 
                "body": { 
                  "aggs": { 
                    "apps": { 
                      "aggs": { 
                        "incidents": { 
                          "aggs": { 
                            "state": { 
                              "top_hits": { 
                                "_source": "state", 
                                "size": 1, 
                                "sort": [ 
                                  { 
                                    "@timestamp": { 
                                      "order": "desc" 
                                    } 
                                  } 
                                ] 
                              } 
                            } 
                          }, 
                          "terms": { 
                            "field": "incidentID", 
                            "order": { 
                              "_key": "desc" 
                            }, 
                            "size": 1 
                          } 
                        } 
                      }, 
                      "terms": { 
                        "field": "app_name", 
                        "size": 100 
                      } 
                    } 
                  }, 
                  "size": 0 
                }, 
                "indices": [ 
                  "servicenow*" 
                ], 
                "rest_total_hits_as_int": true, 
                "search_type": "query_then_fetch" 
              } 
            } 
          } 
        } 
      ] 
    } 
  }, 
  "transform": { 
   "script": """ 
      List appsNeedingTicket = new ArrayList();  
      for (app_heartbeat in ctx.payload.down_check.aggregations.apps.buckets) { 
        boolean appFound = false;  
        List appsWithTickets = ctx.payload.existing_ticket_check.aggregations.apps.buckets;  
        if (appsWithTickets.size() == 0) {  
          appsNeedingTicket.add(app_heartbeat.key);  
          continue;  
        }  
        for (app in appsWithTickets) {   
          boolean needsTicket = false;  
          if (app.key == app_heartbeat.key) {  
            appFound = true;  
            for (incident in app.incidents.buckets) {  
              String state = incident.state.hits.hits[0]._source.state;  
              if (state == 'Resolved' || state == 'Closed' || state == 'Canceled') {  
                appsNeedingTicket.add(app.key);  
              }  
            }  
          }  
        }  
        if (appFound == false) {  
          appsNeedingTicket.add(app_heartbeat.key);  
        }  
      }  
      return appsNeedingTicket;  
      """ 
  }, 
  "actions": { 
    "submit_servicenow_ticket": { 
      "condition": { 
        "script": { 
          "source": "return ctx.payload._value.size() > 0" 
        } 
      }, 
      "foreach": "ctx.payload._value", 
      "max_iterations": 500, 
      "webhook": { 
        "scheme": "https", 
        "host": "dev94721.service-now.com", 
        "port": 443, 
        "method": "post", 
        "path": "/api/now/table/incident", 
        "params": {}, 
        "headers": { 
          "Accept": "application/json", 
          "Content-Type": "application/json" 
        }, 
        "auth": { 
          "basic": { 
            "username": "admin", 
            "password": "REDACTED" 
          } 
        }, 
       "body": "{'description':'{{ctx.payload._value}} Offline','short_description':'{{ctx.payload._value}} Offline','caller_id': 'elastic_watcher','impact':'1','urgency':'1', 'u_application':'{{ctx.payload._value}}'}", 
        "read_timeout_millis": 30000 
      } 
    } 
  }, 
  "metadata": { 
    "xpack": { 
      "type": "json" 
    }, 
    "name": "ApplicationDowntime Watcher" 
  } 
}

Conclusion

This is where we conclude the second installment of this project. In this section we have created some general configuration for Elasticsearch so that the indices and ILM policy exist. Within this section we also created the Transforms which calculate the uptime and MTBF for each application as well as the Watcher which monitors our uptime data. The key thing to take note here is that, if the Watcher notices that something goes down it first checks if a ticket exists and if not creates one in ServiceNow.

Interested in following along? The easiest way to do that is to use Elastic Cloud. Either log into the Elastic Cloud console or sign up for a free 14-day trial. You can follow the above steps with your existing ServiceNow instance or spin up a personal developer instance.

Also, if you’re looking to search over ServiceNow data along with other sources such as GitHub, Google Drive, and more, Elastic Workplace Search has a prebuilt ServiceNow connector. Workplace Search provides a unified search experience for your teams, with relevant results across all your content sources. It’s also included in your Elastic Cloud trial.

At this point, everything is functional! However, not what I would call visually appealing. So, in the third and final section of this project we will look at how you can use Canvas to create a great looking front-end to represent all of this data, as well as calculate the other metrics mentioned such as MTTA, MTTR and more. See you there!