March 13, 2014

Using Kibana for Business Intelligence

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

In this article we will set up a simple, easy-to-follow example of how to use Elasticsearch and Kibana for basic business intelligence. In our demo we will use real-world Twitter data, which we'll feed into Elasticsearch and then inspect and analyze using the Kibana dashboard.

Introduction

In my view, Elasticsearch is the best search engine out there today, and its adoption has been remarkably fast. Kibana has likewise become the preferred tool to search, graph, inspect, visualize and analyze the data in your Elasticsearch clusters. Kibana is primarily used as a frontend to Logstash to visualize and analyze log data so you can follow trends or detect and inspect incidents. However, Kibana has the power to do so much more than that.

In this article I will show you how to set up a small Elasticsearch cluster and configure Kibana to let you inspect and analyze real-world Twitter data. I am not a programmer myself, so I must rely on another tool to get the data from Twitter into Elasticsearch. For this task I will be using Zapier, an online service that lets you connect and move data between apps that don't speak to each other natively. For the Elasticsearch cluster I will be using Found's hosted Elasticsearch service. If you run your own Elasticsearch cluster, you may use it for this proof of concept instead.

Setting up Elasticsearch

Create Your First Elasticsearch Cluster

Go to found.elastic.co and create an account. When you have created the account, log in to the Found admin console and create an Elasticsearch cluster by choosing New Cluster. With the 14-day free trial, you get an Elasticsearch cluster with 256MB reserved memory and 2GB reserved disk space. This should be sufficient for the example in this article.

Choose the region in which you want your cluster to run, then choose the Elasticsearch version. I recommend using the latest stable version. At the time of writing, this is version 1.0.1. Enable the Kibana plugin, and give your cluster a descriptive name, for example Twitter BI PoC.

Once your Elasticsearch cluster is created, note the endpoints given on the next page. You can later find the endpoints by going to the cluster's Overview page. See the figure below. You use these endpoints whenever you need to communicate with your Elasticsearch cluster. We always recommend using the HTTPS endpoint, as traffic to and from your cluster will then be encrypted.

The Found admin console overview page.

To check that your cluster has been successfully created, go to the Elasticsearch Head dashboard. You should now see the cluster with the name “instance-0000000000X”. If the header shows the text "cluster health: green" (or "yellow"), everything should be OK.

The Elasticsearch Head dashboard.
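If you prefer a terminal to the Head dashboard, the same check can be sketched with curl and Elasticsearch's cluster health API. This is only a minimal sketch: the ES_ENDPOINT environment variable and the health_url helper are names I made up for illustration, and you need to set ES_ENDPOINT to your own HTTPS endpoint.

```shell
#!/bin/bash
# Assumption: ES_ENDPOINT holds your cluster's HTTPS endpoint, for example
# exported in your terminal before you run this script.

# Build the URL of the cluster health API for a given endpoint.
health_url() {
  echo "${1%/}/_cluster/health?pretty"
}

if [ -n "${ES_ENDPOINT:-}" ]; then
  # The JSON reply contains a "status" field: green, yellow or red.
  curl -s "$(health_url "$ES_ENDPOINT")"
else
  echo "Set ES_ENDPOINT to your cluster endpoint first." >&2
fi
```

A status of green or yellow in the reply corresponds to what the Head dashboard shows in its header.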

Create the Index

The next thing we need to do is to create the index where we store our Twitter data. Run the following script from a terminal window to create an index with the appropriate mapping.

Note: Remember to swap the endpoint URL in this script with the endpoint URL of your cluster. Also note that the index is called tweets. We will use this index name later.

#!/bin/bash

curl -XPUT "https://4ae884a4b3f9cef8988b4fae82274701-eu-west-1.foundcluster.com:9243/tweets" -d '{
  "settings": {
    "analysis": {},
    "mapping.ignore_malformed": true
  },
  "mappings": {
    "elasticsearch": {
      "dynamic_templates": [
        {
          "created_at": {
            "match": "created_at",
            "mapping": {
              "type": "date",
              "format": "EEE MMM dd HH:mm:ss Z yyyy"
            }
          }
        },
        {
          "text": {
            "match": "text",
            "mapping": {
              "type": "string",
              "analyzer": "simple"
            }
          }
        },
        {
          "strings": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        {
          "counts": {
            "match": "*count",
            "mapping": {
              "type": "long"
            }
          }
        },
        {
          "geo": {
            "match": "geo",
            "mapping": {
              "type": "object",
              "enabled": false
            }
          }
        },
        {
          "coordinates": {
            "match": "coordinates",
            "mapping": {
              "type": "object",
              "enabled": false
            }
          }
        },
        {
          "place": {
            "match": "place",
            "mapping": {
              "type": "object",
              "enabled": false
            }
          }
        }
      ]
    }
  }
}'

For the sake of this demo, you don't need to understand how this script works. It is, however, important to specify these mappings in order to populate your Elasticsearch cluster with Twitter data. If you need an introduction to mappings, see Njål Karevoll's article An Introduction to Elasticsearch Mapping.
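If you are curious about one detail, the date format is a good place to start. Twitter timestamps look like Thu Mar 13 10:15:00 +0000 2014 (an example value I made up), and the pattern EEE MMM dd HH:mm:ss Z yyyy in the mapping describes exactly that layout. As a rough illustration, the date command can print the current time in the same layout; the helper name twitter_now is my own.

```shell
#!/bin/bash
# Print the current UTC time in the layout the created_at mapping expects:
#   EEE -> %a (day name)    MMM -> %b (month name)   dd   -> %d
#   HH:mm:ss -> %H:%M:%S    Z   -> %z (UTC offset)   yyyy -> %Y
twitter_now() {
  # LC_ALL=C forces English day and month names regardless of your locale.
  LC_ALL=C date -u +"%a %b %d %H:%M:%S %z %Y"
}

twitter_now
```

A timestamp that does not follow this layout would not parse as a date under this mapping.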

After you have run the script, go to the Head dashboard again and verify that an index with the name tweets has been created. The number of docs should be 0.
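The document count can also be checked from the terminal with Elasticsearch's _count API. This is a sketch under the same assumption as the script above, namely that you substitute your own endpoint; ES_ENDPOINT and count_docs are illustrative names, not part of Elasticsearch.

```shell
#!/bin/bash
# Assumption: ES_ENDPOINT holds your cluster's HTTPS endpoint.

# Ask Elasticsearch how many documents an index holds (the _count API).
count_docs() {
  curl -s "${ES_ENDPOINT%/}/$1/_count?pretty"
}

if [ -n "${ES_ENDPOINT:-}" ]; then
  # For a freshly created index the reply should report a count of 0.
  count_docs tweets
else
  echo "Set ES_ENDPOINT to your cluster endpoint first." >&2
fi
```

If the reply is an error instead of a count, the index was most likely not created.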

Zapier

We will use the online service called Zapier to transfer data from Twitter to your Elasticsearch cluster. Zapier is basically glue between two services that both have a public API. Zapier also speaks the web service protocol (HTTP/HTTPS), which is exactly the same protocol a web browser uses to communicate with a web server. We can therefore use Zapier to transfer data to or from any service that speaks HTTP/HTTPS, which is the case for Elasticsearch.

A link between two services is called a Zap. The source service is called the Trigger App and the target service is called the Action App. In our example we will create a Zap between Twitter as the Trigger App and Elasticsearch as the Action App.

If you do not already have a Zapier account, go there now and create one. Zapier will collect data every 15 minutes or 5 minutes, depending on the plan. With the free plan you get 100 tasks, which means you will only be able to try out Zapier for a day or so. If you want to use Zapier for any serious production use or need to collect data over an extended period of time, you will need a paid plan.

Setting up Your Zap

After creating a Zapier account, click the Make a Zap! link at the top of the Zapier web page.

Twitter is already registered as an app in Zapier, and is listed in the App Directory. Elasticsearch is not listed as an app in Zapier, but because Elasticsearch speaks HTTP, we can communicate with it by using the Web Hook feature. We can send data to and retrieve data from Elasticsearch using HTTP/HTTPS requests. The most common request methods are GET, POST, PUT and DELETE. We will use POST to feed data to our Elasticsearch cluster.

1. Choose a Trigger and Action

Select Twitter as the Trigger App, then select Search Mention as the Trigger.

Select Web Hook as the Action app, and POST as the Action.

Zapier Trigger and Action apps.

2. Select a Twitter Account

Specify a valid Twitter account.

3. Select a Web Hook Account

Nothing to be done here.

4. Filter Twitter Triggers

Choose your Search Term. We will use the word elasticsearch as our Search Term for this article.

5. Match up Twitter Search Mention With Web Hook POST

We only need to fill in the first two fields here.

The URL is the same as the endpoint you find on the Found admin console Overview page, with the index name and a type appended. In our example, the type needs to be elasticsearch:

https://4ae884a4b3f9cef8988b4fae82274701-eu-west-1.foundcluster.com:9243/tweets/elasticsearch

The payload type is always json. Do not confuse the payload type with the type used in the URL above. They are two different things.

URL and Payload type.
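To get a feel for what such a POST looks like, here is a minimal sketch that sends one document to the same URL with curl. The tweet is made up, and I am only assuming the field names that appear elsewhere in this article; the actual payload Zapier sends may contain many more fields. ES_ENDPOINT, sample_tweet and post_tweet are illustrative names.

```shell
#!/bin/bash
# Assumption: ES_ENDPOINT holds your cluster's HTTPS endpoint.
# The tweet below is made up; only the field names come from this article.
sample_tweet() {
  cat <<'EOF'
{
  "created_at": "Thu Mar 13 10:15:00 +0000 2014",
  "text": "Trying out elasticsearch and Kibana",
  "lang": "en",
  "retweet_count": 0,
  "user": { "name": "Example User", "screen_name": "example_user" }
}
EOF
}

# POST to <endpoint>/<index>/<type>; Elasticsearch assigns the document ID.
post_tweet() {
  sample_tweet | curl -s -XPOST "${ES_ENDPOINT%/}/tweets/elasticsearch" \
    -H "Content-Type: application/json" -d @-
}

if [ -n "${ES_ENDPOINT:-}" ]; then
  post_tweet
else
  echo "Set ES_ENDPOINT to your cluster endpoint first." >&2
fi
```

Keep in mind that running this adds a fake tweet to the index, so the document counts you see later will be one higher than those mentioned in this article.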

A note on type:

The type used in the URL above is a user-defined name and should reflect the type of data (tweets in this case) that you store in the index. We have chosen to call the type elasticsearch because we store tweets mentioning Elasticsearch. If you later choose to store tweets using different Search Terms and a different type name in the same index, you can use the type to filter your results in Kibana. You can also use different mappings for different types. In the mapping that we created at the beginning of this article, we specified elasticsearch as the type in the mappings section:

"mappings": {
  "elasticsearch": {
    "dynamic_templates": [

Consequently, if you use a different name for the type, e.g. cocacola, you'll need to replace elasticsearch with cocacola in the script. If you at this point decide to use a different Search Term and type name for your own proof of concept, simply delete the index from the Head dashboard and re-run the script with the new type name in the mapping.

6. Test This Zap

Test the Zap with the three samples, and check that everything returns successfully.

At this point the Zap has pushed some data into your Elasticsearch cluster. If you go to the Head dashboard, you should now see that there are 3 documents in the tweets index.

Elasticsearch Head with the 3 created documents.

7. Name and Turn This Zap On

Give the Zap a descriptive name and turn it on.

Name and turn the Zap on.

Let the Zap run for a few days to get plenty of tweets into your Elasticsearch cluster.

Setting up and Configuring Kibana

When we created the Elasticsearch cluster at the beginning of this article, we enabled the Kibana plugin right away. You can access your Kibana dashboard using your web browser:

https://4ae884a4b3f9cef8988b4fae82274701-eu-west-1.foundcluster.com:9243/_plugin/kibana/

Remember to replace the cluster URL with your own endpoint.

The first time you visit this page, you get the Welcome to Kibana screen. Click on the Sample Dashboard link to create a new Kibana dashboard.

The Basics of a Kibana Dashboard

A Kibana dashboard is made up of one or more rows, each containing one or more panels. Note that a row has a total span of 12, which the panels in that row share.

The header of the dashboard contains the name of the dashboard and some global menu items:

The Kibana dashboard header.

Saving and Loading Dashboards

You can save your dashboards and load them later by using the save and load buttons in the header. Go ahead and save this dashboard using a descriptive name. Every time you make a modification that you are happy with, you should save the dashboard so you don't lose your changes. You can also set a dashboard as your default dashboard by choosing Save -> Advanced -> Save as home.

Note: Due to a bug in the Kibana dashboard JavaScript, you might need to select a Time filter when loading a saved dashboard. Otherwise, the dashboard might load forever, never displaying any data.

Adding and Removing Rows

You can add, remove and reorder rows from the Configure dashboard menu under the Rows tab. Create a row called Tweet history and place it at the top of your dashboard. Remember to click the save button.

Configuring the Timepicker

In many cases, we want to inspect or analyze data for a given period of time. In order to do this, you'll have to tell Kibana which data field contains the timestamp. You do this from the Timepicker tab under the Configure dashboard menu.

Configure the Timepicker to use the created_at field in your tweets:

The Kibana Timepicker.

We are finally ready to start inspecting our Twitter data!

Using Kibana

Create a Histogram of All Tweets

In the empty row we created above, click the Add panel to empty row link. Select the histogram panel type and choose a Span of 12. A span of 12 will fill the entire width of the dashboard. Type in the text created_at in the Time Field box. Click save, and voilà! You have a nice histogram showing tweets over time.

Note: The default interval on the histogram is 1 year, so when you create the histogram and when you load the dashboard you will only see a solid bar in the histogram. To change this, click the View link in the histogram and choose an interval of 1 hour.

The tweet histogram.
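Conceptually, the histogram counts tweets per time interval of the created_at field. As a rough sketch of that idea, and not necessarily the request Kibana itself sends, the aggregation API introduced in Elasticsearch 1.0 can compute the same counts. The names ES_ENDPOINT, histogram_body and tweets_over_time are my own.

```shell
#!/bin/bash
# Assumption: ES_ENDPOINT holds your cluster's HTTPS endpoint.

# Request body: return no documents ("size": 0), only the number of tweets
# per interval of the created_at field. Argument: the interval, e.g. 1h.
histogram_body() {
  cat <<EOF
{
  "size": 0,
  "aggs": {
    "tweets_over_time": {
      "date_histogram": { "field": "created_at", "interval": "$1" }
    }
  }
}
EOF
}

if [ -n "${ES_ENDPOINT:-}" ]; then
  histogram_body 1h | curl -s -XPOST "${ES_ENDPOINT%/}/tweets/_search?pretty" \
    -H "Content-Type: application/json" -d @-
else
  histogram_body 1h   # without an endpoint, just show the request body
fi
```

Changing the argument from 1h to 1d corresponds to picking a coarser interval from the View link in the panel.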

Explore the histogram by choosing a different Time filter from the Kibana dashboard's header. You can also narrow the time selection by marking the desired time period with your mouse cursor.

Filters

Every time you narrow the selection of data, either using the histogram or by selecting a subset of data using other panels, a filter is created. You can see all filters by clicking on the Filtering button, just below the Query bar.

Often you need to remove some or all of these filters to display the desired data in Kibana. Personally, I like to leave the Filtering section open while working with Kibana.

Kibana filtering.

Filters can be toggled on or off using the checkbox, and some filters can even be edited directly in the Filtering section. Time filters cannot be edited.

Remove all filters.

Twitter Users

Now we are going to create a panel that shows some statistics on Twitter users that have tweeted about Elasticsearch lately.

Change the Row Height

First, change the height of row number two to 200px by choosing the Configure row button on the left side of the row. Also remove both panels from this row.

Configure a row in Kibana.

Create the Twitter Users Panel

Choose Add panel, then select the terms panel type. Name the panel Twitter Users and set Span to 6. Under Parameters change the Field value to user.screen_name and Length to 25. Uncheck Other under View options and click save.

Now you have a panel showing the 25 most active Elasticsearch tweeters for the selected time period.

The Twitter users panel.
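The idea behind a terms panel is "the most frequent values of a field". A minimal sketch of an equivalent request body, using the terms aggregation in Elasticsearch 1.x, could look like this. The helper top_terms_body and the aggregation name top_terms are my own, and Kibana may build its request differently.

```shell
#!/bin/bash
# Request body for "top N values of a field", the idea behind a terms panel.
# Arguments: field name, number of terms to return.
top_terms_body() {
  printf '{"size": 0, "aggs": {"top_terms": {"terms": {"field": "%s", "size": %d}}}}\n' "$1" "$2"
}

# Same parameters as the Twitter Users panel: field and length.
top_terms_body user.screen_name 25
```

You could POST this body to the _search URL of the tweets index to get the 25 most frequent screen names together with their tweet counts.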

Have a go at changing the Time filter and see how the Twitter Users chart changes. Also note how the Filtering section changes. With the two panels we have created, you can see who has been tweeting most during the previous week or the previous day.

Remove all filters from the Filtering section.

We see that johtani is by far the most active tweeter in the Elasticsearch community. Try clicking on johtani's bar in the Twitter Users panel. Now look at the Tweet History panel: you can see when johtani tweeted about Elasticsearch. Also note the new filter that was created.

Remove this filter.
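A filter like the one you just removed restricts the data to a single value of a field. As a sketch of that idea, not the exact request Kibana sends, a search body that keeps only one user's tweets could look like the following in Elasticsearch 1.x. The exact match works because our mapping stores string fields such as user.screen_name as not_analyzed. The helper name user_filter_body is my own.

```shell
#!/bin/bash
# Request body that keeps only tweets from one user. "filtered" is the
# Elasticsearch 1.x query syntax; newer versions use a "bool" query instead.
# Argument: the screen name to match exactly.
user_filter_body() {
  cat <<EOF
{
  "query": {
    "filtered": {
      "filter": { "term": { "user.screen_name": "$1" } }
    }
  }
}
EOF
}

user_filter_body johtani
```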

Inspecting Tweets

Take a look at the panel in the bottom row. This panel contains all the tweets that match your filters at any given time. If you have removed all filters, this panel will show every tweet in your Elasticsearch cluster. In the left column, you see all the Fields that a tweet may contain. Not all tweets contain all Fields. In the right column, you see the content of the tweets. If you click on a tweet you get a tabular overview of all Fields and corresponding Values for that tweet.

By checking fields in the left column, you can toggle which values are displayed in the right column. Now check the following fields: created_at, lang, retweet_count, text, user.name and user.screen_name. If you wish, you can change the order of the fields at the top of the right column.

Now go to the Twitter Users panel and click on the user imarshut's bar, then use the mouse cursor and highlight a short period where he tweeted about Elasticsearch. Now you can go to the bottom panel and look at imarshut's tweets during this period.

Retweet Characteristics

We'll finish off by creating one last panel that will enable us to look at the retweet characteristics of Elasticsearch related tweets.

In row two, add a panel of type terms, with a span of 6. Call this panel Retweet Count. Type retweet_count in the Field box, set the Length to 20 and set Order to reverse_term. Then select table as the style and remove the checkmarks from Missing and Other.

The Terms panel type settings.

You will now get a panel that looks something like this:

The Retweet Count panel.

The Term here is the same as the retweet_count field in a tweet. Accordingly, the Count column shows the number of tweets that have been retweeted the number of times displayed in the Term column. For instance, we see that 1 tweet has been retweeted 346 times and 3 tweets have been retweeted 41 times. Click on the magnifying glass on the first line and see which tweet this is.
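To make the reading of the table concrete, here is a small worked example using only the two rows mentioned above (346 with a count of 1, and 41 with a count of 3). Multiplying each Term by its Count and adding the results gives the total number of retweets those rows represent. The helper name summarize_retweets is my own.

```shell
#!/bin/bash
# Each input line is "Term Count": the retweet count, then how many tweets
# have it. The two rows are the ones quoted in the text above.
summarize_retweets() {
  awk '{ tweets += $2; retweets += $1 * $2 }
       END { printf "%d tweets, %d retweets in total\n", tweets, retweets }'
}

printf '346 1\n41 3\n' | summarize_retweets
# prints: 4 tweets, 469 retweets in total
```

That is 346 x 1 + 41 x 3 = 469 retweets spread over 4 tweets.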

Remove the filter. Then click on a specific user in the Twitter Users panel and see how much this user has been retweeted. For one user, the retweet statistics would look something like this:

Retweet statistics for the user.

One tweet has been retweeted 20 times, and 19 tweets have not been retweeted at all in the given period.

A Note on Elasticsearch and Security

In this article we have been very lax about the security of Elasticsearch and the data that we have pushed to Elasticsearch. This is not an issue when working with Twitter, as Twitter data is by and large public information and everyone has access to all data that Twitter exposes.

If you want to restrict access to the data you feed into Elasticsearch, you need to take some precautions. First of all you should always use the HTTPS endpoint so that all data to and from your cluster is encrypted. Second, you should configure your cluster to require a username and password, and possibly restrict access to certain hosts only. These settings can be configured from the Access Control page in the Found admin console.
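Once a username and password are required, every request has to carry them. With curl this is done with the -u option, which sends HTTP basic authentication credentials. The sketch below assumes that is the scheme your cluster is configured for; ES_ENDPOINT, ES_USER, ES_PASSWORD and es_get are placeholder names of my own, not values from the admin console.

```shell
#!/bin/bash
# Assumption: ES_ENDPOINT, ES_USER and ES_PASSWORD are set in your terminal
# to your HTTPS endpoint and the credentials you configured.

# GET a path on the cluster with basic auth. Because the endpoint is HTTPS,
# the credentials are encrypted in transit.
es_get() {
  curl -s -u "${ES_USER}:${ES_PASSWORD}" "${ES_ENDPOINT%/}/$1"
}

if [ -n "${ES_ENDPOINT:-}" ] && [ -n "${ES_USER:-}" ]; then
  es_get "tweets/_count?pretty"
else
  echo "Set ES_ENDPOINT, ES_USER and ES_PASSWORD to query a protected cluster." >&2
fi
```

Keeping the password in an environment variable rather than in the script itself avoids storing it in a file you might share.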

If you are running Elasticsearch in production, Alex Brasetvik's article on how to secure your Elasticsearch cluster is mandatory reading: Securing Your Elasticsearch Cluster. I also recommend that you read his article on how to set up and maintain your Elasticsearch clusters for production environments: Elasticsearch in Production.

Conclusion

Kibana is a really simple yet powerful tool that allows us to inspect and analyze Elasticsearch data in a myriad of ways. We have only briefly explored some of the capabilities of Kibana, but we already have a practical tool to keep track of and do basic analysis of Twitter data. If you follow the examples in this article and explore Kibana on your own, you should have sufficient knowledge to set up useful dashboards for your own apps. Zapier is also a very handy tool if you find yourself in a situation where you need to transfer data from one app or service to another. Go to the Zapier App Directory, and have a look at all the other apps Zapier supports. Happy Zapping!