
Using Elastic Graph to Analyze #Trump Data

In this blog post, I will discuss how I've used Graph, a new Elastic X-Pack plugin for Elasticsearch and Kibana, to explore relationships in Twitter data about Donald Trump's political campaign.

Twitter is a great source of free data to play with in Graph, and Logstash provides a Twitter input as standard. The Logstash Twitter input constantly watches the feed for a specific search term; in my example, I used the word “Trump”. Leaving Logstash running on my laptop for just an hour gave me a collection of thousands of tweets for Graph to analyse.

A useful insight you might want from Twitter is which hashtags are popular and which ones are frequently used together. For a business, this insight can help maximise your social media presence. Looking at frequently used hashtags is trivial with Kibana, but what about hashtags frequently used together? What are the people using these hashtags talking about? This is where Graph comes in. Searching for “#Trump” in the Kibana Graph plugin instantly produces a graph of interconnected hashtags.

Trump-hastags-graph-dashboard.png

Straight away, Graph shows us popular hashtags associated with the Trump presidential campaign and links those that are frequently seen together. In contrast, searching the same dataset for the hashtag “#Cruz” gives a completely different impression.

Cruz-hashtags-graph-dashboard.png
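
To make the idea of hashtags "frequently used together" concrete, here is a minimal Python sketch that counts how often pairs of hashtags share a tweet. It is only an illustration of raw co-occurrence counting, not Graph's own algorithm, and the function name and sample tweets are invented for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurring_hashtags(tweets):
    """Count how often each unordered pair of hashtags appears in the same tweet.

    `tweets` is an iterable of hashtag lists, one list per tweet.
    """
    pair_counts = Counter()
    for hashtags in tweets:
        # Normalise case and ignore duplicates within a single tweet.
        unique = sorted({tag.lower() for tag in hashtags})
        pair_counts.update(combinations(unique, 2))
    return pair_counts


# Invented sample data, for illustration only.
sample_tweets = [
    ["Trump", "MAGA"],
    ["trump", "maga", "GOP"],
    ["Cruz", "GOP"],
]
pairs = cooccurring_hashtags(sample_tweets)
strongest_pair, times_seen = pairs.most_common(1)[0]
```

Each pair and its count correspond loosely to a line between two hashtags in the graph and the strength of that link.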

Using the advanced features of Graph, we can assess not only links between hashtags and other hashtags but also links between hashtags and other data fields. In my case, a good example of this is to categorise America by time zone. Splitting our sample into three groups, we can see which hashtags are most commonly associated with each of these time zones.

timezones-graph-dashboard.png

The data fields provided by Logstash have vast potential for gaining insight into topics that may be very difficult to understand otherwise. Just as I have studied hashtags here, we can also look at the raw content of the tweets and commonly associated words, as well as the words Twitter users use to describe themselves!

political-views-graph-dashboard.png

However, one place where the standard Logstash Twitter feed falls short is Twitter followers. What if I want to know which users are popular within a subject, and who else their followers tend to follow? Fortunately, Logstash can read in pretty much any file, so it’s pretty easy to use the Twitter API to bring down the data we’re interested in and pass it to Logstash.
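
As a rough sketch of the "pass it to Logstash" step, the Python below writes follower data out as one JSON document per line, a layout that Logstash's file input can read with a JSON codec. The call to the Twitter API is deliberately left out; the `following` dictionary, the field names `follower` and `follows`, and both function names are assumptions made for this example rather than what I actually ran.

```python
import json


def follower_docs(following):
    """Yield one JSON string per follower, listing the accounts they follow.

    `following` maps a follower's screen name to the set of accounts
    that user follows (hypothetical structure, filled from the Twitter API).
    """
    for follower, accounts in sorted(following.items()):
        yield json.dumps({"follower": follower, "follows": sorted(accounts)})


def write_ndjson(following, path):
    """Write the documents to `path`, one per line, for Logstash to pick up."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in follower_docs(following):
            handle.write(line + "\n")


# Invented sample data, for illustration only.
sample_following = {
    "user_two": {"account_b", "account_a"},
    "user_one": {"account_a"},
}
lines = list(follower_docs(sample_following))
```

With one document per follower, a Graph search on the `follows` field can link accounts that tend to share followers.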

twitter-followers-graph-dashboard.png

Straight away, we can build up a graph showing common following trends, where the thickness of the lines shows the strength of the links. Clearly, Graph is a powerful tool for identifying key, associated and similar users quickly and easily.

A subtle issue that can come about in graph analysis is results being dominated by repeated data from one source. For example, if one user tweets one hundred times more than another, then their description will be weighted one hundred times more strongly. This is what happens when I search for the term “republican” in user descriptions:

republican-significant-graph-dashboard.PNG

At first glance the results seem good, but a couple of them look a bit strange. Are all Republicans cowgirls? I think it is more likely that there is one Republican cowgirl who tweets very frequently. Fortunately, Graph has a solution to this issue: using the diversity field setting, one can limit the consideration given to content with the same value of a certain field. In our case it makes sense to limit this to one tweet per user ID. Repeating the analysis with this new setting gives a very different picture:

republican-diversity-graph-dashboard.PNG
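
The effect of the diversity setting can be imitated with a few lines of Python. This is a simplified sketch of the idea (keep at most a fixed number of documents per value of a chosen field), not the way Graph implements it internally; the function name, the `user_id` field and the sample tweets are assumptions for illustration.

```python
from collections import defaultdict


def diversify(docs, field, max_docs_per_value=1):
    """Keep at most `max_docs_per_value` documents for each value of `field`."""
    seen = defaultdict(int)
    kept = []
    for doc in docs:
        key = doc[field]
        if seen[key] < max_docs_per_value:
            seen[key] += 1
            kept.append(doc)
    return kept


# Invented sample data: one prolific user and one occasional user.
sample = [
    {"user_id": "a", "description": "republican cowgirl"},
    {"user_id": "a", "description": "republican cowgirl"},
    {"user_id": "a", "description": "republican cowgirl"},
    {"user_id": "b", "description": "republican voter"},
]
one_per_user = diversify(sample, "user_id")
```

After filtering, each user contributes a single description, so a frequent tweeter no longer dominates the terms associated with “republican”.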

Another key advantage of the Graph algorithm is its ability to filter out popular “super-connected” terms and identify only “significant links”. This feature can be disabled if needed. For example, repeating the search for “republican” without significant links, Graph pulls back loads of connecting words: “and”, “of”, “the” etc. These words do appear frequently with the word “republican”; however, they also appear frequently in descriptions without the word “republican” and hence are not reflective of descriptions featuring the word.

republican-diversity-non-significant-graph dashboard.png
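
To see why common words drop out when significance is switched on, it helps to compare how often a term appears in the matching descriptions with how often it appears in all descriptions. The sketch below uses a JLH-style score (absolute change multiplied by relative change), which is one of the heuristics Elasticsearch offers for significant terms; whether Graph uses exactly this formula is an assumption on my part, and all of the counts are invented.

```python
def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """Score a term by how much more common it is in the foreground set.

    fg_* describe the documents matching the search (e.g. "republican"),
    bg_* describe the whole index.
    """
    fg_rate = fg_count / fg_total
    bg_rate = bg_count / bg_total
    if bg_rate == 0 or fg_rate <= bg_rate:
        # Terms that are no more common than usual are not significant.
        return 0.0
    return (fg_rate - bg_rate) * (fg_rate / bg_rate)


# Invented counts: "the" is equally common everywhere, "conservative" is not.
score_the = jlh_score(fg_count=80, fg_total=100, bg_count=8000, bg_total=10000)
score_conservative = jlh_score(fg_count=30, fg_total=100, bg_count=300, bg_total=10000)
```

Here “the” scores zero despite appearing in most matching descriptions, while the rarer but more distinctive term scores well above it.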

All the relationships displayed in the Kibana app are the result of querying the Elasticsearch Graph API. The app allows you to view the raw JSON, making it easy to integrate Graph into a custom application.
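
For a custom application, the request body can be assembled in code. The Python sketch below builds a body in the shape the Graph explore API accepts (`query`, `controls`, `vertices`, `connections`), including the significance and diversity controls discussed above. The field names `user.description` and `user.id_str`, the sample size and the helper name are assumptions for illustration; the endpoint path varies between Elasticsearch versions, so sending the request is left out.

```python
import json


def build_explore_request(query_text, field, diversity_field=None,
                          use_significance=True, size=10):
    """Build a Graph explore request body as a plain dictionary."""
    controls = {"use_significance": use_significance, "sample_size": 2000}
    if diversity_field:
        # Limit the sample to one document per value, e.g. one tweet per user.
        controls["sample_diversity"] = {
            "field": diversity_field,
            "max_docs_per_value": 1,
        }
    return {
        "query": {"match": {field: query_text}},
        "controls": controls,
        "vertices": [{"field": field, "size": size}],
        "connections": {"vertices": [{"field": field, "size": size}]},
    }


# Hypothetical field names; adjust them to match your own index mapping.
body = build_explore_request(
    "republican", "user.description", diversity_field="user.id_str"
)
payload = json.dumps(body, indent=2)
```

Passing `use_significance=False` gives the equivalent of turning off significant links in the Kibana app.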

To get an even better idea of the power of Graph, watch this webinar on the Elastic site. The video demonstrates the features of Graph as applied to other datasets such as Wikipedia and LastFM data.

This blog post appeared last month on the blog of our partner, Aiimi.


Jack_headshot.jpg

Jack Lawton is a Technical Consultant at Aiimi, an Elastic partner specialising in information management consultancy. After studying Physics at Manchester, Jack is now exploring cutting-edge data science technology to evolve Aiimi’s services and meet rising demands.