Data Visualization with ElasticSearch and Protovis

The primary purpose of a search engine is, quite unsurprisingly, searching. You pass it a query, and it returns a bunch of matching documents, in order of relevance. We can get creative with query construction, experimenting with different analyzers for our documents, and the search engine tries hard to provide the best results.

Nevertheless, a modern full-text search engine can do much more than that. At its core lies the inverted index, a highly optimized data structure for efficient lookup of documents matching the query. But it also lets us compute complex aggregations of our data, called facets.

The usual purpose of facets is to offer the user a faceted navigation, or faceted search. When you search for “camera” at an online store, you can refine your search by choosing different manufacturers, price ranges, or features, usually by clicking on a link, not by fiddling with the query syntax.

A canonical example of a faceted navigation at LinkedIn is pictured below.

Faceted search is one of the few ways to make powerful queries accessible to your users: see Moritz Stefaner's experiments with “Elastic Lists” for inspiration.

But we can do much more with facets than just displaying these links and checkboxes. We can use the data for making charts, which is exactly what we'll do in this article.

Live Dashboards

In almost any analytical, monitoring or data-mining service you'll hit the requirement “We need a dashboard!” sooner or later. Because everybody loves dashboards, whether they're useful or just pretty. As it happens, we can use facets as a pretty powerful analytical engine for our data, without writing any OLAP implementations.

The screenshot below is from a social media monitoring application which uses ElasticSearch not only to search and mine the data, but also to provide data aggregation for the interactive dashboard.

Ataxo Social Insider Dashboard

When the user drills down into the data, adds a keyword, or uses a custom query, all the charts change in real time, thanks to the way facet aggregation works. The dashboard is not a static snapshot of the data, pre-calculated periodically, but a truly interactive tool for data exploration.

In this article, we'll learn how to retrieve data for charts like these from ElasticSearch, and how to create the charts themselves.

Pie charts with a terms facet

For the first chart, we'll use a simple terms facet in ElasticSearch. This facet returns the most frequent terms for a field, together with occurrence counts.

Let's index some example data first.

curl -X DELETE "http://localhost:9200/dashboard"
curl -X POST "http://localhost:9200/dashboard/article" -d '
{ "title" : "One",
"tags" : ["ruby", "java", "search"]}
'
curl -X POST "http://localhost:9200/dashboard/article" -d '
{ "title" : "Two",
"tags" : ["java", "search"] }
'
curl -X POST "http://localhost:9200/dashboard/article" -d '
{ "title" : "Three",
"tags" : ["erlang", "search"] }
'
curl -X POST "http://localhost:9200/dashboard/article" -d '
{ "title" : "Four",
"tags" : ["search"] }
'
curl -X POST "http://localhost:9200/dashboard/_refresh"

As you see, we are storing four articles, each with a couple of tags; an article can have multiple tags, which is trivial to express in ElasticSearch's document format, JSON.

Now, to retrieve “Top Ten Tags” across the documents, we can simply do:

curl -X POST "http://localhost:9200/dashboard/_search?pretty=true" -d '
{
"query" : { "match_all" : {} },

"facets" : {
"tags" : { "terms" : {"field" : "tags", "size" : 10} }
}
}
'

You can see that we are retrieving all documents, and we have defined a terms facet called “tags”. This query will return something like this:

{
"took" : 2,
// ... snip ...
"hits" : {
"total" : 4,
// ... snip ...
},
"facets" : {
"tags" : {
"_type" : "terms",
"missing" : 0,
"terms" : [
{ "term" : "search", "count" : 4 },
{ "term" : "java", "count" : 2 },
{ "term" : "ruby", "count" : 1 },
{ "term" : "erlang", "count" : 1 }
]
}
}
}

We are interested in the facets section of the JSON, notably in the facets.tags.terms array. It tells us that we have four articles tagged search, two tagged java, and so on. (Of course, we could set the top-level size parameter to 0 in the query to skip the hits altogether.)
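
To make the shape of that array concrete, here is a minimal sketch of turning it into per-tag shares, which is roughly what the chart code later does with Protovis' pv.normalize. The helper name termShares is hypothetical, and the sample object simply mirrors the response above.

```javascript
// Hypothetical helper: convert a terms facet into shares of the total count.
function termShares(terms) {
  // Sum all counts first; an empty facet yields a total of 0.
  var total = terms.reduce(function (sum, t) { return sum + t.count; }, 0);
  return terms.map(function (t) {
    return {
      term: t.term,
      count: t.count,
      share: total === 0 ? 0 : t.count / total
    };
  });
}

// Sample data mirroring the facets.tags.terms array shown above.
var sampleResponse = {
  facets: { tags: { terms: [
    { term: "search", count: 4 },
    { term: "java",   count: 2 },
    { term: "ruby",   count: 1 },
    { term: "erlang", count: 1 }
  ] } }
};

var shares = termShares(sampleResponse.facets.tags.terms);
shares.forEach(function (s) {
  console.log(s.term + ": " + (s.share * 100).toFixed(1) + "%");
});
```

Note that the shares are relative to the sum of all tag counts (8 here), not to the number of articles (4), because one article can carry several tags.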

A suitable visualization for this type of ratio distribution is a pie chart, or its variation: a donut chart. The end result is displayed below (you may want to check out the working example).

We will use Protovis, a JavaScript data visualization toolkit. Protovis is 100% open source, and you could think of it as Ruby on Rails for data visualization; in stark contrast to similar tools, it does not ship with a limited set of chart types to “choose” from, but it defines a set of primitives and a flexible domain-specific language so you can easily build your own custom visualizations. Creating pie charts is pretty easy in Protovis.

Since ElasticSearch returns JSON data, we can load it with a simple Ajax call. Don't forget that you can clone or download the full source code for this example.

First, we need an HTML file to contain our chart and to load the data from ElasticSearch:

<!DOCTYPE html>
<html>
<head>
<title>ElasticSearch Terms Facet Donut Chart</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

<!-- Load JS libraries -->
<script src="jquery-1.5.1.min.js"></script>
<script src="protovis-r3.2.js"></script>
<script src="donut.js"></script>
<script>
$( function() { load_data(); });

var load_data = function() {
$.ajax({ url: 'http://localhost:9200/dashboard/article/_search?pretty=true'
, type: 'POST'
, data : JSON.stringify({
"query" : { "match_all" : {} },

"facets" : {
"tags" : {
"terms" : {
"field" : "tags",
"size" : "10"
}
}
}
})
, dataType : 'json'
, processData: false
, success: function(json, statusText, xhr) {
return display_chart(json);
}
, error: function(xhr, message, error) {
console.error("Error while loading data from ElasticSearch", message);
throw(error);
}
});

var display_chart = function(json) {
Donut().data(json.facets.tags.terms).draw();
};

};
</script>
</head>
<body>

<!-- Placeholder for the chart -->
<div id="chart"></div>

</body>
</html>

On document load, we retrieve exactly the same facet, via Ajax, as we did earlier with curl. In the jQuery Ajax callback, we pass the returned JSON to the Donut() function via the display_chart() wrapper.

The Donut() function itself is displayed, with annotations, below:

// =====================================================================================================
// A donut chart with Protovis - See http://vis.stanford.edu/protovis/ex/pie.html
// =====================================================================================================
var Donut = function(dom_id) {

if ('undefined' == typeof dom_id) { // Set the default DOM element ID to bind
dom_id = 'chart';
}

var data = function(json) { // Set the data for the chart
this.data = json;
return this;
};

var draw = function() {

var entries = this.data.sort( function(a, b) { // Sort the data by term names, so the
return a.term < b.term ? -1 : 1; // color scheme for wedges is preserved
}), // with any order

values = pv.map(entries, function(e) { // Create an array holding just the counts
return e.count;
});
// console.log('Drawing', entries, values);

var w = 200, // Dimensions and color scheme for the chart
h = 200,
colors = pv.Colors.category10().range();

var vis = new pv.Panel() // Create the basis panel
.width(w)
.height(h)
.margin(0, 0, 0, 0);

vis.add(pv.Wedge) // Create the "wedges" of the chart
.def("active", -1) // Auxiliary variable to hold mouse over state
.data( pv.normalize(values) ) // Pass the normalized data to Protovis
.left(w/3) // Set-up chart position and dimension
.top(w/3)
.outerRadius(w/3)
.innerRadius(15) // Create a "donut hole" in the center
.angle( function(d) { // Compute the "width" of the wedge
return d * 2 * Math.PI;
})
.strokeStyle("#fff") // Add white stroke

.event("mouseover", function() { // On "mouse over", set the "wedge" as active
this.active(this.index);
this.cursor('pointer');
return this.root.render();
})

.event("mouseout", function() { // On "mouse out", clear the active state
this.active(-1);
return this.root.render();
})

.event("mousedown", function(d) { // On "mouse down", perform action,
var term = entries[this.index].term; // such as filtering the results...
return (alert("Filter the results by '"+term+"'"));
})


.anchor("right").add(pv.Dot) // Add the left part of the "inline" label,
// displayed inside the donut "hole"

.visible( function() { // The label is visible when its wedge is active
return this.parent.children[0]
.active() == this.index;
})
.fillStyle("#222")
.lineWidth(0)
.radius(14)

.anchor("center").add(pv.Bar) // Add the middle part of the label
.fillStyle("#222")
.width(function(d) { // Compute width:
return (d*100).toFixed(1) // add pixels for percents
.toString().length*4 +
10 + // add pixels for glyphs (%, etc)
entries[this.index] // add pixels for letters (very rough)
.term.length*9;
})
.height(28)
.top((w/3)-14)

.anchor("right").add(pv.Dot) // Add the right part of the label
.fillStyle("#222")
.lineWidth(0)
.radius(14)


.parent.children[2].anchor("left") // Add the text to label
.add(pv.Label)
.left((w/3)-7)
.text(function(d) { // Combine the text for label
return (d*100).toFixed(1) + "%" +
' ' + entries[this.index].term +
' (' + values[this.index] + ')';
})
.textStyle("#fff")

.root.canvas(dom_id) // Bind the chart to DOM element
.render(); // And render it.
};

return { // Create the public API
data : data,
draw : draw
};

};

As you can see, with a simple transformation of the JSON data returned from ElasticSearch, we're able to create a rich, attractive visualization of the tag distribution among our articles.

It's worth repeating that the visualization will work in exactly the same way when we use a different query, such as displaying only articles written by a certain author or published in a certain date range.

Timelines with a date histogram facet

Protovis makes it very easy to create another common form of visualization: the timeline. Any type of data tied to a certain date, such as an article being published, an event taking place, or a purchase being completed, can be visualized on a timeline.

The end result should look like this:

So, let's store a handful of articles with a published date in the index.

curl -X DELETE "http://localhost:9200/dashboard"
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "1", "published" : "2011-01-01" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "2", "published" : "2011-01-02" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "3", "published" : "2011-01-02" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "4", "published" : "2011-01-03" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "5", "published" : "2011-01-04" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "6", "published" : "2011-01-04" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "7", "published" : "2011-01-04" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "8", "published" : "2011-01-04" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "9", "published" : "2011-01-10" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "10", "published" : "2011-01-12" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "11", "published" : "2011-01-13" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "12", "published" : "2011-01-14" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "13", "published" : "2011-01-14" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "14", "published" : "2011-01-15" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "15", "published" : "2011-01-20" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "16", "published" : "2011-01-20" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "17", "published" : "2011-01-21" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "18", "published" : "2011-01-22" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "19", "published" : "2011-01-23" }'
curl -X POST "http://localhost:9200/dashboard/article" -d '{ "t" : "20", "published" : "2011-01-24" }'
curl -X POST "http://localhost:9200/dashboard/_refresh"

To retrieve the frequency of articles being published, we'll use a date histogram facet in ElasticSearch.

curl -X POST "http://localhost:9200/dashboard/_search?pretty=true" -d '
{
"query" : { "match_all" : {} },

"facets" : {
"published" : {
"date_histogram" : {
"field" : "published",
"interval" : "day"
}
}
}
}
'

Notice how we set the interval to day; we could easily change the granularity of the histogram to week, month, or year.

This query will return JSON looking like this:

{
"took" : 2,
// ... snip ...
"hits" : {
"total" : 20,
// ... snip ...
},
"facets" : {
"published" : {
"_type" : "histogram",
"entries" : [
{ "time" : 1293840000000, "count" : 1 },
{ "time" : 1293926400000, "count" : 2 }
// ... snip ...
]
}
}
}

We are interested in the facets.published.entries array, as in the previous example. And again, we will need some HTML to hold our chart and load the data. Since the mechanics are very similar, please refer to the full source code for this example.
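
The time values in that array are milliseconds since the Unix epoch, and in the sample response each one falls on midnight UTC. A minimal sketch of formatting them for tick labels follows; the helper name formatDayMonthUTC is hypothetical. It uses the UTC getters because the local-time getters (getDate, getMonth) can report the previous day in time zones behind UTC.

```javascript
// Hypothetical helper: format a date histogram "time" value
// (epoch milliseconds) as D/M, read in UTC.
function formatDayMonthUTC(ms) {
  var date = new Date(ms);
  return [date.getUTCDate(), date.getUTCMonth() + 1].join("/");
}

// Entries copied from the sample response above.
var sampleEntries = [
  { time: 1293840000000, count: 1 },
  { time: 1293926400000, count: 2 }
];

sampleEntries.forEach(function (e) {
  console.log(formatDayMonthUTC(e.time) + " -> " + e.count);
});
```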

With the JSON data, it's very easy to create a rich, interactive timeline in Protovis by using a customized area chart.

The full, annotated code of the Timeline() JavaScript function is displayed below.

// =====================================================================================================
// A timeline chart with Protovis - See http://vis.stanford.edu/protovis/ex/area.html
// =====================================================================================================

var Timeline = function(dom_id) {
if ('undefined' == typeof dom_id) { // Set the default DOM element ID to bind
dom_id = 'chart';
}

var data = function(json) { // Set the data for the chart
this.data = json;
return this;
};

var draw = function() {

var entries = this.data; // Set-up the data
entries.push({ // Add the last "blank" entry for proper
count : entries[entries.length-1].count // timeline ending
});
// console.log('Drawing, ', entries);

var w = 600, // Set-up dimensions and scales for the chart
h = 100,
max = pv.max(entries, function(d) {return d.count;}),
x = pv.Scale.linear(0, entries.length-1).range(0, w),
y = pv.Scale.linear(0, max).range(0, h);

var vis = new pv.Panel() // Create the basis panel
.width(w)
.height(h)
.bottom(20)
.left(20)
.right(40)
.top(40);

vis.add(pv.Label) // Add the chart legend at top left
.top(-20)
.text(function() {
var first = new Date(entries[0].time);
var last = new Date(entries[entries.length-2].time);
return "Articles published between " +
[ first.getDate(),
first.getMonth() + 1,
first.getFullYear()
].join("/") +

" and " +

[ last.getDate(),
last.getMonth() + 1,
last.getFullYear()
].join("/");
})
.textStyle("#B1B1B1")

vis.add(pv.Rule) // Add the X-ticks
.data(entries)
.visible(function(d) {return d.time;})
.left(function() { return x(this.index); })
.bottom(-15)
.height(15)
.strokeStyle("#33A3E1")

.anchor("right").add(pv.Label) // Add the tick label (DD/MM)
.text(function(d) {
var date = new Date(d.time);
return [
date.getDate(),
date.getMonth() + 1
].join('/');
})
.textStyle("#2C90C8")
.textMargin("5")

vis.add(pv.Rule) // Add the Y-ticks
.data(y.ticks(max)) // Compute tick levels based on the "max" value
.bottom(y)
.strokeStyle("#eee")
.anchor("left").add(pv.Label)
.text(y.tickFormat)
.textStyle("#c0c0c0")

vis.add(pv.Panel) // Add container panel for the chart
.add(pv.Area) // Add the area segments for each entry
.def("active", -1) // Auxiliary variable to hold mouse state
.data(entries) // Pass the data to Protovis
.bottom(0)
.left(function(d) {return x(this.index);}) // Compute x-axis based on scale
.height(function(d) {return y(d.count);}) // Compute y-axis based on scale
.interpolate('cardinal') // Make the chart curve smooth
.segmented(true) // Divide into "segments" (for interactivity)
.fillStyle("#79D0F3")

.event("mouseover", function() { // On "mouse over", set segment as active
this.active(this.index);
return this.root.render();
})

.event("mouseout", function() { // On "mouse out", clear the active state
this.active(-1);
return this.root.render();
})

.event("mousedown", function(d) { // On "mouse down", perform action,
var time = entries[this.index].time; // eg filtering the results...
return (alert("Timestamp: '"+time+"'"));
})

.anchor("top").add(pv.Line) // Add thick stroke to the chart
.lineWidth(3)
.strokeStyle('#33A3E1')

.anchor("top").add(pv.Dot) // Add the circle "label" displaying
// the count for this day

.visible( function() { // The label is only visible when
return this.parent.children[0] // its segment is active
.active() == this.index;
})
.left(function(d) { return x(this.index); })
.bottom(function(d) { return y(d.count); })
.fillStyle("#33A3E1")
.lineWidth(0)
.radius(14)

.anchor("center").add(pv.Label) // Add text to the label
.text(function(d) {return d.count;})
.textStyle("#E7EFF4")

.root.canvas(dom_id) // Bind the chart to DOM element
.render(); // And render it.
};

return { // Create the public API
data : data,
draw : draw
};

};

Be sure to check out the documentation on the area primitive in Protovis, and watch what happens when you change interpolate('cardinal') to interpolate('step-after'). You should have no problem drawing a stacked area chart from multiple facets, adding more interactivity, and completely customizing the visualization.
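
One customization to consider: Timeline() places entries on the x-axis by array index, so when the response has no entry for a day, neighbouring entries are drawn side by side regardless of the gap between their dates. The sample data has no articles between 4 and 10 January, and this sketch assumes such empty days are simply absent from the facet response. A minimal way to pad the entries with zero counts before drawing is shown below; fillMissingDays is a hypothetical helper and assumes a day interval with timestamps on whole UTC days.

```javascript
var DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: insert zero-count entries for days missing between
// the first and last entry. Assumes entries are sorted by time and shaped
// like { time: <epoch ms>, count: <number> }.
function fillMissingDays(entries) {
  if (entries.length === 0) { return []; }

  // Index the known counts by timestamp for quick lookup.
  var byTime = {};
  entries.forEach(function (e) { byTime[e.time] = e.count; });

  var filled = [];
  var last = entries[entries.length - 1].time;
  for (var t = entries[0].time; t <= last; t += DAY_MS) {
    var known = Object.prototype.hasOwnProperty.call(byTime, t);
    filled.push({ time: t, count: known ? byTime[t] : 0 });
  }
  return filled;
}

// 4 January and 10 January 2011 (UTC), as in the sample data.
var sparse = [
  { time: 1294099200000, count: 4 },
  { time: 1294617600000, count: 1 }
];
var dense = fillMissingDays(sparse);
console.log(dense.length + " entries after filling");
```

The padded array can then be passed to Timeline().data() in place of the raw entries.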

The important thing to notice here is that the chart fully responds to any queries we pass to ElasticSearch, making it possible to simply and instantly visualize metrics such as “Display the publishing frequency of this author on this topic in the last three months”, with a query such as:

author:John AND topic:Search AND published:[2011-03-01 TO 2011-05-31]
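
One way to send such a query together with the facet is the query_string query, which accepts this Lucene-style syntax. Below is a minimal sketch of building the request body; the helper name buildTimelineRequest is hypothetical, and the author and topic fields are assumed to exist in your documents (they are not part of the sample data indexed above).

```javascript
// Hypothetical helper: build a search request body that combines a query
// string with a date histogram facet on the "published" field.
function buildTimelineRequest(queryString, interval) {
  return {
    // Fall back to match_all when no query string is given.
    query: queryString
      ? { query_string: { query: queryString } }
      : { match_all: {} },
    facets: {
      published: {
        date_histogram: { field: "published", interval: interval || "day" }
      }
    }
  };
}

var body = buildTimelineRequest(
  "author:John AND topic:Search AND published:[2011-03-01 TO 2011-05-31]",
  "week"
);

// The serialized body can be used as the Ajax "data" option,
// in the same way as in the HTML example earlier.
console.log(JSON.stringify(body, null, 2));
```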

tl;dr

When you need to make rich, interactive data visualizations for complex, ad-hoc queries, using data returned by facets from ElasticSearch may well be one of the easiest ways to do it, since you can just pass the JSON response to a toolkit like Protovis.

By adapting the approach and code from this article, you should have a working example for your data in a couple of hours.