May 23, 2014 News

Elasticsearch Teams up with MIT Sloan for Data Analytics Hackathon

By Sejal Korenromp

Following the success and popularity of the Hopper Hackathon, in which we participated in late 2013, last week we sponsored the MIT Sloan Data Analytics Club Hackathon, our latest offering to Elasticsearch aficionados. More than 50 software engineers, business students and other open source software enthusiasts signed up to participate, and on a Saturday to boot! The full day's festivities included access to a huge storage and computing cluster, and everyone was set free to create something awesome using Elasticsearch.

With no time to lose, we kick-started the session with Igor Motov and Binh Ly, two of our software engineers. They gave an overview of the Elasticsearch ELK stack, followed by an in-depth tutorial to get new users up to speed with the search engine's features and implementation. Even those who had never been exposed to these technologies were wowed, especially by Kibana and how its simplicity and beauty provide a powerful data visualization solution.

Binh Ly in action: Kibana, Simple & Beautiful

The participants then split into 5 teams, and from that moment on it was heads-down hacking time. Whilst the teams were diving into real-world data sets – including Tweets, Wikipedia and TMDB datasets – our expert duo were on hand to answer questions, help troubleshoot, and advise and inspire along the way. No hackathon is complete without an element of competition, so with a prize incentive up for grabs, the teams were unstoppable.

The five finalists came up with some incredibly cool hacks, including:

1) Quimbly – A Digital Library – by Chris

Chris had the idea of taking ebooks from Project Gutenberg, extracting content and metadata, indexing it into Elasticsearch and providing a single search box that allows you to easily search through all the texts. Project Gutenberg contains 45K ebooks and a public domain API. He used Python to index the content into Elasticsearch and Node.js to build a search webpage. He was also thinking of eventually allowing multiple users to do collaborative annotations on each ebook.

2) Brand Sentiment Analysis – by Lior, Jon and Tor

This team looked at sentiment analysis for a number of food chains: Burger King, McDonald's, Subway and Chipotle. Using Twitter mentions, they looked for these brand names and associated sentiment words, e.g. "like" or "dislike" terms such as "awesome," "great," "best," "me gusto," "fun," "worst," "bad," etc. Based on their counts and ratios, they determined that Burger King had the most positive sentiment, and Chipotle had the most negative sentiment. Invaluable data for the next snack attack you might have, really.
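The counting approach described above can be sketched in a few lines of Python. This is only an illustration, not the team's actual code: the word lists are a subset of the example terms quoted in this post, and the function names `score_tweets` and `positive_ratio` are hypothetical.

```python
import re
from collections import Counter

# Assumed word lists: only a few of the example terms quoted in the post.
POSITIVE = {"awesome", "great", "best", "fun"}
NEGATIVE = {"worst", "bad"}
BRANDS = ["burger king", "mcdonald's", "subway", "chipotle"]


def score_tweets(tweets):
    """Count positive and negative term hits for each brand a tweet mentions."""
    counts = {brand: Counter() for brand in BRANDS}
    for text in tweets:
        lowered = text.lower()
        # Distinct lowercase words in the tweet (apostrophes kept).
        words = set(re.findall(r"[a-z']+", lowered))
        pos = len(words & POSITIVE)
        neg = len(words & NEGATIVE)
        for brand in BRANDS:
            if brand in lowered:
                counts[brand]["positive"] += pos
                counts[brand]["negative"] += neg
    return counts


def positive_ratio(counter):
    """Share of positive hits among all sentiment hits, or None if there are none."""
    total = counter["positive"] + counter["negative"]
    return counter["positive"] / total if total else None
```

Comparing `positive_ratio` across brands gives the kind of ranking the team reported. A simple term match like this ignores negation and sarcasm, so it is best read as a rough signal.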

3) Conference Data – by Theja

Theja had the idea of taking data on papers submitted to the CVPR 2013 conference, extracting metadata and PDF content, indexing it into Elasticsearch and making it searchable. Currently, only a few pieces of metadata (e.g. title and author) are searchable online, so making the body/content of these papers searchable benefits research by making discovery of these papers much easier. He used Python to do all the processing and indexing, standardizing the data into a common schema (e.g. paper ID, filename/url, keywords, authors, body, etc.).
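As a rough sketch of what standardizing records into a common schema might look like in Python: the field names below follow the examples listed above, while the `PaperDoc` class, the `normalize` and `to_json` helpers and the raw input format (semicolon-separated authors, comma-separated keywords) are assumptions made for illustration. PDF extraction and the actual indexing call are left out.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PaperDoc:
    """Hypothetical common schema, based on the fields named in the post."""
    paper_id: str
    url: str
    title: str
    authors: list
    keywords: list
    body: str


def normalize(raw):
    """Map one loosely structured record (a dict) onto the common schema."""
    return PaperDoc(
        paper_id=str(raw["id"]).strip(),
        url=raw.get("url", "").strip(),
        # Collapse runs of whitespace left over from text extraction.
        title=" ".join(raw.get("title", "").split()),
        authors=[a.strip() for a in raw.get("authors", "").split(";") if a.strip()],
        # Lowercase and de-duplicate keywords so aggregations group them together.
        keywords=sorted(
            {k.strip().lower() for k in raw.get("keywords", "").split(",") if k.strip()}
        ),
        body=" ".join(raw.get("body", "").split()),
    )


def to_json(doc):
    """Serialize a normalized document as JSON, ready to send to a search index."""
    return json.dumps(asdict(doc), ensure_ascii=False)
```

Normalizing keywords and whitespace before indexing is a design choice that keeps later aggregations from splitting the same value across several buckets.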

Theja then used Kibana to dig into his newly generated dataset, as well as perform some aggregations. He came up with some interesting discoveries, such as which nationalities are most common among paper submitters and which topics appear most often in the body of the papers.

4) Twitter based sentiment analyzer – by Siddharth

Siddharth's idea was to provide the ability to dynamically search on any topic/brand in the Twitter stream and dynamically perform sentiment analysis on Tweets. This sentiment analyzer could be used to search for movies and see how people rate a movie on Twitter. He also described how he could extend this to other ideas like machine learning and correlation. For example, sentiment on a specific product could be correlated to a company's stock price performance.

5) Statistics on Movies and Wikipedia – by Titka and Jacek

Since they had no prior Elasticsearch experience, this team decided to use Kibana, exploring Wikipedia data for a correlation between a person's profession and their chances of success. They also dived into the TMDB dataset, analyzing movies by characteristics such as cost, running time, revenue, language and country. Using Kibana's panel overviews, they were able to visualize their data and find relationships and correlations among these characteristics. For example, they demonstrated how cost was correlated with runtime and revenue.

Home stretch!

And the winner is …

All in all, it was a tough competition, and every team had great ideas for ways to analyze their data sets with the ELK stack. This time the first prize went to Titka and Jacek for their exploration of Statistics on Movies and Wikipedia – well done to you both. For folks who didn't walk away with Amazon gift cards, the fine folks at O'Reilly Media donated a number of titles on data science from their quite impressive catalog. (Thanks again, Marsee!)

Thanks once again to the crew at the MIT Sloan Data Analytics Club, especially Lior Belenki, for organizing and supporting this initiative. A huge shout out of love to our developers Igor and Binh for helping spread the knowledge of the ELK stack amongst all our attendees. And, of course, thanks to the MIT Sloan School of Management for hosting what turned out to be yet another great community activity around Elasticsearch!