Tech Topics

Tiny Data: Rapid Development with Elasticsearch

Today we're pleased to bring you the story of the creation of SeeMeSpeak, a Ruby application that allows users to record gestures for those learning sign language. Florian Gilcher, one of the organizers of the Berlin Elasticsearch User Group, participated in a hackathon last weekend with three friends, resulting in this brand-new open source project using Elasticsearch on the back end.

Last weekend we participated in RailsRumble 2013, a 48-hour hacking contest to build something fancy on top of Ruby (nowadays, Rails is not required). Beyond fancy, we wanted to build something that solves a real problem and that we could continue developing after the hackathon concluded. It turns out that we found a common itch to scratch: our entire group was interested in sign language.

The Project

Sadly, there are almost no good learning resources for sign language on the internet. Where material is available, licensing is a hassle, or both the licensing and the material are poorly documented. Documenting sign language yourself is also hard, because producing and collecting videos is cumbersome: you need third-party recording tools, video conversion and manual categorization. That's a sad state in a world where every notebook has a usable camera built in!

Our idea was to leverage modern browser technologies to provide an easy recording function and a quick interface to categorize the recorded words. The result is SeeMeSpeak.

We Have a Hundred Problems...

Handling UserMedia (video and audio) in the browser is still in its infancy, especially recording, which mostly works through elaborate hacks like whammy.js that need to be set up very carefully. Given only about 24 hours of full work - our team valued sleep and rest - we expected recording and proper video conversion to take up most of our time.

... The Data Store Shouldn't Be One of Them

Thus, we were looking for a data store that:

  • is easy to set up for all developers
  • is easy to handle on the server
  • solves our immediate problems
  • helps us to iterate quickly
  • allows for future refinements in our problem space
  • is resource-friendly

The last point stems from the contest rules: all applications have to run on a Linode instance with 1GB of memory. Considering that we have to convert videos on the same machine, we cannot be wasteful.

Our problem is about language and search, so Apache Lucene is an obvious pick. Three of our team members had already worked with Elasticsearch, so the decision was made quickly: we went with Elasticsearch and couldn't be happier with it.

Elasticsearch for Tiny Data

Elasticsearch supported the goals of our fast prototype very well:

Setting up a reasonably configured Elasticsearch instance as a developer is as easy as:

  • download and unzip
  • cd elasticsearch; bin/elasticsearch -f

With packages available for many platforms, Elasticsearch is quickly configured on the server as well. Also, contrary to popular belief, Elasticsearch can run in a small memory footprint by setting ES_HEAP_SIZE to a small value. Given that we expected fewer than 5,000 documents for our initial prototype, we were easily able to fit Elasticsearch into around 256MB of memory.

Setting up the new official Ruby client was also a breeze; it integrates well with the connection-less data models provided by Virtus.
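
To give a flavor of how little glue this needs, here is a minimal sketch of wiring the official elasticsearch-ruby client to a Virtus model. The Video class, its fields and the "signs" index name are illustrative stand-ins rather than the actual SeeMeSpeak code, and the calls target the 0.90-era API we used at the time:

require 'elasticsearch'
require 'virtus'

# Plain value object: Virtus only handles attributes and coercion,
# there is no connection or persistence logic in the model itself.
class Video
  include Virtus.model

  attribute :transcription, String
  attribute :tags,          Array[String]
  attribute :flags,         Array[String]
  attribute :reviewed,      Boolean, default: false
  attribute :language,      String
end

client = Elasticsearch::Client.new # defaults to localhost:9200

video = Video.new(transcription: 'hello from seemespeak',
                  tags: ['funny'], flags: ['casual'], language: 'ASL')

# With dynamic mapping there is no schema to declare up front:
# the attribute hash can be indexed as-is.
client.index(index: 'signs', type: 'video', body: video.attributes)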

On the content side, we had three main problems: tagging, flagging and transcription. The transcription of a video should be searchable. At the same time, we want to flag videos by a fixed set of criteria (if a word is vulgar or an insult, we want to know). Tagging is free-form and allows users to categorize words. All of this is possible using Elasticsearch's dynamic mapping. Our main datatype ended up looking roughly like this:

{
  "transcription": "hello from seemespeak",
  "tags": ["funny"],
  "flags": ["casual"],
  "reviewed": false,
  "language": "ASL"
}

This is a very workable application-side data format that Elasticsearch's dynamic mapping indexes very well. More specifically, it gives you all you need to search it as expected:

  • The match query makes implementing a search box over the transcription field very easy.
  • tags, flags and review status can be easily handled using the term and terms filters. A bool filter helps you combine them.
  • For the random video on the front page and the full list, we used the newly introduced function_score query with a randomly changing seed, giving every user different but stable results (all three are sketched below).
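
For illustration, here is roughly what these searches can look like through the Ruby client. The index name and the literal values are made up for the example, and the filter syntax follows the 0.90/1.x-era query DSL available at the time:

require 'elasticsearch'
client = Elasticsearch::Client.new

# Search box: a match query against the transcription field.
client.search(index: 'signs', body: {
  query: { match: { transcription: 'hello' } }
})

# Tags, flags and review status: exact-value term/terms filters,
# combined with a bool filter. This clause slots into the filtered
# query shown in the next sketch.
curation_filter = {
  bool: {
    must: [
      { terms: { tags:  ['funny'] } },
      { terms: { flags: ['casual'] } },
      { term:  { reviewed: true } }
    ]
  }
}

# Random video on the front page: function_score with a seed that
# changes per visitor rather than per request, so results differ
# between users but stay stable while browsing.
visitor_seed = 42 # in the app, derived from the visitor's session
client.search(index: 'signs', body: {
  size: 1,
  query: {
    function_score: {
      query:        { match_all: {} },
      random_score: { seed: visitor_seed }
    }
  }
})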

We found that Elasticsearch's query language makes it very easy to build queries iteratively. We started with a plain match_all query to give the frontend developers enough content to start building, and added filters and randomization as the first day went on. By the end of that day, all our search scenarios were covered.
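
The shape of that iteration is easy to show. The helper below is a hypothetical sketch rather than the actual SeeMeSpeak code: the listing search keeps match_all as its base query and simply gains filter clauses (such as the curation filter above) as they get written:

require 'elasticsearch'
client = Elasticsearch::Client.new

# Hypothetical helper: layer optional filter clauses around match_all
# without rewriting the search call each time.
def listing_body(filter_clauses = [])
  return { query: { match_all: {} } } if filter_clauses.empty?

  {
    query: {
      filtered: {
        query:  { match_all: {} },
        filter: { bool: { must: filter_clauses } }
      }
    }
  }
end

# Hour one: return everything, so the frontend team has content to render.
client.search(index: 'signs', body: listing_body)

# Later that day: the same call, now restricted to reviewed videos.
client.search(index: 'signs', body: listing_body([{ term: { reviewed: true } }]))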

The current SeeMeSpeak is a prototype. We didn't venture into fine-tuning by implementing custom analyzers, e.g. for proper stemming and synonym searches. We were confident that we could do this on the second day, but wanted to invest our time in other things, like translating our German corpus to English for our English-speaking visitors and translating the whole site into two languages.

All in all, Elasticsearch let us push out a minimum viable version of our product with usable features over the course of a weekend, without worrying too much about data storage.

Wrapping Up

Elasticsearch is not only a great database for "big data," it is also a great fit for rapid application development in the early stages of a project. We were able to validate all our scenarios and use cases very quickly. As always, investing time and effort into your data store will improve results in the long run, but we were impressed by how many features we got out of Elasticsearch so quickly, without any fine-tuning up front.

How You Can Help

SeeMeSpeak is open source and solves a real issue: building an openly licensed and well-tagged corpus for sign languages. You can contribute on GitHub. You can also help us by voting or leaving feedback on the RailsRumble site. Feel free to try out our recording function. Just try to say hello.

Thank you!

Florian Gilcher, Elasticsearch Berlin User Group Co-Organizer, Klaus Fl, Jan Nietfeld, Bodo Tasche

Many thanks to Florian, Klaus, Jan and Bodo for sharing their experiences with us. We are always excited to share stories from the community about how they are using Elasticsearch, so please reach out to me any time you'd like to share your story.