July 15, 2015 User Story

Solving the local data problem with Elasticsearch and Docker Compose

By Shelby Switzer

Shelby Switzer is a software engineer and veteran nomad who recently settled in Denver, CO. Civic hacking in South Carolina was her first introduction to programming, and she has worked with brigades in Atlanta and Denver since then. She currently builds web services at Notion, where she's passionate about connecting IoT to the real world through the power of beautiful APIs. When not playing around with code, she can be found rafting or enjoying a delicious falafel.

Open data at the local government level may seem “smaller,” but it actually faces bigger challenges than open data at the national level. While efforts to open government data sets nationally are increasing, cities, towns, counties, and even states are often left behind. These smaller entities don't have the resources to implement APIs for their data, and community tech groups and developers who want to use this data constantly encounter obstacles ranging from fragmented, non-uniform data to finding (and funding) API hosting and maintenance.

But what if we could throw all that data in a box that's easy to open, close, and move around? What if we could bypass traditional solutions requiring infrastructure for hosting and maintenance? Enter Docker and Elasticsearch, and a simple three-layer, API-in-a-box solution that any developer can immediately turn on with docker-compose up.

The Need

Local brigades of Code for America (CfA) meet regularly across the country and even the world to bring together citizens from different backgrounds to improve their community and make government more accessible. This means routinely working with data provided by governments, typically in the form of Excel files, CSVs, other spreadsheets, and even PDFs, all containing interesting data such as government employee salaries or tax digests documenting business registrations. These data sets can consist of anywhere between ten rows and a few thousand rows, which is what I call “small data.”

Developers working on these projects come from very different backgrounds and use varying development environments. They must determine the best way to use this data and make it accessible not only to themselves but also to other citizens and projects. Most engineers are familiar with interacting with and building clients for HTTP APIs, but the projects typically don’t have funding, so an API often can’t be hosted.

In sum, any solution, or set of solutions, for these projects and their small data needs must:

  • make it easy (and free) to store and maintain data
  • require little or no change to the raw data
  • have, or be able to integrate with, a flexible, robust API (that doesn't need to be hosted)
  • be simple to set up locally or on any server

The Solution

While Elasticsearch is typically thought of in the context of big data and huge amounts of text, it rose to the top of the list of solutions for working with the small data we encounter in civic hacking. The spreadsheets of data we often work with are incomplete, have only text data types, contain duplicates, and contain fields that no one (not even the government officials who provided the sources) can explain but that should probably stick around in case they become useful later. Elasticsearch has a RESTful JSON API that is familiar to most developers and very intuitive to work with. Its powerful functions allow for easy sorting, de-duping, and searching of data without actually changing or deleting raw data.
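To make the "sort and de-dupe without touching the raw data" idea concrete, here is a minimal sketch of two Elasticsearch search request bodies built as plain Python dicts. The function names and the field name `name` are illustrative assumptions, not code from the API-in-a-Box project. De-duping is approximated here with a `terms` aggregation holding a single `top_hits` result per distinct value, which is one possible approach among several. It also assumes the field is mapped so that it can be sorted and aggregated on (for example, a non-analyzed or keyword field).

```python
import json


def sorted_search_body(order_by, size=10):
    """Return a search body that sorts hits by one field at query time."""
    return {
        "query": {"match_all": {}},
        "size": size,
        # Sorting happens in the response only; stored documents are unchanged.
        "sort": [{order_by: {"order": "asc"}}],
    }


def dedupe_search_body(dedupe_by, max_groups=100):
    """Return a search body that yields one representative hit per distinct value."""
    return {
        "query": {"match_all": {}},
        # No plain hits are needed; the aggregation carries the results.
        "size": 0,
        "aggs": {
            "dedupe": {
                "terms": {"field": dedupe_by, "size": max_groups},
                "aggs": {"first_hit": {"top_hits": {"size": 1}}},
            }
        },
    }


if __name__ == "__main__":
    # Either body could be sent as the JSON payload of a _search request.
    print(json.dumps(dedupe_search_body("name"), indent=2))
```

Because both operations live in the query, the duplicates and unexplained fields stay in the index exactly as they arrived.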

We can then put a hypermedia API layer over Elasticsearch so that we have a uniform interface for different data sets that is not only consumable by generic hypermedia clients but also exposes the queries that are available for this data. For example, if my data has Name, Lat, Long, and Business Type fields, then a Collection+JSON hypermedia API can expose those queries simply through the JSON response for the resource, so that anyone consuming this API knows what’s possible even without documentation.
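As a rough illustration of what such a response could look like, the sketch below wraps a few rows in a Collection+JSON document and advertises one query per field. The helper name `to_collection_json`, the use of the row position as an item identifier, and the sample row are all assumptions made for this example; the actual API-in-a-Box response may be shaped differently.

```python
import json


def to_collection_json(base_url, rows):
    """Wrap a list of flat dicts in a Collection+JSON document."""
    # Collect field names in first-seen order across all rows.
    fields = list(dict.fromkeys(key for row in rows for key in row))
    items = [
        {
            # Assumption: the row position stands in for a real identifier.
            "href": f"{base_url}/{position}",
            "data": [{"name": key, "value": value} for key, value in row.items()],
        }
        for position, row in enumerate(rows)
    ]
    # One query template per field tells clients which filters exist.
    queries = [
        {
            "rel": "search",
            "href": base_url,
            "prompt": f"Filter by {field}",
            "data": [{"name": field, "value": ""}],
        }
        for field in fields
    ]
    return {
        "collection": {
            "version": "1.0",
            "href": base_url,
            "items": items,
            "queries": queries,
        }
    }


if __name__ == "__main__":
    sample = [{"name": "Corner Cafe", "lat": 39.74, "long": -104.99, "business_type": "food"}]
    print(json.dumps(to_collection_json("http://localhost:4567/resources", sample), indent=2))
```

A generic client can read the `queries` array and discover the available filters without any documentation.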

But Elasticsearch and a hypermedia API need hosting just like any other database or server, so how can we use them for a solution that can be simply set up anywhere and not need paid hosting? Enter Docker.

With a Docker container for Elasticsearch and another for the hypermedia API, and Docker Compose to link and start them together smoothly, we can create an API experience that is uniform, uses remote data, and has powerful search and data manipulation functionality, all without needing to be hosted anywhere beyond your local laptop.

The final process is made up of these four easy steps:

  1. Put raw spreadsheet(s) in a free GitHub repository (or other open platform)
  2. Install Docker and pull the API-in-a-Box code or image (or make your own)
  3. Run docker-compose up (this command spins up the Elasticsearch and web server containers, pulls the data from the remote storage, and loads it into Elasticsearch)
  4. Hit the API at uniform endpoints, like http://localhost:4567/resources, which returns all of the resources in JSON. The response also includes metadata describing the available query parameters for filtering by each field (e.g. “name”, “lat”, “long”, “business_type”) as well as ones that use Elasticsearch’s functionality, such as “dedupe_by” and “order_by”.
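The data-loading part of step 3 can be sketched roughly as follows: each CSV row becomes one document in the newline-delimited format that the Elasticsearch bulk API expects. This is an illustration under stated assumptions, not the project's actual loader. The function name `csv_to_bulk`, the index name `resources`, and the use of the row number as a document id are hypothetical. Downloading the file and sending the body to the `_bulk` endpoint are left out, and older Elasticsearch versions also expect a `_type` in each action line.

```python
import csv
import io
import json


def csv_to_bulk(csv_text, index_name):
    """Convert CSV text into an Elasticsearch bulk request body (NDJSON)."""
    lines = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        # Action line: index this row under an id derived from its position.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(row_number)}}))
        # Source line: the row exactly as it appears in the spreadsheet,
        # with every value kept as text.
        lines.append(json.dumps(row))
    if not lines:
        return ""
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample_csv = "name,business_type\nCorner Cafe,food\nMain Street Books,retail\n"
    print(csv_to_bulk(sample_csv, "resources"), end="")
```

Keeping the rows as untyped text mirrors the "little or no change to the raw data" requirement: the cleanup happens at query time rather than at load time.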

With an API-in-a-Box solution using Elasticsearch and Docker, anyone, anywhere, who has the raw data only needs to put it in an accessible place. They can then spin up a robust, flexible API on top of the data and use it in any application they might want to build. It removes the local data problem from the civic hacking equation.
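For an application consuming the local API, building request URLs might look like the sketch below. The parameter names come from the examples in this post, but the exact wire format is an assumption, and `resources_url` is a hypothetical helper rather than part of any published client.

```python
from urllib.parse import urlencode


def resources_url(base="http://localhost:4567/resources", **params):
    """Build a query URL for the local API, skipping parameters set to None."""
    query = urlencode({key: value for key, value in params.items() if value is not None})
    return f"{base}?{query}" if query else base


if __name__ == "__main__":
    # Filter by a field and ask for sorted, de-duplicated results.
    print(resources_url(business_type="food", order_by="name", dedupe_by="name"))
```

The resulting URL can be fetched with any HTTP client once the containers are running.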