26 August 2014 Engineering

Using the Percolator for Geo Tagging

By Alexander Reelsen

TLDR: This blog post will show you how to use the percolator to enrich documents. In this particular use case we will find out which country a given latitude/longitude pair lies in.

Imagine you are using Yelp or Foursquare to search for restaurants. Your current location is characterized by latitude/longitude, as is the restaurant you are looking for. This is a typical geo use case that involves finding a bunch of points within a bounded area. So what if you already know the point, but want to find the bounded area?

Making the world JSON

To start, we need access to the borders of all countries, and we need to store them in Elasticsearch. The first step is really easy, as there are a couple of JSON files available that represent country shapes. In this post, we will stick with the GeoJSON-based set by Johan Sundström, available in the world.geo.json GitHub repo. I used countries.geo.json, which is available here.

The only drawback is that all the countries are stored in one JSON document in a single file. However, we need to register one percolator query per country.

Let’s start simple and create an index with a location type whose location field is mapped as a geo_shape.

PUT /countries
PUT /countries/location/_mapping
{
  "location" : {
    "properties" : {
      "location" : {
        "type" : "geo_shape"
      }
    }
  }
}

Registering percolators

The first step is to modify the existing JSON. Instead of going big and using Logstash for extraction, today we will stay on the command line and use the excellent underscore-cli. This client gives you the power of underscore on the command line, thanks to node.js.

If you want to install underscore-cli on OS X, just use brew and npm:

brew install node
npm -g install underscore-cli

Now, you can run underscore help on the command line.

IFS=$'\n'
for i in $(underscore select ':has(:root > .id)' < countries.geo.json --outfmt text | grep -v '"id":"-99"') ; do
    country=$(echo "$i" | underscore extract id | tr -d '"')
    geometry=$(echo "$i" | underscore extract geometry)

    curl -X PUT localhost:9200/countries/.percolator/$country -d "{ \"query\" : { \"geo_shape\" : { \"location\" : { \"shape\" : $geometry } } }, \"type\" : \"location\" }"
done

Quite a few things happen here that you might not be familiar with. First, the countries.geo.json file is parsed, and every entry that has an id is extracted using the select function. As we rely on the id to identify the country, the -99 value is filtered out in order not to index invalid ids. The id is the three-letter ISO code (for example USA or DEU), which becomes the name of the registered percolator. The last step is to extract the geo shape from the entry and then use the id and that shape to register a percolation query.

In case you are wondering about the first line: it makes sure the for loop treats only newlines as the input field separator. By default, spaces and tabs are separators as well, which would split an entry at every space it contains.

After this script has run, you can verify with the count API that queries have been registered.

GET /countries/.percolator/_count
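If you would rather not depend on underscore-cli, the extraction step of the shell loop above can be sketched in Python. This is only a sketch under assumptions: the function name percolator_queries and the tiny sample collection are made up for illustration, and the snippet only builds and prints the request bodies, it does not talk to Elasticsearch.

```python
import json


def percolator_queries(feature_collection, field="location", doc_type="location"):
    """Yield (country_id, body) pairs, one per usable GeoJSON feature.

    Hypothetical helper: it mirrors the body the curl call in the shell loop sends.
    """
    for feature in feature_collection.get("features", []):
        country_id = feature.get("id")
        # Skip entries without a usable id; the data set uses -99 as a placeholder.
        if country_id is None or str(country_id) == "-99":
            continue
        body = {
            "query": {"geo_shape": {field: {"shape": feature["geometry"]}}},
            "type": doc_type,
        }
        yield country_id, body


# Made-up sample data standing in for countries.geo.json (not real borders).
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "id": "AAA",
            "properties": {"name": "Squareland"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [10, 0], [10, 10], [0, 10], [0, 0]]],
            },
        },
        {
            "type": "Feature",
            "id": "-99",
            "properties": {"name": "Unknown"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[20, 20], [30, 20], [30, 30], [20, 20]]],
            },
        },
    ],
}

for country_id, body in percolator_queries(sample):
    # Each pair corresponds to one PUT against the .percolator type.
    print("PUT /countries/.percolator/%s" % country_id)
    print(json.dumps(body))
```

With the real file you would load countries.geo.json with json.load and send each body with the HTTP client of your choice.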

Finding countries

Now the infrastructure is set up and you can start querying. Whenever you get a document with latitude and longitude, you can easily find out the country that the location belongs to. See these example queries:

GET /countries/location/_percolate
{
  "doc": {
    "name": "Marienplatz Munich",
    "location": {
      "type": "point",
      "coordinates": [ 11.575448, 48.137393 ]
    }
  }
}

{
   ...
   "total": 1,
   "matches": [
      {
         "_index": "countries",
         "_id": "DEU"
      }
   ]
}


GET /countries/location/_percolate
{
  "doc": {
    "name": "Statue of Liberty",
    "location": {
      "type": "point",
      "coordinates": [ -74.0445, 40.689249 ]
    }
  }
}

{
   ...
   "total": 1,
   "matches": [
      {
         "_index": "countries",
         "_id": "USA"
      }
   ]
}


GET /countries/location/_percolate
{
  "doc": {
    "name": "Amsterdam",
    "location": {
      "type": "point",
      "coordinates": [ 4.8986166, 52.3747158 ]
    }
  }
}

{
   ...
   "total": 1,
   "matches": [
      {
         "_index": "countries",
         "_id": "NLD"
      }
   ]
}

As you can see, the _id field being returned contains the three-letter ISO code we used as the id when registering the percolator. You could use this result, extract the id and add it to the document in order to run aggregations per country.
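As a rough illustration of that enrichment step, here is a minimal Python sketch. The helper names percolate_request and add_country, as well as the target field name country, are assumptions made for this example. The response dictionary is copied from the Munich example above rather than fetched from a cluster. Note that GeoJSON points list longitude first, then latitude.

```python
def percolate_request(doc, lat, lon, field="location"):
    """Wrap a document in a percolate request body (hypothetical helper).

    GeoJSON coordinates are ordered [longitude, latitude].
    """
    body_doc = dict(doc)
    body_doc[field] = {"type": "point", "coordinates": [lon, lat]}
    return {"doc": body_doc}


def add_country(doc, percolate_response, target_field="country"):
    """Return a copy of doc with the first matching percolator id attached.

    Assumption: one match per point is expected; a document without a match
    is returned unchanged.
    """
    ids = [match["_id"] for match in percolate_response.get("matches", [])]
    enriched = dict(doc)
    if ids:
        enriched[target_field] = ids[0]
    return enriched


doc = {"name": "Marienplatz Munich"}
request_body = percolate_request(doc, lat=48.137393, lon=11.575448)

# Response shape taken from the example above; normally returned by _percolate.
response = {"total": 1, "matches": [{"_index": "countries", "_id": "DEU"}]}
enriched = add_country(request_body["doc"], response)
print(enriched["country"])
```

The enriched document can then be indexed as usual, and the country field becomes available for aggregations.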

In fact, this is just one example of how to use the percolator to enrich documents. You can consider this a pre-processing step. Even before you index your data, you can execute a percolate query, enrich your document with the data returned, and then index it. You could use this for categorization (e.g. if your document description matches a pizza query, it might be a restaurant and you could tag it as such) or enrich documents in order to aggregate on them (how many check-ins happened in Germany in the last month?). Or you could just track the movements of rental cars and make sure they never leave the country they were rented in.
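The rental car idea could look roughly like the following Python sketch. The function name left_rental_country is hypothetical, and treating a position without any match as "unknown" rather than as a violation is a design assumption of this example, not something the percolator prescribes.

```python
def left_rental_country(rental_country, percolate_response):
    """Check a percolate response against the country a car was rented in.

    Returns True if the car is outside rental_country, False if it is inside,
    and None if the position matched no registered country at all.
    """
    matched = {match["_id"] for match in percolate_response.get("matches", [])}
    if not matched:
        # Assumption: no match (open sea, gaps in the border data) means unknown.
        return None
    return rental_country not in matched


# Response shapes follow the percolate examples earlier in this post.
in_amsterdam = {"total": 1, "matches": [{"_index": "countries", "_id": "NLD"}]}
in_munich = {"total": 1, "matches": [{"_index": "countries", "_id": "DEU"}]}

print(left_rental_country("DEU", in_munich))
print(left_rental_country("DEU", in_amsterdam))
```

Using a set keeps the check correct even if a point near a border happens to match more than one country shape.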

You could try to get more accurate data (by state or by city or another region) and enrich based on that.

If you have more awesome (or rare!) use cases for the percolator, we would love to hear about them! Share them on Twitter and who knows, they may even be featured on our site!