May 29, 2015

Combining Geo Points With the Elasticsearch Percolator

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

Elasticsearch has very good support for geo locations. You can store locations, use distances to filter and even use distances to group using aggregations. Elasticsearch also has something called geo shapes - no worries, we'll explain later on what they are - to select locations within a certain area. If you combine these geo shapes with filters and something called the percolator, you can create a basic classification system. Curious? Read on to learn about the Elasticsearch percolator and geo support in Elasticsearch. To explain the concepts we use a sample application based on Spring Boot and AngularJS. Before we jump into the technical stuff, we'll talk about the data lifecycle that you need for all data projects.

Data Lifecycle, 5 Stages of Handling Data

The first step is to obtain the data. The second step is to transform the data into the form in which we want to store it. The third step is storing the data. The fourth step is to use the data in reports, analysis or, in our case, a dashboard. The final step is to learn from the data: improve the import or transport process, or track how people are using your data and feed that usage data back in as new data.

In this sample we use two data sources: a CSV file with all the Dutch zip codes and a JavaScript file containing the geo shapes for the Dutch provinces.

Obtaining the Data

CSV file with Dutch zip codes

You can download the CSV file from the following website: http://www.postcodedata.nl/download/

The following code block shows one line of the CSV file:

395614;"7940XX";79408888;7940;"XX";4;12;"mixed";"Troelstraplein";"Meppel";1082;"Meppel";119;"Drenthe";"DR";"52.7047653217626";"6.1977201775604";"209781.52077777777777777778";"524458.25733333333333333333";"postcode";"2014-04-10 13:20:28"

JS file with the Dutch provinces

The province shape data comes from a small GitHub project containing the geo shapes for all the Dutch provinces: https://gist.github.com/lekkerduidelijk/4387055

Transforming the Data

Transforming the zip codes was not too hard. We imported the CSV file into our application and mapped each line onto a much simpler object with fewer fields, keeping only the fields we need.
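As a rough illustration of this mapping step, the sketch below parses one semicolon-separated line into a reduced object. The class name PostalCodeLine and the column positions (postal code at index 1, city at 9, province at 13, latitude at 15, longitude at 16) are assumptions read off the sample line shown earlier; they are not taken from the sample project. The naive split also assumes that no field contains a semicolon.

```java
final class PostalCodeLine {
    final String postalCode;
    final String city;
    final String province;
    final double lat;
    final double lon;

    PostalCodeLine(String postalCode, String city, String province, double lat, double lon) {
        this.postalCode = postalCode;
        this.city = city;
        this.province = province;
        this.lat = lat;
        this.lon = lon;
    }

    // Column positions are assumptions based on the sample line in this article.
    static PostalCodeLine parse(String line) {
        String[] cols = line.split(";", -1);
        if (cols.length < 17) {
            throw new IllegalArgumentException("Expected at least 17 columns but got " + cols.length);
        }
        return new PostalCodeLine(
                unquote(cols[1]),
                unquote(cols[9]),
                unquote(cols[13]),
                Double.parseDouble(unquote(cols[15])),
                Double.parseDouble(unquote(cols[16])));
    }

    // Strips one pair of surrounding double quotes, if present.
    private static String unquote(String value) {
        String v = value.trim();
        if (v.length() >= 2 && v.startsWith("\"") && v.endsWith("\"")) {
            return v.substring(1, v.length() - 1);
        }
        return v;
    }

    public static void main(String[] args) {
        String line = "395614;\"7940XX\";79408888;7940;\"XX\";4;12;\"mixed\";\"Troelstraplein\";"
                + "\"Meppel\";1082;\"Meppel\";119;\"Drenthe\";\"DR\";\"52.7047653217626\";"
                + "\"6.1977201775604\";\"209781.52077777777777777778\";"
                + "\"524458.25733333333333333333\";\"postcode\";\"2014-04-10 13:20:28\"";
        PostalCodeLine p = parse(line);
        System.out.println(p.postalCode + " " + p.city + " (" + p.province + ") " + p.lat + "," + p.lon);
    }
}
```

Keeping only five of the columns mirrors the idea of a smaller object; which fields the real application keeps is not shown here.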

The province data was a two-dimensional array: the first dimension represents the provinces, and the second is a set of strings, each holding a lat,lon pair of doubles. We manually extracted each province into its own file containing a one-dimensional array of strings. So the files look like this: ["52.829113,6.151108","52.814176,6.177200", …]

A string containing two doubles might not seem very useful, and it looks like a prime candidate for transformation. However, as we will discuss in the Elasticsearch geo support section, we can create a GeoPoint directly from such a string.
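To make the file format concrete, here is a minimal sketch that reads such an array of "lat,lon" strings into numeric pairs using only the JDK. The sample project itself relies on Jackson and Elasticsearch's GeoPoint for this; the class name ProvincePoints and the regex-based parsing are assumptions made for this illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ProvincePoints {
    // Matches one quoted "lat,lon" entry and captures both numbers.
    private static final Pattern ENTRY = Pattern.compile(
            "\"\\s*(-?\\d+(?:\\.\\d+)?)\\s*,\\s*(-?\\d+(?:\\.\\d+)?)\\s*\"");

    // Returns one {lat, lon} pair per entry, in file order.
    static List<double[]> parse(String json) {
        List<double[]> points = new ArrayList<>();
        Matcher m = ENTRY.matcher(json);
        while (m.find()) {
            double lat = Double.parseDouble(m.group(1));
            double lon = Double.parseDouble(m.group(2));
            points.add(new double[] {lat, lon});
        }
        return points;
    }

    public static void main(String[] args) {
        String file = "[\"52.829113,6.151108\",\"52.814176,6.177200\"]";
        for (double[] p : parse(file)) {
            System.out.println("lat=" + p[0] + " lon=" + p[1]);
        }
    }
}
```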

Storing the Data

The zip codes were imported using a Java application that created a number of index requests. With Java and the Jackson library, creating documents to index is easy: create the object, serialize it using Jackson and pass it on to the Elasticsearch API. In a later section we take a short look at the structure of the object.

Often when using Elasticsearch, you will create a lot of documents. In the percolator case we create just 12 documents, one for each Dutch province, but they are special documents: percolator queries. This is a special Elasticsearch feature where you store the query instead of the document and then match documents against the stored queries. The queries we store are filtered queries using the Geo Polygon Filter.

Using the Data

To show what we can do with geo shapes and the percolator we want an application with maps. Without maps, what is the use of geo locations? We are going to show you how to find geo locations for cities; then, using these locations, we use the percolator to find the province for the location. Finally, we use the geo shape stored in the percolator query to highlight the province on the map. The next image shows the application we are creating.

screenshot created application

By now you should have an idea of what we did with the data. In the next section we explain the basic support for geo in Elasticsearch.

Elasticsearch Geo Support

Indexing Geo Points

Elasticsearch can index geo_points. A geo_point is represented by two doubles, one being the latitude and the other the longitude. Using this point Elasticsearch can perform calculations like distances and group by distance as well. If you want to store a location as a geo_point, you have to explicitly mention this in the mapping. If you do not do this, you will end up with just two fields of type double. The following code block shows the mapping for a field called location of type geo_point.

{
  "properties": {
    "location": {
      "type": "geo_point"
    }
  }
}

With the mapping in place we can add geo_points. There are three different formats for inserting geo points:

string -> location: "lat,lon"
object -> location: {"lat": .., "lon": ..}
array -> location: [lon,lat]

Notice the difference in the order of lat and lon between options 1 and 3; the array form follows the GeoJSON convention of [lon, lat]. The difference can be traced back historically to cartographers and computer scientists disagreeing on what the correct order is. We prefer option 2, as it is explicit about which value is the lat and which is the lon.
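The ordering trap is easy to see when the same point is written in all three notations. The sketch below is only an illustration built with plain string concatenation; the class and method names are made up for this example and are not part of Elasticsearch or the sample project.

```java
final class GeoPointFormats {
    // Option 1: a single string, latitude first.
    static String asString(double lat, double lon) {
        return "\"" + lat + "," + lon + "\"";
    }

    // Option 2: an object with explicit keys, so the order cannot be confused.
    static String asObject(double lat, double lon) {
        return "{\"lat\": " + lat + ", \"lon\": " + lon + "}";
    }

    // Option 3: an array, longitude first.
    static String asArray(double lat, double lon) {
        return "[" + lon + ", " + lat + "]";
    }

    public static void main(String[] args) {
        double lat = 52.5;
        double lon = 4.25;
        System.out.println(asString(lat, lon));
        System.out.println(asObject(lat, lon));
        System.out.println(asArray(lat, lon));
    }
}
```

All three methods take their arguments in the same (lat, lon) order; only the array output swaps them.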

Read more about the geo_point type in the reference manual of Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-geo-point-type.html

In the sample project we model our entities as Plain Old Java Objects (POJOs) and serialize them using Jackson. Our PostalCode object contains a Location object which has two properties called lat and lon. Serializing this object to JSON results in option 2, the object-based notation of the geo_point.

{
  "location": {
    "lat": ...,
    "lon":...
  }
}

Filtering on Geo Points

If you have an index available with some geo points, you can create filters to select documents based on these geo points. There are a number of filters available:

geo_bounding_box -> A rectangle that all points need to be in
geo_distance -> A central point with a radius
geo_distance_range -> A ring around a central point, more like a donut shape
geo_polygon -> Create your own polygon using locations

To get a feel for the structure of such a request, let us have a look at a query that uses the geo_distance filter. In my database I have all zip codes of the Netherlands, and I would like to filter on all postal codes that are within a 1 km radius of my home. Before we can run this query, we need the coordinates of my home. Of course I could query the postal code database, but it is more fun to use a dedicated website for this. Using this website I get the coordinates of my home: http://www.latlong.net/convert-address-to-lat-long.html

The coordinates are: lat 52.060669, lon 4.494025

The query that would find all zip codes within a 1km radius of my home would be:

GET /geostuff/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance": {
          "distance": "1km",
          "location": {
            "lat": 52.060669,
            "lon": 4.494025
          }
        }
      }
    }
  }
}
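To give an idea of what a distance filter has to decide for each document, here is a small sketch of a great-circle distance check using the haversine formula. This is a textbook formula with an assumed mean earth radius of 6371 km, not Elasticsearch's actual implementation, and the class and method names are made up for this illustration.

```java
final class GeoDistance {
    // Mean earth radius in kilometres; an assumption for this sketch.
    private static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance between two points in kilometres (haversine formula).
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // True when the second point lies within radiusKm of the first.
    static boolean withinRadius(double lat1, double lon1, double lat2, double lon2, double radiusKm) {
        return haversineKm(lat1, lon1, lat2, lon2) <= radiusKm;
    }

    public static void main(String[] args) {
        // Centre taken from the query above; the second point is an arbitrary nearby location.
        System.out.println(withinRadius(52.060669, 4.494025, 52.065669, 4.494025, 1.0));
    }
}
```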

If you want more information on the geo filters, there is this very good online resource: http://www.elastic.co/guide/en/elasticsearch/guide/current/geoloc.html

In the application that we are creating, we want to find the province a city is in. We do this using the Geo Polygon Filter. This is an expensive filter, but for the number of points we have it works well. The input for this filter is a set of points that together form a polygon; the filter matches all documents whose location falls within that polygon. So to find all zip codes for the province Zuid-Holland we could use the following query:

GET /geostuff/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_polygon": {
          "location": {
            "points": [
              "52.275300,4.453033",
              "52.224435,4.410805",
              ... left out around 250 other points
              "52.309633,4.564373",
              "52.328625,4.492962"
            ]
          }
        }
      }
    }
  }
}
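Conceptually, a polygon filter has to answer the question "is this point inside this polygon?" for every candidate document. The sketch below shows one common way to answer it, the ray casting (even-odd) test. It is an illustration of the idea only: how Elasticsearch implements the filter internally is not covered in this article, and the class name and the square test polygon are assumptions.

```java
final class PointInPolygon {
    // polygon holds {lat, lon} vertices; the edge from the last vertex back to the first is implied.
    static boolean contains(double[][] polygon, double lat, double lon) {
        boolean inside = false;
        for (int i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
            double latI = polygon[i][0];
            double lonI = polygon[i][1];
            double latJ = polygon[j][0];
            double lonJ = polygon[j][1];
            // Does the edge (j -> i) straddle the point's latitude, and does a ray
            // cast from the point towards increasing longitude cross it?
            boolean crosses = (latI > lat) != (latJ > lat)
                    && lon < (lonJ - lonI) * (lat - latI) / (latJ - latI) + lonI;
            if (crosses) {
                inside = !inside; // an odd number of crossings means inside
            }
        }
        return inside;
    }

    public static void main(String[] args) {
        // A made-up square, not a real province outline.
        double[][] square = {{52.0, 4.0}, {52.0, 5.0}, {53.0, 5.0}, {53.0, 4.0}};
        System.out.println(contains(square, 52.5, 4.5));
        System.out.println(contains(square, 51.5, 4.5));
    }
}
```

The cost of this test grows with the number of polygon vertices, which fits the remark that the filter is expensive for shapes with hundreds of points.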

How does this help us find the province for a city? That is where the percolator comes in, which is the main topic of the next section.

Using Percolator to Find Provinces

The percolator performs what is called an inverse query. This means that we store the query and use a document to find the queries that match it. Usually it is the other way around: we store documents and use a query to find them. So why is this interesting? Take a look at the previous code sample: this is a query for one province, and the Netherlands has 12. What we do is store all 12 queries in the percolator and then use the coordinates of a city to run over all queries and see if one matches. The matching query should be the one for the province that the city is in.
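The inversion can be mimicked in a few lines of plain Java: register queries under an id, then hand in a single document and collect the ids of the queries that match. The sketch below is a toy in-memory model meant only to show the idea; it has nothing to do with how Elasticsearch stores or executes percolator queries. The class TinyPercolator, the ids and the bounding boxes standing in for province polygons are all made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

final class TinyPercolator {
    // Stored queries: id -> predicate over (lat, lon).
    private final Map<String, BiPredicate<Double, Double>> queries = new LinkedHashMap<>();

    void register(String id, BiPredicate<Double, Double> query) {
        queries.put(id, query);
    }

    // Runs the "document" (a single location) against every stored query.
    List<String> percolate(double lat, double lon) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, BiPredicate<Double, Double>> entry : queries.entrySet()) {
            if (entry.getValue().test(lat, lon)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    // A simple rectangle query, used here as a stand-in for a province polygon.
    static BiPredicate<Double, Double> box(double minLat, double maxLat, double minLon, double maxLon) {
        return (lat, lon) -> lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon;
    }

    public static void main(String[] args) {
        TinyPercolator percolator = new TinyPercolator();
        percolator.register("province_a", box(52.0, 53.0, 4.0, 5.0)); // hypothetical area
        percolator.register("province_b", box(51.0, 52.0, 5.0, 6.0)); // hypothetical area
        System.out.println(percolator.percolate(52.5, 4.5));
    }
}
```

As with the real percolator response, the result contains only the ids of the matching queries; anything else about a match has to be looked up separately.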

Storing the Percolated Queries

Storing a percolator query works the same as storing any other document; you just store it in a separate type of your index. You can also store the queries in their own index, which does not really change the idea, but for now we store them as a separate type. In our sample we store the queries using the following Java code.

  public void doCreatePercolatorQuery(String province) throws IOException {
    client.prepareIndex(GEO_INDEX, ".percolator", "province_" + province)
        .setSource(jsonBuilder()
            .startObject()
                .field("query", createQuery(province + ".txt"))
                .field("province", province)
            .endObject()
        ).setRefresh(true).get();
  }

  private FilteredQueryBuilder createQuery(String provincePoints) {
    final List<GeoPoint> polygon = new ArrayList<>();
    // Each entry in the province file is a "lat,lon" string.
    List<String> geo;
    try {
      geo = new ObjectMapper().readValue(new ClassPathResource(provincePoints).getInputStream(), List.class);
    } catch (IOException e) {
      // Keep the original exception as the cause so the stack trace is not lost.
      throw new RuntimeException("Cannot read province file with points " + provincePoints, e);
    }

    // A GeoPoint can be created directly from a "lat,lon" string.
    geo.forEach(s -> polygon.add(new GeoPoint(s)));

    GeoPolygonFilterBuilder geoPolygonFilterBuilder = FilterBuilders.geoPolygonFilter("location");
    polygon.stream().forEach(geoPolygonFilterBuilder::addPoint);
    return filteredQuery(matchAllQuery(), geoPolygonFilterBuilder);
  }

Notice how we use the jsonBuilder to create the percolator document for the index GEO_INDEX and type .percolator. Each document gets an id equal to the name of the province prefixed with province_. In the field query we store the query created by the method createQuery. In this method we read a file containing the array with all the geo_points, just like in the example query in one of the previous code blocks. Executing this code for all provinces gives us 12 percolator queries in the index GEO_INDEX, which, by the way, is a constant containing geostuff.

Using the Percolator End Point

Now we want to use the coordinates of my home town to find the province using the percolator API. The following code block shows the request in JSON format as well as the response.

GET /geostuff/locations/_percolate
{
  "doc": {
    "location": {
      "lat": 52.060893,
      "lon": 4.534213
    }
  }
}

The response:
{
   "took": 3,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "total": 1,
   "matches": [
      {
         "_index": "geostuff-20150306114746",
         "_id": "province_zuidholland"
      }
   ]
}

Notice that we only get back the _id of the document. We can store more fields in the percolator document, but to obtain them we have to explicitly request the document using a standard get request. For the sample application we want to draw the found province on a map, so we need all the points of the shape and therefore have to fetch the percolator document. For completeness, the next code block shows the Java code for the same percolate request.

  public String checkLocationForProvince(double longitude, double latitude) throws IOException {
    // Build the document to percolate: {"location": {"lat": ..., "lon": ...}}
    XContentBuilder docToCheck = jsonBuilder()
        .startObject()
            .startObject("location")
                .field("lat", latitude)
                .field("lon", longitude)
            .endObject()
        .endObject();
    PercolateSourceBuilder.DocBuilder builder = new PercolateSourceBuilder.DocBuilder();
    builder.setDoc(docToCheck);

    PercolateResponse matches = client.preparePercolate()
        .setPercolateDoc(builder)
        .setIndices("geostuff")
        .setDocumentType("locations")
        .get();
    // Return the id of the first matching percolator query, if there is one.
    if (matches.getMatches().length > 0) {
      return matches.getMatches()[0].getId().toString();
    }
    return "Not in a province";
  }

The Sample Application

You can find the sample application on my GitHub account: https://github.com/jettro/geo-elastic

The sample is built with Spring Boot and Maven. If you have Maven installed you can run it using: mvn package && java -jar target/geo-elastic-0.1-SNAPSHOT.jar

It tries to connect to a cluster named performance. If you want a different cluster name, change it in src/main/resources/application.properties

If you want to import the postal codes as well, you have to download the following file: http://download.postcodedata.nl/data/postcode_NL_head.csv.zip

Unzip the file into the project directory: src/main/resources and start the application as mentioned before. Now go to the maintain tab (see the following image) and push the Restore percolators button and after that the Restore postal codes button. If everything goes well you should see 471993 postal codes right below the restore button.

screenshot maintain page

Before leaving you to experiment with this code I want to mention the angular-leaflet-directive, which made it very easy to integrate the map powered by Leaflet: https://github.com/tombatossals/angular-leaflet-directive

Summarizing

That’s it. We went from discussing the Data Lifecycle to showing it in practice with a sample based on zip codes and provinces in the Netherlands. You have seen how to import a CSV file and create the Elasticsearch mapping for geo_points. With the indexed data we went on to introduce geo filters, especially the geo polygon filter. In the end all this was used with the percolator to find the province for a city based on its coordinates.

The final step in the data lifecycle is learning from how your application is used. Since this is a sample application this might feel artificial. A good scenario could be to record which cities people search for; if a city is searched for more often, we can add it to the presets. We could also count how often each province is found and change the colors of the provinces that are found most often.

Now imagine what cool map based products you can create, because we all want cool map based products.