Tech Topics

Enriching Your Searches With Open Geo Data

Have you ever found some neat public datasets where you thought it might make sense to include it in your application to optimize a specific feature, even a minor feature? Sure you have! This blog post shows you how to use Logstash to get an external dataset in the desired format, using Kibana to check if it is applicable, and then making sure the data gets indexed correctly into Elasticsearch so it can do the heavy lifting in production.

Getting the data

I will assume from now on that you have a current Elasticsearch and Logstash installation available. I will use Elasticsearch 1.1.0 and Logstash 1.4.0 in these examples.

OpenGeoDB is a German website, which includes German geo information in SQL and CSV formats. As we want to store data in Elasticsearch, SQL is not an option, neither is CSV out of the box, however we can use Logstash to preprocess and index data in Elasticsearch.

The file we are going to parse is a list of German cities, including all of their zip codes. The dataset is available on the OpenGeoDB site and these files are licensed as public domain.

Data format

Taking a look at the data reveals it consists of a number of columns:

• loc_id: some unique identifier
• ags: official administrative id
• ascii: uppercase sort name
• name: name of the entity
• lat: coordinate latitude
• lon: coordinate longitude
• amt:
• plz: zip code (if more than one like for bigger cities, then comma separated)
• vorwahl: dial code
• einwohner: population
• flaeche: area
• kz: license tag
• typ: administrative name
• level: integer which defines its locality level
• of: maps as a ‘part of’ another id
• invalid: marked as invalid

Interesting fields to play around with might be name, lat, lon, area, population and license tag. More about that soon…

Indexing the data into Elasticsearch

Using the Logstash CSV filter

The next step is to get the data into Elasticsearch. The first step is to create a Logstash configuration, so I’ll copy this configuration into a file named opengeodb.conf. Note that we are using a filter called “csv”, even though the field separator is a tab, and not a comma.

input {
stdin {}
}
filter {
# Step 1, possible dropping
if [message] =~ /^#/ {
drop {}
}
# Step 2, splitting
csv {
# careful... there is a "tab" embedded in the next line:
# if you cannot copy paste it, press ctrl+V and then the tab key to create the control sequence
# or maybe just tab, depending on your editor
separator => '      '
quote_char => '|' # arbitrary, default one is included in the data and does not work
columns => [ 'id', 'ags', 'name_uc', 'name', 'lat', 'lon', 'official_description', 'zip', 'phone_area_code', 'population', 'area', 'plate', 'type', 'level', 'of', 'invalid' ]
}
# Step 3, possible dropping
if [level] != '6' {
drop {}
}
# Step 4, zip code splitting
if [zip] =~ /,/ {
mutate {
split => [ "zip", "," ]
}
}
# Step 5, lat/lon love
if [lat] and [lon] {
# move into own location object for additional geo_point type in ES
# copy field, then merge to create array for bettermap
mutate {
rename => [ "lat", "[location][lat]", "lon", "[location][lon]" ]
add_field => { "lonlat" => [ "%{[location][lon]}", "%{[location][lat]}" ] }
}
}
# Step 6, explicit conversion
mutate {
convert => [ "population", "integer" ]
convert => [ "area", "integer" ]
convert => [ "[location][lat]", "float" ]
convert => [ "[location][lon]", "float" ]
convert => [ "[lonlat]", "float" ]
}
}
output {
elasticsearch {
host => 'localhost'
index => 'opengeodb'
index_type => "locality"
flush_size => 1000
protocol => 'http'
}
}


Before going on, this is how you would call Logstash in order to index the data. You should fire up Elasticsearch first.

cat DE.tab | logstash-1.4.0/bin/logstash -f opengeodb.conf


So, quite a lot of stuff going on here (might take a minute, depending highly on your hardware). The first thing you might notice: the Logstash configuration does not ingest events using a file input. The reason for this is, that the file input behaves like a tail -f under UNIX systems and waits for new data to be appended to the file. The data file we have however has a defined end, so it is more appropriate to read of all its data using the stdin input.

The filter section consists of five steps. So, lets go through each and every step and explain what it does

The first step drops comments, which are defined as lines starting with a hash. This is needed because the supplied tab-separated file’s first line is such a comment, which contains the names of the columns, and there is no need to index that.

Step 2 – CSV extraction

The second step is doing all the hard CSV work. You have to define the separator as a tab, the quote_char is by default a single ", which is used in our data and thus has to be changed to something else. The columns setting defines the names of the columns, which are used as field names from then on.

Important: When you copy and paste the configuration file from above, the separator character might need to replaced, as it is copied and represented as a couple of spaces instead of tab. Please check that, in case it is not working as expected.

Step 3 – Removing certain entries

We only need the entries from the CSV file which represent a city (those have the level set to 6). We simply remove the rest by dropping all other events.

Step 4 – Zip code handling

The fourth step is for better zip code handling. If an entry has more than one zip code (like bigger cities), all the zip codes are contained in the field, but separated by commas. In order to store those as an array and not one big string, you can use the mutate filter to split that field. Storing this data in an array (as number) would allow you do to numeric range queries for example.

Step 5 – Geo data structures

The fifth step is to cater a little bit better for geo data. When reading from the DE.dat input file, lat and lon fields are created. However those fields only have a meaning when they are stored together. This logic stores both fields in two data structures. One looks like an Elasticsearch geo_point and results in a { "location" : { "lat": x, "lon": y }} structure. The other is a simple array and contains the longitude and latitude (in that order!), so we can use a Kibana bettermap to draw it.

Step 6 – Explicit field conversion

The last filter step is to explicitly set the data types of some fields, allowing numeric operations on those in Elasticsearch later.

The output section uses features available only in the Logstash 1.4 release and later, so make sure you use at least that version. In the former version you had to specify the elasticsearch_http output explicitly. Going forward, there will only be one elasticsearch output, and you can specify protocol => http to use HTTP over port 9200 to communicate with Elasticsearch.

Using kibana & elasticsearch to gather insights

After having indexed some data, we could use kibana to get some further insights. Using a bettermap widget and some small search queries like population:[100000 TO *] we could show every bigger city in Germany.

You can use this for a couple of things:

• Cities with the highest populations
• Find out the various cities, which use a certain license plate (for example GT for Gütersloh and sorroundings)
• Use a script aggregation to find out the areas with the most dense or sparse population per square kilometer – you could alternatively precalculate this in Logstash as well

Even though this is nice, it does not help to improve existing applications. We need to go deeper.

So, lets step back for a moment and check what useful things one could possibly do with this data. We’ve got cities, postal codes… and there are plenty of occasions in web applications, where one has to enter exactly these data points.

A good example is during the checkout process of a purchase, as not every system has all their customer’s data saved already; you may be the kind of shop that frequently processes one-time orders or simply allows to orders without user registration. In this case, it might make sense to help the customer to speed up the order process, with the bonus of preventing order loss or cancellation due to a difficult purchase flow.

Elasticsearch has a very fast prefix suggester called the completion suggester. This suggester has the disadvantage that you need to enrich the results a bit before indexing your data, but this is exactly what Logstash is for. In order to best understand this example, you may want to take a look at this introduction to the completion suggester.

The completion suggester

So, let’s decide that we want to help the user typing in the city he/she lives in. And we also want to provide a list of zip codes, so it is easier to find the right one once the correct city has been selected. You could also do this vice versa and let the user type the zip code and then help by autofilling the city information.

Time to add a couple of things to the Logstash configuration to make this work. Lets start with the easy part, the configuration inside of the filter. Add this snippet of configuration to your opengeodb.conf, right after step 5 and before step 6:

  # Step 5 and a half
# create a prefix completion field data structure
# input can be any of the zips or the name field
# weight is the population, so big cities are preferred when the city name is entered
mutate {
add_field => [ "[suggest][input]", "%{name}" ]
add_field => [ "[suggest][output]", "%{name}" ]
add_field => [ "[suggest][weight]", "%{population}" ]
}
# add all the zips to the input as well
mutate {
merge => [ "[suggest][input]", zip ]
convert => [ "[suggest][weight]", "integer" ]
}
# ruby filter to put an array into the event
ruby {
}


Logstash will now write a suggest compatible data structure when indexing certain datasets. However we need to configure a template for the mapping to have the suggest feature configured in Elasticsearch as well. Therefore you also need to change the default template for Logstash in the elasticsearch output.

# change the output to this in order to include an index template
output {
elasticsearch {
host => 'localhost'
index => 'opengeodb'
index_type => "locality"
flush_size => 1000
protocol => 'http'
template_name => 'opengeodb'
template => '/path/to/opengeodb-template.json'
}
}


The template is very similar to the default Logstash template, but adds the suggest and the geo_point fields.

{
"template" : "opengeodb",
"settings" : {
"index.refresh_interval" : "5s"
},
"mappings" : {
"_default_" : {
"_all" : {"enabled" : true},
"dynamic_templates" : [ {
"string_fields" : {
"match" : "*",
"match_mapping_type" : "string",
"mapping" : {
"type" : "string",
"index" : "analyzed",
"omit_norms" : true,
"fields" : {
"raw" : {"type": "string", "index" : "not_analyzed", "ignore_above" : 256}
}
}
}
} ],
"properties" : {
"@version": { "type": "string", "index": "not_analyzed" },
"location" : { "type" : "geo_point" },
"suggest" : { "type": "completion", "payloads" : true, "analyzer" : "whitespace" }
}
}
}
}


Now it is time to delete your old data (including the index) and reindex

curl -X DELETE localhost:9200/opengeodb
cat DE.tab | logstash-1.4.0/bin/logstash -f opengeodb.conf


Next step is to actually execute a suggestion

curl -X GET 'localhost:9200/opengeodb/_suggest?pretty' -d '{
"places" : {
"text" : "B",
"completion" : {
"field" : "suggest"
}
}
}'


And this is the result

{
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"places" : [ {
"text" : "B",
"offset" : 0,
"length" : 1,
"options" : [ {
"text" : "Berlin",
"score" : 3431675.0, "payload" : {"data":["Berlin","10115","10117","10119","10178","10179","10243","10245","10247","10249","10315","10317","10318","10319","10365","10367","10369","10405","10407","10409","10435","10437","10439","10551","10553","10555","10557","10559","10585","10587","10589","10623","10625","10627","10629","10707","10709","10711","10713","10715","10717","10719","10777","10779","10781","10783","10785","10787","10789","10823","10825","10827","10829","10961","10963","10965","10967","10969","10997","10999","12043","12045","12047","12049","12051","12053","12055","12057","12059","12099","12101","12103","12105","12107","12109","12157","12159","12161","12163","12165","12167","12169","12203","12205","12207","12209","12247","12249","12277","12279","12305","12307","12309","12347","12349","12351","12353","12355","12357","12359","12435","12437","12439","12459","12487","12489","12524","12526","12527","12529","12555","12557","12559","12587","12589","12619","12621","12623","12627","12629","12679","12681","12683","12685","12687","12689","13051","13053","13055","13057","13059","13086","13088","13089","13125","13127","13129","13156","13158","13159","13187","13189","13347","13349","13351","13353","13355","13357","13359","13403","13405","13407","13409","13435","13437","13439","13442","13465","13467","13469","13503","13505","13507","13509","13581","13583","13585","13587","13589","13591","13593","13595","13597","13599","13627","13629","14050","14052","14053","14055","14057","14059","14089","14109","14129","14163","14165","14167","14169","14193","14195","14197","14199"]}
}, {
"text" : "Bremen",
"score" : 545932.0, "payload" : {"data":["Bremen","28195","28203","28205","28207","28209","28211","28213","28215","28217","28219","28237","28239","28307","28309","28325","28327","28329","28355","28357","28359","28717","28719","28755","28757","28759","28777","28779","28197","28199","28201","28259","28277","28279"]}
}, {
"text" : "Bochum",
"score" : 388179.0, "payload" : {"data":["Bochum","44787","44789","44791","44793","44795","44797","44799","44801","44803","44805","44807","44809","44866","44867","44869","44879","44892","44894"]}
}, {
"text" : "Bielefeld",
"score" : 328012.0, "payload" : {"data":["Bielefeld","33602","33604","33605","33607","33609","33611","33613","33615","33617","33619","33647","33649","33659","33689","33699","33719","33729","33739"]}
}, {
"text" : "Bonn",
"score" : 311938.0, "payload" : {"data":["Bonn","53111","53113","53115","53117","53119","53121","53123","53125","53127","53129","53173","53175","53177","53179","53225","53227","53229"]}
} ]
} ]
}


As you can see now, it makes a lot of sense to use the population of a city as weight. Using this, the bigger cities are suggested before smaller ones. The returned payload contains the name of the city and all the different zip codes, which could be used to automatically fill a form (especially if it is only one zip code is returned).

And that’s it for today! However, keep in mind that this is not about public datasets. I am pretty sure that somewhere, deeply buried inside your company, someone has gathered some insightful data which only waits to get enriched and used to improve your applications. Ask your colleagues – you will find these kinds of data sets in every company!

Also, if you find an interesting public (or internal datasets that you can talk about) let us know how you’re using it with Elasticsearch. We are always interested how you’re improved your applications with (open) data and would love to hear your story!