Enriching Your Searches With Open Geo Data
Have you ever found a neat public dataset and thought it might make sense to include it in your application to improve a specific feature, even a minor one? Sure you have! This blog post shows you how to use Logstash to get an external dataset into the desired format, Kibana to check whether it is applicable, and Elasticsearch to index the data correctly so it can do the heavy lifting in production.
Getting the data
I will assume from now on that you have a current Elasticsearch and Logstash installation available. I will use Elasticsearch 1.1.0 and Logstash 1.4.0 in these examples.
OpenGeoDB is a German website which provides German geo information in SQL and CSV formats. As we want to store the data in Elasticsearch, SQL is not an option, and neither is CSV out of the box; however, we can use Logstash to preprocess the data and index it into Elasticsearch.
The file we are going to parse is a list of German cities, including all of their zip codes. The dataset is available on the OpenGeoDB site and these files are licensed as public domain.
Data format
Taking a look at the data reveals it consists of a number of columns:
- loc_id: a unique identifier
- ags: official administrative id
- ascii: uppercase sort name
- name: name of the entity
- lat: coordinate latitude
- lon: coordinate longitude
- amt:
- plz: zip code (comma separated if there is more than one, as for bigger cities)
- vorwahl: dial code
- einwohner: population
- flaeche: area
- kz: license tag
- typ: administrative name
- level: integer which defines its locality level
- of: maps as a ‘part of’ another id
- invalid: marked as invalid
Interesting fields to play around with might be name, lat, lon, area, population and license tag. More about that soon…
Indexing the data into Elasticsearch
Using the Logstash CSV filter
Now it is time to get the data into Elasticsearch. The first step is to create a Logstash configuration, so copy this configuration into a file named opengeodb.conf. Note that we are using a filter called "csv", even though the field separator is a tab, not a comma.
input {
stdin {}
}
filter {
# Step 1, possible dropping
if [message] =~ /^#/ {
drop {}
}
# Step 2, splitting
csv {
# careful... there is a "tab" embedded in the next line:
# if you cannot copy paste it, press ctrl+V and then the tab key to create the control sequence
# or maybe just tab, depending on your editor
separator => ' '
quote_char => '|' # arbitrary, default one is included in the data and does not work
columns => [ 'id', 'ags', 'name_uc', 'name', 'lat', 'lon', 'official_description', 'zip', 'phone_area_code', 'population', 'area', 'plate', 'type', 'level', 'of', 'invalid' ]
}
# Step 3, possible dropping
if [level] != '6' {
drop {}
}
# Step 4, zip code splitting
if [zip] =~ /,/ {
mutate {
split => [ "zip", "," ]
}
}
# Step 5, lat/lon love
if [lat] and [lon] {
# move into own location object for additional geo_point type in ES
# copy field, then merge to create array for bettermap
mutate {
rename => [ "lat", "[location][lat]", "lon", "[location][lon]" ]
add_field => { "lonlat" => [ "%{[location][lon]}", "%{[location][lat]}" ] }
}
}
# Step 6, explicit conversion
mutate {
convert => [ "population", "integer" ]
convert => [ "area", "integer" ]
convert => [ "[location][lat]", "float" ]
convert => [ "[location][lon]", "float" ]
convert => [ "[lonlat]", "float" ]
}
}
output {
elasticsearch {
host => 'localhost'
index => 'opengeodb'
index_type => "locality"
flush_size => 1000
protocol => 'http'
}
}
Before going on, this is how you would call Logstash in order to index the data. You should fire up Elasticsearch first.
cat DE.tab | logstash-1.4.0/bin/logstash -f opengeodb.conf
So, there is quite a lot of stuff going on here (indexing might take a minute, depending on your hardware). The first thing you might notice: the Logstash configuration does not ingest events using a file input. The reason for this is that the file input behaves like tail -f on UNIX systems and waits for new data to be appended to the file. Our data file, however, has a defined end, so it is more appropriate to read all of its data using the stdin input.
The filter section consists of six steps. Let's go through each one and explain what it does.
Step 1 – Exclude comments
The first step drops comments, which are defined as lines starting with a hash. This is needed because the supplied tab-separated file’s first line is such a comment, which contains the names of the columns, and there is no need to index that.
Step 2 – CSV extraction
The second step does all the hard CSV work. You have to define the separator as a tab. The quote_char defaults to a double quote ("), which occurs in our data and thus has to be changed to something else. The columns setting defines the names of the columns, which are used as field names from then on.
Important: When you copy and paste the configuration file from above, the separator character might need to be replaced, as it may get copied as a couple of spaces instead of a tab. Please check that in case things do not work as expected.
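To picture what the csv filter does with a single line, here is a small plain-Python sketch using the standard csv module with the same separator and quote character. The sample lines and values are made up, and only the first six of the 16 columns are shown:

```python
import csv

# Hypothetical two-line sample in the OpenGeoDB tab-separated layout
# (made-up values; only the first six of the 16 columns are shown)
sample = ("#loc_id\tags\tascii\tname\tlat\tlon\n"
          "123\t05711000\tBIELEFELD\tBielefeld\t52.03\t8.53\n")

columns = ['id', 'ags', 'name_uc', 'name', 'lat', 'lon']

rows = []
for line in sample.splitlines():
    # step 1: drop comment lines starting with a hash
    if line.startswith('#'):
        continue
    # step 2: tab-separated parsing; '|' as quote char because the
    # default '"' occurs inside the data itself
    values = next(csv.reader([line], delimiter='\t', quotechar='|'))
    rows.append(dict(zip(columns, values)))
```

Note that everything comes out as a string at this point, which is exactly why the explicit conversion step at the end of the Logstash config is needed.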
Step 3 – Removing certain entries
We only need the entries from the CSV file which represent a city (those have level set to 6). We simply remove the rest by dropping all other events.
Step 4 – Zip code handling
The fourth step improves zip code handling. If an entry has more than one zip code (as bigger cities do), all the zip codes are contained in the field, separated by commas. In order to store those as an array and not one big string, you can use the mutate filter to split that field. Storing this data as an array of numbers would allow you to do numeric range queries, for example.
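In plain Python terms, this step does something like the following (field names follow the Logstash config above; the sample values are made up):

```python
# Sketch of the zip-splitting mutate: a comma-separated zip field
# becomes a list, single zip codes stay untouched (made-up values)
def split_zip(event):
    if ',' in event.get('zip', ''):
        event['zip'] = event['zip'].split(',')
    return event

city = split_zip({'name': 'Bochum', 'zip': '44787,44789,44791'})
village = split_zip({'name': 'Kleinort', 'zip': '12345'})
```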
Step 5 – Geo data structures
The fifth step caters a little better for geo data. When reading from the DE.tab input file, lat and lon fields are created. However, those fields only have meaning when they are stored together. This step stores both fields in two data structures. One looks like an Elasticsearch geo_point and results in a { "location" : { "lat": x, "lon": y }} structure. The other is a simple array containing the longitude and latitude (in that order!), so we can use a Kibana bettermap to draw it.
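A small Python sketch of the two resulting data structures, including the float conversion from the last filter step (the coordinates are made up):

```python
# Sketch of the two geo structures built in steps 5 and 6
# (made-up coordinates; field names follow the Logstash config)
def add_geo(event):
    if 'lat' in event and 'lon' in event:
        lat = float(event.pop('lat'))
        lon = float(event.pop('lon'))
        event['location'] = {'lat': lat, 'lon': lon}  # Elasticsearch geo_point
        event['lonlat'] = [lon, lat]                  # lon first, for bettermap
    return event

e = add_geo({'name': 'Berlin', 'lat': '52.52', 'lon': '13.40'})
```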
Step 6 – Explicit field conversion
The last filter step is to explicitly set the data types of some fields, allowing numeric operations on those in Elasticsearch later.
The output section uses features only available in Logstash 1.4 and later, so make sure you use at least that version. In earlier versions you had to specify the elasticsearch_http output explicitly. Going forward, there is only one elasticsearch output, and you can specify protocol => http to use HTTP over port 9200 to communicate with Elasticsearch.
Using Kibana & Elasticsearch to gather insights
After having indexed some data, we can use Kibana to get further insights. Using a bettermap widget and a small search query like population:[100000 TO *] we can show every larger city in Germany.
You can use this for a couple of things:
- Find the cities with the highest populations
- Find out which cities use a certain license plate code (for example GT for Gütersloh and surroundings)
- Use a script aggregation to find the areas with the most dense or sparse population per square kilometer – you could alternatively precalculate this in Logstash as well
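If you prefer the query DSL over the Lucene query string syntax, the population filter above roughly corresponds to this request body (a sketch only; it is not sent to Elasticsearch here):

```python
# Hypothetical query DSL equivalent of population:[100000 TO *]
# (field name as used in this post; not executed against a cluster here)
query = {
    "query": {
        "range": {
            "population": {"gte": 100000}
        }
    }
}
```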
Even though this is nice, it does not help to improve existing applications. We need to go deeper.
Adding suggestions
So, let's step back for a moment and check what useful things one could possibly do with this data. We've got cities and postal codes… and there are plenty of occasions in web applications where one has to enter exactly this data.
A good example is the checkout process of a purchase, as not every system already has all of its customers' data saved; you may be the kind of shop that frequently processes one-time orders or simply allows orders without user registration. In this case, it might make sense to help the customer speed up the order process, with the bonus of preventing order loss or cancellation due to a difficult purchase flow.
Elasticsearch has a very fast prefix suggester called the completion suggester. This suggester has the disadvantage that you need to enrich your data a bit before indexing it, but this is exactly what Logstash is for. In order to best understand this example, you may want to take a look at this introduction to the completion suggester.
The completion suggester
So, let's decide that we want to help users type in the city they live in. We also want to provide a list of zip codes, so it is easier to find the right one once the correct city has been selected. You could also do this the other way around and let the user type the zip code first, then help by autofilling the city information.
Time to add a couple of things to the Logstash configuration to make this work. Let's start with the easy part, the configuration inside of the filter. Add this snippet to your opengeodb.conf, right after step 5 and before step 6:
# Step 5 and a half
# create a prefix completion field data structure
# input can be any of the zips or the name field
# weight is the population, so big cities are preferred when the city name is entered
mutate {
add_field => [ "[suggest][input]", "%{name}" ]
add_field => [ "[suggest][output]", "%{name}" ]
add_field => [ "[suggest][payload][name]", "%{name}" ]
add_field => [ "[suggest][weight]", "%{population}" ]
}
# add all the zips to the input as well
mutate {
merge => [ "[suggest][input]", "zip" ]
convert => [ "[suggest][weight]", "integer" ]
}
# ruby filter to put an array into the event
ruby {
code => 'event["[suggest][payload][data]"] = event["zip"]'
}
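To make the resulting structure easier to picture, here is a plain-Python sketch of what these filters add to an event (sample values are made up; the exact payload contents may differ slightly from what Logstash produces):

```python
# Plain-Python sketch of the suggest structure built above
# (made-up values; weight has to be an integer for the completion suggester)
def add_suggest(event):
    zips = event['zip'] if isinstance(event['zip'], list) else [event['zip']]
    event['suggest'] = {
        'input': [event['name']] + zips,     # typing a name or a zip matches
        'output': event['name'],             # what the user gets displayed
        'weight': int(event['population']),  # bigger cities rank higher
        'payload': {'name': event['name'], 'data': zips},
    }
    return event

e = add_suggest({'name': 'Bonn', 'zip': ['53111', '53113'],
                 'population': '311938'})
```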
Logstash will now write a suggest-compatible data structure when indexing. However, we also need to configure an index template so the suggest feature is mapped correctly in Elasticsearch. Therefore you need to change the default template for Logstash in the elasticsearch output.
# change the output to this in order to include an index template
output {
elasticsearch {
host => 'localhost'
index => 'opengeodb'
index_type => "locality"
flush_size => 1000
protocol => 'http'
template_name => 'opengeodb'
template => '/path/to/opengeodb-template.json'
}
}
The template is very similar to the default Logstash template, but adds the suggest
and the geo_point
fields.
{
"template" : "opengeodb",
"settings" : {
"index.refresh_interval" : "5s"
},
"mappings" : {
"_default_" : {
"_all" : {"enabled" : true},
"dynamic_templates" : [ {
"string_fields" : {
"match" : "*",
"match_mapping_type" : "string",
"mapping" : {
"type" : "string",
"index" : "analyzed",
"omit_norms" : true,
"fields" : {
"raw" : {"type": "string", "index" : "not_analyzed", "ignore_above" : 256}
}
}
}
} ],
"properties" : {
"@version": { "type": "string", "index": "not_analyzed" },
"location" : { "type" : "geo_point" },
"suggest" : { "type": "completion", "payloads" : true, "analyzer" : "whitespace" }
}
}
}
}
Now it is time to delete your old data (including the index) and reindex:
curl -X DELETE localhost:9200/opengeodb
cat DE.tab | logstash-1.4.0/bin/logstash -f opengeodb.conf
The next step is to actually execute a suggestion:
curl -X GET 'localhost:9200/opengeodb/_suggest?pretty' -d '{
"places" : {
"text" : "B",
"completion" : {
"field" : "suggest"
}
}
}'
And this is the result:
{
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"places" : [ {
"text" : "B",
"offset" : 0,
"length" : 1,
"options" : [ {
"text" : "Berlin",
"score" : 3431675.0, "payload" : {"data":["Berlin","10115","10117","10119","10178","10179","10243","10245","10247","10249","10315","10317","10318","10319","10365","10367","10369","10405","10407","10409","10435","10437","10439","10551","10553","10555","10557","10559","10585","10587","10589","10623","10625","10627","10629","10707","10709","10711","10713","10715","10717","10719","10777","10779","10781","10783","10785","10787","10789","10823","10825","10827","10829","10961","10963","10965","10967","10969","10997","10999","12043","12045","12047","12049","12051","12053","12055","12057","12059","12099","12101","12103","12105","12107","12109","12157","12159","12161","12163","12165","12167","12169","12203","12205","12207","12209","12247","12249","12277","12279","12305","12307","12309","12347","12349","12351","12353","12355","12357","12359","12435","12437","12439","12459","12487","12489","12524","12526","12527","12529","12555","12557","12559","12587","12589","12619","12621","12623","12627","12629","12679","12681","12683","12685","12687","12689","13051","13053","13055","13057","13059","13086","13088","13089","13125","13127","13129","13156","13158","13159","13187","13189","13347","13349","13351","13353","13355","13357","13359","13403","13405","13407","13409","13435","13437","13439","13442","13465","13467","13469","13503","13505","13507","13509","13581","13583","13585","13587","13589","13591","13593","13595","13597","13599","13627","13629","14050","14052","14053","14055","14057","14059","14089","14109","14129","14163","14165","14167","14169","14193","14195","14197","14199"]}
}, {
"text" : "Bremen",
"score" : 545932.0, "payload" : {"data":["Bremen","28195","28203","28205","28207","28209","28211","28213","28215","28217","28219","28237","28239","28307","28309","28325","28327","28329","28355","28357","28359","28717","28719","28755","28757","28759","28777","28779","28197","28199","28201","28259","28277","28279"]}
}, {
"text" : "Bochum",
"score" : 388179.0, "payload" : {"data":["Bochum","44787","44789","44791","44793","44795","44797","44799","44801","44803","44805","44807","44809","44866","44867","44869","44879","44892","44894"]}
}, {
"text" : "Bielefeld",
"score" : 328012.0, "payload" : {"data":["Bielefeld","33602","33604","33605","33607","33609","33611","33613","33615","33617","33619","33647","33649","33659","33689","33699","33719","33729","33739"]}
}, {
"text" : "Bonn",
"score" : 311938.0, "payload" : {"data":["Bonn","53111","53113","53115","53117","53119","53121","53123","53125","53127","53129","53173","53175","53177","53179","53225","53227","53229"]}
} ]
} ]
}
As you can see, it makes a lot of sense to use the population of a city as the weight: bigger cities are suggested before smaller ones. The returned payload contains the name of the city and all of its zip codes, which can be used to automatically fill a form (especially if only one zip code is returned).
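On the application side, consuming such a response could look like this minimal Python sketch (the response dict is a trimmed-down version of the output above; top_suggestion is a hypothetical helper, not part of any client library):

```python
# Hypothetical helper: pick the top completion option and prefill
# form fields; the response dict is a trimmed version of the example above
def top_suggestion(response, name='places'):
    options = response[name][0]['options']
    if not options:
        return None
    best = options[0]
    return {'city': best['text'], 'zips': best['payload']['data']}

response = {'places': [{'text': 'B', 'options': [
    {'text': 'Berlin', 'score': 3431675.0,
     'payload': {'data': ['10115', '10117']}}]}]}
```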
And that's it for today! Keep in mind, though, that this is not limited to public datasets. I am pretty sure that somewhere, deeply buried inside your company, someone has gathered insightful data which is just waiting to be enriched and used to improve your applications. Ask your colleagues – you will find these kinds of datasets in every company!
Also, if you find an interesting public dataset (or an internal one you are allowed to talk about), let us know how you're using it with Elasticsearch. We are always interested in how you've improved your applications with (open) data and would love to hear your story!