Engineering

# Enriching data with GeoIPs from internal, private IP addresses

For public IPs, you can build lookup tables that map specific IP ranges to cities. A large portion of traffic, however, never touches public address space: company private networks use addresses from the RFC 1918 ranges 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16, and those same ranges are reused in every country in the world. Because these addresses carry no inherent geographic information, the geoip filter/processor built into Elasticsearch and Logstash won’t work with them.
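If you want to check programmatically whether an address falls in one of these private ranges, Python's standard ipaddress module makes it a one-liner; a minimal sketch:

```python
import ipaddress

# The RFC 1918 private ranges mentioned above
PRIVATE_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def is_rfc1918(ip: str) -> bool:
    """Return True if ip belongs to one of the RFC 1918 private ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PRIVATE_NETS)

print(is_rfc1918("10.4.54.6"))  # True
print(is_rfc1918("8.8.8.8"))    # False
```

These are exactly the addresses a public GeoIP database can tell you nothing useful about.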

Elasticsearch and Logstash both have an option to point the geoip processor/filter at a custom database file (database/database_file), so in theory you could build this yourself. However, that is time consuming to build and costly to maintain, not to mention you would potentially have to learn a new set of tooling just to produce the .mmdb files (a topic for a later blog).

An easier method uses another tool already built into Elasticsearch: the enrich processor, which can (very unsurprisingly) enrich our documents with any kind of data we want, including geo data. Let’s walk through an example of doing just that.

## Example: Enriching private IPs with continent, country, city

For this example, I will only go down to the level of cities in an arbitrary selection of countries. It should be easy to see how the same approach can be made as coarse or as fine as your use case requires: from continents or countries all the way down to villages, or even specific buildings.

PUT private_geoips
{
  "mappings": {
    "properties": {
      "city_name": {
        "type": "keyword"
      },
      "continent_name": {
        "type": "keyword"
      },
      "country_iso_code": {
        "type": "keyword"
      },
      "country_name": {
        "type": "keyword"
      },
      "location": {
        "type": "geo_point"
      },
      "source.ip": {
        "type": "ip"
      }
    }
  }
}


To implement something like this, we need control over which private IPs are used at which of our office locations. Given the nature of private networks, there is always the chance of overlaps, so this only works where subnet assignments are coordinated. For this small example, imagine we control a subnet for each of our offices, assigned as below:

- 10.4.0.0/16 - Pretoria, South Africa
- 10.6.0.0/16 - Berlin, Germany
- 10.7.0.0/16 - Tokyo, Japan

POST private_geoips/_bulk
{"index":{"_id":"pretoria-south-africa"}}
{"city_name":"Pretoria","continent_name":"Africa","country_iso_code":"SA","country_name":"South Africa","location":[28.21837,-25.73134],"source.ip":["10.4.54.6","10.4.54.7","10.4.54.8"]}
{"index":{"_id":"berlin-germany"}}
{"city_name":"Berlin","continent_name":"Europe","country_iso_code":"DE","location":[13.404954,52.520008],"country_name":"Germany","region_name":"","source.ip":["10.6.132.43","10.6.132.44"]}
{"index":{"_id":"tokyo-japan"}}
{"city_name":"Tokyo","continent_name":"Asia","country_iso_code":"JP","country_name":"Japan","location":[139.839478,35.652832],"source.ip":["10.7.1.76"]}


All of the documents we store in the private_geoips index must contain the fields listed in the enrich_fields array of our policy, as well as the match_field.

PUT _enrich/policy/private_geoips_policy
{
  "match": {
    "indices": "private_geoips",
    "match_field": "source.ip",
    "enrich_fields": ["city_name", "continent_name", "country_iso_code", "country_name", "location"]
  }
}


Executing the policy builds the internal enrich index that the processor reads from, so we need to execute it once after creating the policy and re-execute it whenever the source index changes.

POST /_enrich/policy/private_geoips_policy/_execute


Next, let's create our ingest pipeline. The dot_expander processor turns the literal source.ip field name into a nested object so the enrich processor can match on it, and the final script removes the duplicated source.ip value that the enrich processor copies into the geo target field.

PUT /_ingest/pipeline/private_geoips
{
  "description": "Enrich private source IPs with geo data",
  "processors": [
    {
      "dot_expander": {
        "field": "source.ip"
      }
    },
    {
      "enrich": {
        "policy_name": "private_geoips_policy",
        "field": "source.ip",
        "target_field": "geo",
        "max_matches": "1"
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "if (ctx.geo != null) { ctx.geo.remove('source.ip') }"
      }
    }
  ]
}
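To make the dot_expander step concrete, here is an illustrative Python sketch of what it does to a document; this is a conceptual model, not the actual implementation inside Elasticsearch:

```python
def dot_expand(doc, field):
    """Turn a literal dotted field name into a nested object, e.g.
    {"source.ip": x} becomes {"source": {"ip": x}}, so downstream
    processors can address the value as source.ip."""
    if field in doc:
        parts = field.split(".")
        value = doc.pop(field)
        node = doc
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return doc

print(dot_expand({"source.ip": "10.7.1.76"}, "source.ip"))
# {'source': {'ip': '10.7.1.76'}}
```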


Finally, let's test our pipeline with the _simulate API.

POST /_ingest/pipeline/private_geoips/_simulate
{
  "docs": [
    {
      "_source": {
        "source.ip": "10.7.1.76"
      }
    },
    {
      "_source": {
        "source.ip": "10.4.54.7"
      }
    }
  ]
}


And that’s it! Your private IP addresses are now enriched with geolocation data stored in the private_geoips index. Now all you need to do is maintain that lookup index.
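Conceptually, what the enrich processor is doing at ingest time is an exact-match lookup from source.ip to the stored geo fields. The Python sketch below is illustrative only (the real lookup runs inside Elasticsearch against the enrich index), but it captures the behavior you should expect:

```python
# A tiny slice of the lookup data from our private_geoips index
LOOKUP = {
    "10.7.1.76": {"city_name": "Tokyo", "country_iso_code": "JP"},
    "10.4.54.7": {"city_name": "Pretoria", "country_iso_code": "ZA"},
}

def enrich(doc):
    """Exact-match enrichment: if source.ip is known, attach the geo
    fields under the geo target field; otherwise leave the doc unchanged."""
    match = LOOKUP.get(doc.get("source.ip"))
    if match:                     # max_matches: 1, so at most one match
        doc["geo"] = dict(match)  # target_field: geo
    return doc

print(enrich({"source.ip": "10.4.54.7"})["geo"]["city_name"])  # Pretoria
```

Note that the lookup is exact, not a subnet match: every host IP you want enriched must appear in the source.ip array of some document in the lookup index.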

## More to come

This post showed a simple way to enrich private IPs with geolocation data. At Elastic, we’re working on making this enrichment even easier with range queries; you can track that progress in issue #48988. Once that code is completed and merged, we’ll be able to enrich by IP range rather than exact match, which will significantly reduce the work needed to maintain our private_geoips index. I will write a follow-up once that happens.

In the meantime, you can try out this enrichment method in your existing environment, or spin up a free trial of Elastic Cloud and test it out with the sample data provided in the example above. Enjoy!