29 janvier 2015

Numeric Aggregations: An Exploration of UK Housing Data

Par Colin Goodheart-Smithe

The Elasticsearch aggregations series continues! In the previous blog posts, we discovered how to create aggregation requests, and how to embed aggregations within other aggregations. Today, we will explore some of the aggregations available for numeric data.

Since this is a developer-focused series, we suggest you follow along at /blog/numeric-aggregations-an-exploration-of-uk-housing-datahome, so that you can play with the data and the various code examples that we show. For this article, we are going to use the UK Land Registry 2014 dataset. This dataset includes details of every house sale in the UK for 2014.

To follow along with this blog you will need Elasticsearch version 1.3.0 or later.

To follow along, you'll need to restore a Snapshot into your local cluster. The snapshot is about 200MB, and it may take some time depending on your connection:

# Register the land registry Repository
PUT /_snapshot/demo_uk_landregistry_data
{
  "type": "url",
  "settings": {
    "url": "http://data.elasticsearch.org/blogs/data/snapshots/demo_uk_landregistry_data/demo_uk_landregistry_data"
  }
}
# (Optional) Inspect the repository to view available snapshots
GET /_snapshot/demo_uk_landregistry_data/_all
# Restore the snapshot into your cluster
POST /_snapshot/demo_uk_landregistry_data/demo_uk_landregistry_data/_restore
# Watch the download progress.
GET /housesales/_recovery

Once your cluster has finished restoring the Snapshot, let's perform a simple search to see what the data holds:

GET housesales/_search
{
   "_shards": {...},
   "hits": {
      "total": 646386,
      "max_score": 1,
      "hits": [
         {
            "_index": "housesales",
            "_type": "housesale",
            "_id": "AUsHjsE4-ULRehTiURN7",
            "_score": 1,
            "_source": {
               "town": "DUNSTABLE",
               "status": "A",
               "location": {
                  "lat": 51.882769709,
                  "lon": -0.513041822
               },
               "district": "CENTRAL BEDFORDSHIRE",
               "locality": "",
               "price": 140000,
               "housetype": "Terraced",
               "oldnew": "N",
               "county": "CENTRAL BEDFORDSHIRE",
               "duration": "Freehold",
               "street": "GREAT NORTHERN ROAD",
               "postcode": "LU54BN",
               "date": "2014-06-19 00:00",
               "paon": "43",
               "saon": ""
            }
         },
         ...
      ]
   }
}

Each document in our index represents an individual property sale in the UK and contains various metadata about the sale such as the price paid, the date of the sale, whether the sale was freehold or leasehold, and various information indicating the location of the property. There are around 650,000 documents in the index so there should be plenty of data to use to identify interesting trends.

Let's start our investigation by looking at the dataset as a whole:

GET housesales/_search?search_type=count
{
  "aggs": {
    "house_price_stats": {
      "stats": {
        "field": "price"
      }
    }
  }
}

Here, we are simply asking for some basic statistics about the price of sales as a whole. This gives us the following response:

{
   ...
   "hits": {
      "total": 646386,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "house_price_stats": {
         "count": 646386,
         "min": 7000,
         "max": 50000000,
         "avg": 257876.91169053785,
         "sum": 166688025440
      }
   }
}

So, Elasticsearch has reported back the count (the number of sales) and the minimum, maximum, average (mean), and sum of the sales. We could do the same thing by using the min, max, avg, sum, and value_count aggregations, but the stats aggregation provides a convenient way to return all this information together.

Note: There is also an 'extended_stats' aggregation that provides even more statistics such as standard_deviation and variance.

We can see from the above response that property sales in the UK in 2014 totalled over £166 billion (over $250 billion), with a mean average sale of around £260,000 ($390,000). But this aggregation doesn't tell us much about the distribution of house prices. How much are the few very expensive houses pulling up the mean average value. To answer this we can use the percentiles aggregation.

A percentile indicates the value below which a given percentage of observations fall. The 75th percentile of house sale would be the value of a house sale which 75% of all house sales fall below.
GET housesales/_search?search_type=count
{
  "aggs": {
    "house_price_stats": {
      "stats": {
        "field": "price"
      }
    },
    "house_price_percentiles": {
      "percentiles": {
        "field": "price",
        "percents": [
          50
        ]
      }
    }
  }
}

So now we have added a new aggregation called 'house_price_percentiles' to our previous request and we are using it to ask for the maximum value of a sale that would fit into the first 50% of our sales data (also known as the 50th percentile or the median value).

Although this would be quite useful on its own, before we run the request, let's add some other percentiles to our request

GET housesales/_search?search_type=count
{
  "aggs": {
    "house_price_stats": {
      "stats": {
        "field": "price"
      }
    },
    "house_price_percentiles": {
      "percentiles": {
        "field": "price",
        "percents": [
          1,
          25,
          50,
          75,
          99
        ]
      }
    }
  }
}

Now, we are asking for values for not only the 50th percentile but also the 1st, 25th, 75th and 99th percentiles. When we run this request we get back the following:

{
   ...
   "aggregations": {
      "house_price_percentiles": {
         "values": {
            "1.0": 45000.204685714285,
            "25.0": 126873.78905788704,
            "50.0": 188892.36462063208,
            "75.0": 290076.7973567194,
            "99.0": 1296500.8114525543
         }
      },
      "house_price_stats": {
         "count": 646386,
         "min": 7000,
         "max": 50000000,
         "avg": 257876.91169053785,
         "sum": 166688025440
      }
   }
}

This shows us that the median sale value is only ~£189,000, and the mean value is actually closer to the 75th percentile than the median, showing that there are a small number of sales at the top end of the price scale which significantly skew the mean average sale value.

So we know that the mean average sale value is much higher than the median sale value, but what percentage of houses were sold for lower than the mean average value? To answer this question we could keep adding percentiles to our above request until we get a result close to the mean value, but this feels a lot like shooting in the dark'. Luckily there is an aggregation which does the reverse of the percentiles aggregation, the percentile ranks aggregation. We can now ask the percentile ranks aggregation to calculate the percentile which represents our mean sale value in our dataset.

GET housesales/_search?search_type=count
{
  "aggs": {
    "house_price_percentile_ranks": {
      "percentile_ranks": {
        "field": "price",
        "values": [
          257876
        ]
      }
    }
  }
}

Which gives us the following response:

{
   ...
   "aggregations": {
      "house_price_percentile_ranks": {
         "values": {
            "257876.0": 69.8520536775868
         }
      }
   }
}

We can see now that roughly two thirds of houses are sold under the mean average sale value.

Aside: UK Stamp Duty

In the UK every property sale above £125,000 is subject to a tax paid by the buyer, which is called the stamp duty land tax. The amount you pay is determined by the purchase amount and is divided into bands. If the value of the purchase is greater than the threshold for a band, you have to pay the bands percentage of the entire value of the property. The bands for 2014 were as follows:

Threshold Value
Stamp duty rate
£0 0%
£125,000 1%
£250,000 3%
£500,000 4%
£1,000,000 5%
£2,000,000 7%

So for the maximum sale in our dataset which was £50,000,000 the stamp duty tax would be:

£50,000,000 * 0.07 = £3,500,000

Note: The way stamp duty is calculated in the UK changed to a new system in December 2014

Using the percentile ranks aggregation we can determine what percentage of UK house sales in 2014 were affected by each stamp duty threshold.

GET housesales/_search?search_type=count
{
  "aggs": {
    "house_price_percentile_ranks": {
      "percentile_ranks": {
        "field": "price",
        "values": [
          125000,
          250000,
          500000,
          1000000,
          2000000
        ]
      }
    }
  }
}

which produces the following result:

{
   ...
   "aggregations": {
      "house_price_percentile_ranks": {
         "values": {
            "125000.0": 24.08112180647477,
            "250000.0": 68.51649014675442,
            "500000.0": 92.08539398815546,
            "1000000.0": 98.36604199119392,
            "2000000.0": 99.6443848633277
         }
      }
   }
}

Almost a quarter of all house sales in the UK in 2014 did not have to pay any stamp duty and over two thirds of sales had to pay either no stamp duty or at the 1% rate.

Now that we've looked at the dataset as a whole, let's try to work out how these findings change in different counties of the UK. For example, which is the most expensive county in terms of house prices?

Here we can make use of a very useful feature in the terms aggregation, the ability to sort the returned buckets by a sub-metric aggregation.

First, lets start with a naive approach to answering this question, by asking for the top 5 UK counties sorted by the maximum sale price in descending order:

GET housesales/_search?search_type=count
{
  "aggs": {
    "counties": {
      "terms": {
        "field": "county.raw",
        "size": 5,
        "order": {
          "house_price_stats[max]": "desc"
        }
      },
      "aggs": {
        "house_price_stats": {
          "stats": {
            "field": "price"
          }
        },
        "house_price_percentiles": {
          "percentiles": {
            "field": "price",
            "percents": [
              1,
              25,
              50,
              75,
              99
            ]
          }
        },
        "stamp_duty_bands": {
          "percentile_ranks": {
            "field": "price",
            "values": [
              125000,
              250000,
              500000,
              1000000,
              2000000
            ]
          }
        }
      }
    }
  }
}

This request uses components we've already encountered: terms, percentiles, etc. But you'll notice something new about the terms aggregation: the order clause. This is used to indicate that we want the buckets returned from the aggregation ordered by something other than descending document count. In this instance, we want to order the county terms by the value of the 'max' field in the 'house_price_stats' aggregation and take the top 5 buckets (as stated by the 'size' parameter).

The result is the statistics for the 5 counties with the highest maximum sales in 2014.

{
   ...
   "aggregations": {
      "counties": {
         "doc_count_error_upper_bound": -1,
         "sum_other_doc_count": 511421,
         "buckets": [
            {
               "key": "GREATER LONDON",
               "doc_count": 84256,
               "house_price_percentiles": {
                  "values": {
                     "1.0": 110000.00000000001,
                     "25.0": 250000,
                     "50.0": 361328.9936144069,
                     "75.0": 556307.7822199536,
                     "99.0": 3076310.5196182355
                  }
               }, 
               "house_price_stats": {
                  "count": 84256,
                  "min": 50881,
                  "max": 50000000,
                  "avg": 525786.4564660083,
                  "sum": 44300663676
               }
            },
            {
               "key": "KENT",
               "doc_count": 19843,
               "house_price_percentiles": {
                  "values": {
                     "1.0": 65000,
                     "25.0": 160081.13354497356,
                     "50.0": 214969.56200753682,
                     "75.0": 294245.9144186047,
                     "99.0": 993350.5076923068
                  }
               },
               "house_price_stats": {
                  "count": 19843,
                  "min": 26250,
                  "max": 49595000,
                  "avg": 261625.75774832434,
                  "sum": 5191439911
               }
            },
            {
               "key": "SURREY",
               "doc_count": 16168,
               "house_price_percentiles": {
                  "values": {
                     "1.0": 97835,
                     "25.0": 249709.7963936475,
                     "50.0": 341767.1243912337,
                     "75.0": 516881.80511538463,
                     "99.0": 1990000
                  }
               },
               "house_price_stats": {
                  "count": 16168,
                  "min": 12000,
                  "max": 12500000,
                  "avg": 448801.1140524493,
                  "sum": 7256216412
               }
            },
            {
               "key": "OXFORDSHIRE",
               "doc_count": 7944,
               "house_price_percentiles": {
                  "values": {
                     "1.0": 90620,
                     "25.0": 212276.70833333334,
                     "50.0": 274681.9068736142,
                     "75.0": 380306.0258838384,
                     "99.0": 1392833.3333333284
                  }
               },
               "house_price_stats": {
                  "count": 7944,
                  "min": 38750,
                  "max": 10000000,
                  "avg": 345101.6182024169,
                  "sum": 2741487255
               }
            },
            {
               "key": "BUCKINGHAMSHIRE",
               "doc_count": 6754,
               "house_price_percentiles": {
                  "values": {
                     "1.0": 96000,
                     "25.0": 210044.27898321193,
                     "50.0": 302701.05212355213,
                     "75.0": 471529.1039576365,
                     "99.0": 1639470.0000000005
                  }
               },
               "house_price_stats": {
                  "count": 6754,
                  "min": 36000,
                  "max": 7075000,
                  "avg": 395626.3212910868,
                  "sum": 2672060174
               }
            }
         ]
      }
   }
}

So it looks like Greater London is the most expensive place to live. However, the problem with sorting the data this way is that although Greater London certainly had the highest priced sale in 2014, it doesn't tell us anything about the other house sales. That one sale is likely to skew the data a lot in favour of the Greater London area. So Greater London certainly has the most expensive address sold in the UK in 2014 but as a county as a whole, is it really at the top of the list?

To answer this we could sort the terms aggregation by the mean average, but, as we discussed previously, this can be (and from our previous findings, is) skewed heavily by the high-end sales. A better metric here might be the median value. This will essentially tell us at what price the 'middle' sale for 2014 was for each county.

To do this, we sort the terms aggregation based on the 50th percentile of that term, rather than the max:

GET housesales/_search?search_type=count
{
  "aggs": {
    "counties": {
      "terms": {
        "field": "county.raw",
        "size": 5,
        "order": {
          "house_price_percentiles[50]": "desc"
        }
      },
      "aggs": {
        "house_price_stats": {
          "stats": {
            "field": "price"
          }
        },
        "house_price_percentiles": {
          "percentiles": {
            "field": "price",
            "percents": [
              1,
              25,
              50,
              75,
              99
            ]
          }
        },
        "stamp_duty_bands": {
          "percentile_ranks": {
            "field": "price",
            "values": [
              125000,
              250000,
              500000,
              1000000,
              2000000
            ]
          }
        }
      }
    }
  }
}

Now we can see the order of the counties changes quite a bit:

{
   ...
   "aggregations": {
      "counties": {
         "doc_count_error_upper_bound": -1,
         "sum_other_doc_count": 535352,
         "buckets": [
            {
               "key": "WINDSOR AND MAIDENHEAD",
               "doc_count": 1774,
               "house_price_percentiles": {
                  "values": {
                     "1.0": 131387.5,
                     "25.0": 283679.3706293706,
                     "50.0": 375076.70454545453,
                     "75.0": 535702.1296296297,
                     "99.0": 2251350
                  }
               },
               "house_price_stats": {
                  "count": 1774,
                  "min": 46250,
                  "max": 5300000,
                  "avg": 479086.5941375423,
                  "sum": 849899618
               },
               "stamp_duty_bands": {
                  "values": {
                     "125000.0": 0.9582863585118376,
                     "250000.0": 18.207440811724915,
                     "500000.0": 72.71709605768585,
                     "1000000.0": 95.03945885005636,
                     "2000000.0": 98.81623506200113
                  }
               }
            },
            {
               "key": "GREATER LONDON",
               "doc_count": 84256,
               "house_price_percentiles": {
                  "values": {
                     "1.0": 110000,
                     "25.0": 250000,
                     "50.0": 361150.5580953402,
                     "75.0": 556310.594361639,
                     "99.0": 3076195.800751876
                  }
               },
               "house_price_stats": {
                  "count": 84256,
                  "min": 50881,
                  "max": 50000000,
                  "avg": 525786.4564660083,
                  "sum": 44300663676
               },
               "stamp_duty_bands": {
                  "values": {
                     "125000.0": 2.0224078997341435,
                     "250000.0": 26.231959741739463,
                     "500000.0": 70.54815638375172,
                     "1000000.0": 91.68715171319795,
                     "2000000.0": 97.83517054664898
                  }
               }
            },
            {
               "key": "SURREY",
               "doc_count": 16168,
               "house_price_percentiles": {
                  "values": {
                     "1.0": 97890,
                     "25.0": 249780.1736677116,
                     "50.0": 341882.1553970223,
                     "75.0": 516708.2499209361,
                     "99.0": 1990000
                  }
               },
               "house_price_stats": {
                  "count": 16168,
                  "min": 12000,
                  "max": 12500000,
                  "avg": 448801.1140524493,
                  "sum": 7256216412
               },
               "stamp_duty_bands": {
                  "values": {
                     "125000.0": 2.3070262246412665,
                     "250000.0": 28.315190499752596,
                     "500000.0": 73.96239802319863,
                     "1000000.0": 94.69347101341646,
                     "2000000.0": 99.078426521524
                  }
               }
            },
            {
               "key": "WOKINGHAM",
               "doc_count": 2082,
               "house_price_percentiles": {
                  "values": {
                     "1.0": 107715,
                     "25.0": 248710.41666666663,
                     "50.0": 320477.2727272727,
                     "75.0": 429986.1764705882,
                     "99.0": 1176840.000000002
                  }
               },
               "house_price_stats": {
                  "count": 2082,
                  "min": 45000,
                  "max": 3300000,
                  "avg": 366122.4620557157,
                  "sum": 762266966
               },
               "stamp_duty_bands": {
                  "values": {
                     "125000.0": 2.2574447646493754,
                     "250000.0": 28.914505283381363,
                     "500000.0": 85.01440922190201,
                     "1000000.0": 98.52605667627282,
                     "2000000.0": 99.82118676434847
                  }
               }
            },
            {
               "key": "BUCKINGHAMSHIRE",
               "doc_count": 6754,
               "house_price_percentiles": {
                  "values": {
                     "1.0": 96000,
                     "25.0": 210042.35757575755,
                     "50.0": 301217.14015928353,
                     "75.0": 471271.60837438423,
                     "99.0": 1639470.0000000005
                  }
               },
               "house_price_stats": {
                  "count": 6754,
                  "min": 36000,
                  "max": 7075000,
                  "avg": 395626.3212910868,
                  "sum": 2672060174
               },
               "stamp_duty_bands": {
                  "values": {
                     "125000.0": 3.849570624814924,
                     "250000.0": 39.19159016878886,
                     "500000.0": 78.51659370525427,
                     "1000000.0": 96.06172164854438,
                     "2000000.0": 99.55581877405982
                  }
               }
            }
         ]
      }
   }
}

Now, Greater London and Surrey are still in the top 5, but have changed position. Most notably, Windsor and Maidenhead has got to the top of the list where it hadn't made the top 5 previously. The median value is higher than Greater London suggesting that although London has more expensive properties than Windsor and Maidenhead, it does not have as high a concentration of expensive housing. We can also see that where the national data suggested that about a quarter of house sales are not subject to stamp duty tax, in Windsor and Maidenhead less than 1% of house sales avoid the tax.

Conclusion

In this article we have explored how numeric aggregations can be used to explore a dataset. We have seen how the stats aggregation can be used to gain a rudimentary understanding of the data, and how this can be furthered by combining it with the percentiles and percentile ranks aggregations.