2015年10月29日

Modeling data for fast aggregations

By Adrien Grand

When wanting to improve search performance, one common reaction is often to start looking at what settings can be tuned in order to make elasticsearch perform better. Unfortunately, even when successful, this approach rarely improves performance by more than a few percent. On the other hand changing the way that documents are modeled can have a dramatic impact on performance.

index.png For instance imagine that you want to find all documents that contain a given 3-letters substring (eg. “ump” would find documents containing “jumps”). The naive approach would be to to index documents with the default configuration and then search using a wildcard query. Unfortunately, the wildcard query is one of the slowest queries that is exposed in Elasticsearch. On the other hand, you could change your analysis chain in order to index every substring using the n-gram tokenizer (so that “jumps” would generate three tokens “jum”, “ump” and "mps") and the wildcard query could be replaced with a simple term query, which performs much faster.

The general advice is that you should model your data in order to make search operations as lightweight as possible. A corollary is that you cannot make good data modeling decisions without knowing the queries that your users run. And what we just said about queries can also be applied to aggregations: we will look at an example in order to see how this could work. Imagine that you allow your users to search real-estate listings documents which might look like this:

{
 “location”: [-0.37, 49.2],
 “price_euros”: 188000,
 “surface_m2”: 132,
 “has_photo”: true,
 “has_swimming_pool”: false,
 “has_balcony”: true
}

And in order to give your users insights about the listings, you might use an aggregation that would look like this:

{
  “aggs”: {
    “location_grid”: {
      “geohash_grid”: {
        “field”: “location”,
        “precision”: 3
      }
    }
  },
  “price_ranges”: {
    “range”: {
      “field”: “price_euros”,
      “ranges”: [
        {“to”: 20000},
        {“from”:20000, “to”: 50000},
        {“from”:50000, “to”: 100000},
        {“from”:100000, “to”: 200000},
        {“from”:200000}
      ]
    }
  },
  “surface_histo”: {
    “histogram”: {
      “field”: “surface_m2”,
      “interval”: 20
    }
  },
  “has_photo_terms”: {
    “terms”: {
      “field”: “has_photo”
    }
  },
  “has_swimming_pool_terms”: {
    “terms”: {
      “field”: “has_swimming_pool”
    }
  },
  “has_balcony_terms”: {
    “terms”: {
      “field”: “has_balcony”
    }
  }
}

Pre-compute data for aggregations

If most requests would use the same ranges for the price_ranges aggregation and the same interval for the surface_histo aggregation, then it is possible to optimize this aggregation by directly indexing the ranges that a document belongs to and replacing these range and histogram aggregations with a terms aggregation. Here is what the document and aggregations would look like now:

{
 “location”: [-0.37, 49.2]
 “price_euros”: 188000,
 “price_euros_range”: “100000-200000”,
 “surface_m2”: 132,
 “surface_m2_range”: “120-140”,
 “has_photo”: true,
 “has_swimming_pool”: false,
 “has_balcony”: true
}
{
  “aggs”: {
    “location_grid”: {
      “geohash_grid”: {
        “field”: “location”,
        “precision”: 3
      }
    }
  },
  “price_ranges”: {
    “terms”: {
      “field”: “price_euros_range”
    }
  },
  “surface_histo”: {
    “terms”: {
      “field”: “surface_m2_range”
    }
  },
  “has_photo_terms”: {
    “terms”: {
      “field”: “has_photo”
    }
  },
  “has_swimming_pool_terms”: {
    “terms”: {
      “field”: “has_swimming_pool”
    }
  },
  “has_balcony_terms”: {
    “terms”: {
      “field”: “has_balcony”
    }
  }
}

Merge aggregations together

This aggregation will likely perform a bit faster, but we can further improve it: all these terms aggregations are executed against a field that can only have a finite set of values and we could easily merge them all into a single terms aggregation:

{
 “location”: [-0.37, 49.2]
 “price_euros”: 188000,
 “surface_m2”: 132,
 “has_photo”: true,
 “has_swimming_pool”: false,
 “has_balcony”: true,
 “attributes”: [
   “price_euros_range_100000-200000”,
   “surface_m2_range_120-140”,
   “has_photo”,
   “has_balcony”
 ]
}
{
  “aggs”: {
    “location_grid”: {
      “geohash_grid”: {
        “field”: “location”,
        “precision”: 3
      }
    }
  },
  “attributes_terms”: {
    “terms”: {
      “field”: “attributes”
    }
  }
}

Conclusion

As you can see we managed to go down from 6 aggregations to only 2 that will likely perform faster and still give us exactly the same information. For instance if you want to know how many listings have a balcony, you just need to look up the number of documents that contain has_balcony in the attributes_terms aggregation.

However we lost flexibility: with such an approach, more logic needs to be moved to the client and you can’t change the ranges for prices without reindexing. So if you want to let users decide on the ranges which should be used to compute buckets, it won’t work. But both approaches can be complementary: as you can see we still kept the original data in our documents, we just added new fields. So if you have a default set of ranges that would be used by 95% of your requests, you could still benefit from this optimized approach for these requests that rely on the defaults, and fallback to a regular range aggregation when users want to dig more into the data.