05 janvier 2017 Technique

Numeric and Date Ranges in Elasticsearch: Just Another Brick in the Wall

Par Nick Knize

Like this post? Love numbers and Elasticsearch? I’ll be at Elastic{ON} to meet with attendees face to face and answer your questions. Consider attending the event. Hope to see you there.

Imagine being able index your calendar to quickly find all events whose time range conflicts with a proposed event. Or creating a global television guide to find all shows, movies, sports, or broadcasts that are aired during certain time periods. Looking for something even more futuristic? Imagine creating a catalog of spectral signatures for all known cancer cells in order to more rapidly identify, classify, and diagnose potential malignant activity.

From discrete to continuous

Until recently, these use cases have been extremely difficult, or next to impossible, to solve at scale using Elasticsearch's discrete numeric and date fields. While running range queries over discrete points are the most typical and widely used capability, it quickly became clear that having the ability to index and search over continuous ranges would be a tremendously useful feature to support. Thanks to recent advancements in Apache Lucene this desire has become a reality in Elasticsearch.

Elasticsearch welcomes Numeric and Date Range field types

We are proud to announce the following new Range field types are included in the Elasticsearch 5.2 release:

  • integer_range
  • float_range
  • long_range
  • double_range
  • date_range

The mapping definition for these new range data types work the same way as their discrete Numeric and Date counterparts. The following provides an example mapping definition for a document type (called “conference”) that contains both an integer and date range:

PUT events/
{
  “mappings” : {
    “conference” : {
      “properties” : {
        “expected_attendees” : {
          “type” : “integer_range”
        },
        “time” : {
          “type” : “date_range”,
          “format” : “yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis” 
        }
      }
    }
  }
}

Indexing a document using the above mapping is as simple as defining the range fields the same way you would in a Numeric or Date Range Query. Like Date Range queries, indexing a Date Range field also uses the same Date Math format definition. The following example demonstrates indexing a document with an integer_range and date_range field type:

PUT events/conference/1
{
  “title” : “Pink Floyd / Wizard of Oz Halloween Trip”,
  “expected_attendees” : {
    “gte” : 100000,
    “lt” : 200001
  },
  “time” : {
    “gte” : “2015-10-31 12:00”,
    “lt” : “2015-11-01”
  }
}

Querying Numeric and Date Range fields use the same Range Query domain specific language (DSL) definition as before with the addition of a new optional relation parameter. This parameter enables the user to define the type of Range query desired. This is defined as one of: INTERSECTS (default), CONTAINS, or WITHIN and can be combined with other Boolean queries to achieve different desired results (e.g., DISJOINT). The following example demonstrates a WITHIN query over the events index:

GET events/_search
{
  “query” : {
    “range” : {
      “time” : {
        “gte” : “2015-10-01”,
        “lte” : “2015-12-31”,
        “relation” : “within”
      }
    }
  }
}

Like discrete numeric and date fields their new range counterparts are restricted to a single dimension only. In the future we plan to lift this restriction and expand them to handle 8 and 4 dimensions, respectively.

Lucene implementation

On October 18th, we published a blog post describing The Evolution of Numeric Range Filters in Apache Lucene. This historical account of Lucene’s maturation as a numeric search engine details the technical advancements of numeric indexing and search culminating in the Bkd-Tree Data Structure. Instead of continuing to represent numeric data using a structure specifically designed and tuned for text, the Bkd implementation introduced the first flexible tree structure designed specifically for indexing discrete numeric points. These discrete points can now be searched in Elasticsearch using the numeric range query DSL definition. Built around the theory and concepts of a k-dimensional tree (k-d tree), the new Points codec format also paved the way for providing multi-dimensional support for more advanced scientific or data analysis use cases and provide the foundation classes needed to bring the range field feature to Lucene and ultimately to Elasticsearch.

Indexing

The solution was fairly trivial: for each dimension, represent the dimensional numeric range as a 2 dimensional point where the encoding is stored as an array of minimum values, d, (for each dimension) followed by an array of maximum values, D. This means for the Bkd restriction of 8 dimensions the numeric range fields provided by Lucene only support indexing up to 4 dimensional ranges. The image below provides a simple illustration of Lucene’s dimensional range encoding.

Lucene's Dimensional Range Encoding

Essentially, to Lucene a 1-dimension range simply looks like a 2-dimension point. This encoding is then passed to the Bkd indexer which proceeds to build a hierarchical tree of d*2-dimensional points; where d is the dimension of the range, or 4 in the above illustration.

Searching

Implementing search for dimensional ranges boiled down to one critical step - defining and computing the spatial relation between the user provided search range (the target) and the indexed ranges (or minimum bounding ranges at each level of the tree). This was accomplished by implementing a dimensional range comparator that evaluates the range relation at each dimension. The spatial relations implemented in the comparator, and thus by Lucene numeric range queries, include: INTERSECTS, CONTAINS, and WITHIN. DISJOINT queries are then accomplished using an INTERSECTS query contained in a boolean query with a MUST_NOT clause. For multi-value fields a document is only considered DISJOINT if all values are disjoint. CROSSES relations are currently in-work and planned as a future enhancement. The following image illustrates the computed relations between two simple 1 dimensional ranges.

Range Relations

This example follows the convention that an open dot represents a range that does not include that specified value. The pink range represents the target (query) and the turquoise represents an indexed range.

Give it a try...

We are extremely excited to see the many unique and creative ways that Numeric and Date range indexing and searching is applied to solve your particular use case. As always we encourage feedback, suggestions, and contributions on our Elasticsearch github page and we hope these new field types come in handy for you!

And with that, the time is gone, the song is over… Enjoy!