06 September 2013

City Bikes and Elasticsearch Facets

By Konrad Beiske

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

A small demo of using facets in Elasticsearch on time series.

Introduction

In this article we will implement a solution with Elasticsearch from scratch. No prior knowledge of Elasticsearch is required. The installation of Elasticsearch is however skipped as I will be using a cluster hosted at Found. A local installation is described at elastic.co/guide or you can sign up for a free cluster. Where appropriate I will try to rather do gradual refinement than presenting an advanced, and possibly overengineered, solution right away. This implies that not all examples are the best practice way of doing things and the experience reader probably will spot a few optimizations right away, but as time elapses and more data is accumulated I plan to address the issues in further articles, one low hanging fruit at a time.

The Case

Oslo is one of many cities that have so called city bikes, or bikes for hire. The system works like this: a user walks to a nearby bike rack and unlocks a bike using his card. He may return the bike at any rack in Oslo. To assist users finding a nearby rack with free bikes or locks, the system provides the status of every rack through a webpage or custom smartphone app. In a city like Oslo there is one curious problem with this system: people tend to prefer using bikes downhill. This results in a congestion of bikes in the city centre (literally downtown!). Anyway, this is my theory. I know for a fact that the system operator use trucks for picking up and dropping off bikes. What I don’t know is whether the purpose is to transport the bikes to and from the workshop or to alleviate congestion. My objective for this case study is to explore this using Elasticsearch. Mainly there are two reasons for exploring this. Firstly to satisfy my curiosity and secondly to take a crack at estimating when racks get congested or depleted. The current applications are able to find the nearest available bike or lock, but they are not able to estimate the likelihood of the bike or lock remaining available by the time you get there.

The Data

The operator’s city bike webpage contains a map of all of their bike racks and their statuses. Every bike rack has a name, a position, a bike count and a free locks count. In Scala I express this as a case class:

case class BikeRack(name: String, longitude: Double, lattitude: Double, bicycles: Int, locks: Int, time : Date)

Note the addition of a date. This refers to the time at which the given status was observed.

Indexing

Using TagSoup it’s pretty straightforward to extract the required information from the webpage. TagSoup is a lenient html-parser that uses a best-effort approach for treating any html as xhtml. Combining TagSoup and Scala’s xml-support we can write a parser like this:

def getData() = {
  val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
  val parser = parserFactory.newSAXParser()
  val source = new org.xml.sax.InputSource("http://www.bysykler.no/oslo/kart-over-sykkelstativer")
  source.setEncoding("utf-8")
  val adapter = new scala.xml.parsing.NoBindingFactoryAdapter
  adapter.loadXML(source, parser)
}
def parse(data: Node): Seq[BikeRack] = {
  val timestamp = new Date()
  val onlinePattern = """.*Ledige sykler: (\d+).*Ledige låser: (\d+).*""".r
  val disabledPattern = """.*Ikke operativt.*""".r
  for {
    divTag <- data \\ "div"
    if (divTag \ "@class").toString() == "mapMarker"

    val status = getString(divTag, "data-content")
    val result = status match {
      case onlinePattern(bicycles, locks) =>
        Some(
          BikeRack(
            getString(divTag, "data-name"),
            getDouble(divTag, "data-poslng"),
            getDouble(divTag, "data-poslat"),
            Integer.parseInt(bicycles),
            Integer.parseInt(locks), timestamp))
      case disabledPattern() =>
        None
      case x: String =>
        println("Unrecognised format: " + x); None
    }
    if result.isDefined
  } yield result.get
}
def getString(node: Node, attr: String): String = {
  (node \ ("@" + attr)).toString()
}
def getDouble(node: Node, attr: String): Double = {
  java.lang.Double.valueOf(getString(node, attr))
}

In order to get this sequence of bike racks into Elasticsearch we need a client. Today’s option is Wabisabi. Wabisabi is an easy to use and is a Scala native client for Elasticsearch’s REST API.

def push(racks: Seq[BikeRack]) {
  val client = new Client("http://.foundcluster.com:9200/")
  import ExecutionContext.Implicits.global
  val futures = for {
    rack <- racks
    val future = client.index(
      index = "oslo", `type` = "bikerack", data = rack.toJson, refresh = false)
  } yield (rack, future)
  for ((rack, future) <- futures) {
    future.onSuccess {
      case result => {
        println(s"Indexing: [$rack]. Got response: [${result.getResponseBody()}]")
      }
    }
  }
  val seq = Future.sequence(futures.map(_._2));
  seq.onComplete {
    case _ => Client.shutdown()
  }
  seq.onSuccess{
    case l => println("Indexed: " + l.size + " racks")
  }
}

Elasticsearch expects the data to be JSON. With Scala’s string interpolation introduced in 2.10 its pretty straightforward to create JSON and redefine the BikeRack class like this:

case class BikeRack(name: String, longitude: Double, lattitude: Double, bicycles: Int, locks: Int, time : Date) {
  
  def toJson = {
    val utc = TimeZone.getTimeZone("UTC")
    val weekDayFormat = new SimpleDateFormat("u")
    weekDayFormat.setTimeZone(utc);
    val hourOfDayFormat = new SimpleDateFormat("H")
    hourOfDayFormat.setTimeZone(utc)
    val timeStampFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'")
    timeStampFormat.setTimeZone(utc)
    s"""
    {
    	"name" : "$name",
    	"location" : {
    		"lat" : $lattitude,
    		"lon" : $longitude
    	},
    	"bicycles" : $bicycles,
    	"locks" : $locks,
    	"timestamp" : "${timeStampFormat.format(time)}",
    	"weekday" : ${weekDayFormat.format(time)},
    	"hourOfDay" : ${hourOfDayFormat.format(time)}
    }"""
  }
}

For strings and numbers the formatting is native to JSON, but dates and geopositions are a bit trickier. Elasticsearch carefully tries to detect the contents of json strings when new fields are processed. If one formats dates according to one of the standard formats then no specific mapping is required.

Note the extraction of weekday and hour of day into separate fields. This is redundant, but it allows for greater flexibility when building queries that treat time as recurring events rather than a straight continuum.

Elasticsearch is now ready to receive data from our parser. All we have to do is to invoke it regularly over a period of time so we can start trend analysis.

Queries

Now the fun begins. Don’t worry if you don’t have much data in your cluster yet, we will start with some basic queries. Our first example is to calculate the average number of bikes for a given rack. We can do this by using the statistical facet and a simple match query. The match query retrieves all the documents that are named: “92-Blindernveien 5” and the statistical facet calculates the average of the bicycle field for all the documents retrieved.

The Statistical Facet

Simple average for all observations:

curl http://.foundcluster.com:9200/oslo/bikerack/_search?pretty=true -XPOST -d '{
"query": {
    "match": {"name": "92-Blindernveien 5"}
},
"facets" : {
        "stat1" : {
            "statistical" : {
                "field" : "bicycles"
            }
        }
    }

}'

For this demo I have been running the parser every five minutes for more than two weeks, and the total number of bike rack observations in the index is around 54 000. Executing the above query produces a result like this:

{
  "took" : 2229,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 4878,
    "max_score" : 4.871404,
    "hits" : [  ]
  },
  "facets" : {
    "stat1" : {
      "_type" : "statistical",
      "count" : 4878,
      "total" : 85934.0,
      "min" : 0.0,
      "max" : 28.0,
      "mean" : 17.616646166461663,
      "sum_of_squares" : 1710654.0,
      "variance" : 40.3413547214603,
      "std_deviation" : 6.351484450225813
    }
  }
}

In essence, the result consists of three parts. A metadata section, a hits section and a facets section. In the hits section we get the total number of documents that matched the query section of our query. In this case a total of 4878 bike rack observations. Unless you specify otherwise, Elasticsearch will include the ten highest ranked documents in the nested hits key. For this query the interesting part is in the facets section. The stat1 key holds the result of our stat1 facet. From this section we see that the average is 17,62 bikes. Combining the average with a standard deviation of 6,35 we can deduce that this bike rack was not empty in 95% of the observations, but the minimum value of 0 tells us that this rack have been observed as depleted in at least one observation.

Using the Hour of Day

When I leave home for work in the morning I’m often running a bit late, so every second counts. I have the option of walking to the closest rack or taking the bus. As it happens, it’s actually faster to go by bike than taking the bus, but the bike rack and the bus stop are located in opposite directions. I therefore use my mobile phone to check the status of the rack. As a matter of fact, I cannot remember the last time the rack was empty in the morning. This begs the question: can Elasticsearch prove that the probability of the rack being empty when I leave for work is very little? With that information at hand I could save those precious seconds it takes to check the current status of the rack. Let’s further refine our query and only consider observations between 09:00 and 10:00.

curl http://.foundcluster.com:9200/oslo/bikerack/_search?pretty=true -XPOST -d '{
"query": {
    "bool" : {
        "must" : [
            {
                "match": { "name": "92-Blindernveien 5"}
            },
            {
                "match": { "hourOfDay": "9"}
            }
        ]
    }
},
"facets" : {
        "stat1" : {
            "statistical" : {
                "field" : "bicycles"
            }
        }
    }
}'

The bool query allows us to define several queries and how Elasticsearch should join their results. In this case we require a match in both queries. This time the statistical facet gave us the following result:

"stat1" : {
      "_type" : "statistical",
      "count" : 213,
      "total" : 4716.0,
      "min" : 10.0,
      "max" : 28.0,
      "mean" : 22.140845070422536,
      "sum_of_squares" : 107494.0,
      "variance" : 14.44964623421282,
      "std_deviation" : 3.8012690294443536
    }

The minimum observed number of bikes is 10, the mean is 22.14 and the standard deviation is 3,80. Based on these figures we can conclude that the rack has never been depleted between 9:00 and 10:00 in the observed time and is not likely to become depleted at that time in the near future.

To better understand the trends of this particular bike rack we can use the terms_stats facet. The terms_stats facet is similar to the statistical facet, but requires specification of a key field, which it uses to group the documents by, calculate and calculate statistics for every term in that field. Using the terms_stats facet our query looks like this:

curl http://.foundcluster.com:9200/oslo/bikerack/_search?pretty=true -XPOST -d '{
"query": {
    "match": {"name": "92-Blindernveien 5"}
},
"facets" : {
        "stat1" : {
            "terms_stats" : {
                "key_field" : "hourOfDay",
                "value_field" : "bicycles",
                "size": 24,
                "order": "term"
            }
        }
    }

}'

The result shows the statistics for every hour of the day.

    "stat1" : {
      "_type" : "terms_stats",
      "missing" : 0,
      "terms" : [ {
        "term" : "0",
        "count" : 203,
        "total_count" : 203,
        "min" : 10.0,
        "max" : 27.0,
        "total" : 3933.0,
        "mean" : 19.374384236453203
      }, {
        "term" : "1",
        "count" : 203,
        "total_count" : 203,
        "min" : 10.0,
        "max" : 27.0,
        "total" : 3939.0,
        "mean" : 19.40394088669951
      }, {
        "term" : "10",
        "count" : 218,
        "total_count" : 218,
        "min" : 6.0,
        "max" : 28.0,
        "total" : 4330.0,
        "mean" : 19.862385321100916
      }, {
        "term" : "11",
        "count" : 214,
        "total_count" : 214,
        "min" : 3.0,
        "max" : 27.0,
        "total" : 3901.0,
        "mean" : 18.22897196261682
      }, {
        "term" : "12",
        "count" : 211,
        "total_count" : 211,
        "min" : 0.0,
        "max" : 27.0,
        "total" : 3003.0,
        "mean" : 14.23222748815166
      }, {
        "term" : "13",
        "count" : 211,
        "total_count" : 211,
        "min" : 0.0,
        "max" : 25.0,
        "total" : 2530.0,
        "mean" : 11.990521327014218
      }, {
        "term" : "14",
        "count" : 203,
        "total_count" : 203,
        "min" : 0.0,
        "max" : 25.0,
        "total" : 1800.0,
        "mean" : 8.866995073891626
      }, {
        "term" : "15",
        "count" : 203,
        "total_count" : 203,
        "min" : 0.0,
        "max" : 25.0,
        "total" : 2112.0,
        "mean" : 10.403940886699507
      }, {
        "term" : "16",
        "count" : 204,
        "total_count" : 204,
        "min" : 0.0,
        "max" : 28.0,
        "total" : 3259.0,
        "mean" : 15.97549019607843
      }, {
        "term" : "17",
        "count" : 202,
        "total_count" : 202,
        "min" : 4.0,
        "max" : 27.0,
        "total" : 3275.0,
        "mean" : 16.212871287128714
      }, {
        "term" : "18",
        "count" : 202,
        "total_count" : 202,
        "min" : 0.0,
        "max" : 26.0,
        "total" : 3276.0,
        "mean" : 16.217821782178216
      }, {
        "term" : "19",
        "count" : 202,
        "total_count" : 202,
        "min" : 1.0,
        "max" : 26.0,
        "total" : 3748.0,
        "mean" : 18.554455445544555
      }, {
        "term" : "2",
        "count" : 203,
        "total_count" : 203,
        "min" : 10.0,
        "max" : 27.0,
        "total" : 3936.0,
        "mean" : 19.389162561576356
      }, {
        "term" : "20",
        "count" : 204,
        "total_count" : 204,
        "min" : 6.0,
        "max" : 28.0,
        "total" : 3548.0,
        "mean" : 17.392156862745097
      }, {
        "term" : "21",
        "count" : 202,
        "total_count" : 202,
        "min" : 10.0,
        "max" : 28.0,
        "total" : 3775.0,
        "mean" : 18.68811881188119
      }, {
        "term" : "22",
        "count" : 200,
        "total_count" : 200,
        "min" : 9.0,
        "max" : 27.0,
        "total" : 3848.0,
        "mean" : 19.24
      }, {
        "term" : "23",
        "count" : 202,
        "total_count" : 202,
        "min" : 10.0,
        "max" : 27.0,
        "total" : 3923.0,
        "mean" : 19.42079207920792
      }, {
        "term" : "3",
        "count" : 201,
        "total_count" : 201,
        "min" : 10.0,
        "max" : 27.0,
        "total" : 3892.0,
        "mean" : 19.36318407960199
      }, {
        "term" : "4",
        "count" : 203,
        "total_count" : 203,
        "min" : 10.0,
        "max" : 27.0,
        "total" : 3935.0,
        "mean" : 19.38423645320197
      }, {
        "term" : "5",
        "count" : 202,
        "total_count" : 202,
        "min" : 10.0,
        "max" : 27.0,
        "total" : 3898.0,
        "mean" : 19.297029702970296
      }, {
        "term" : "6",
        "count" : 203,
        "total_count" : 203,
        "min" : 10.0,
        "max" : 27.0,
        "total" : 3734.0,
        "mean" : 18.39408866995074
      }, {
        "term" : "7",
        "count" : 203,
        "total_count" : 203,
        "min" : 9.0,
        "max" : 27.0,
        "total" : 3664.0,
        "mean" : 18.049261083743843
      }, {
        "term" : "8",
        "count" : 203,
        "total_count" : 203,
        "min" : 12.0,
        "max" : 28.0,
        "total" : 4443.0,
        "mean" : 21.886699507389164
      }, {
        "term" : "9",
        "count" : 213,
        "total_count" : 213,
        "min" : 10.0,
        "max" : 28.0,
        "total" : 4716.0,
        "mean" : 22.140845070422536
      } ]
    }

Histogram Facet

The above result is great, but isn’t it a bit odd that a numeric field is ordered alphabetically? The explanation is this: the terms_stats facet works on strings, and a bug in the first version of my parser ended up with Elasticsearch mapping hourOfDay and dayOfWeek as strings. Of course, the best solution for cleaning up the mess is to reindex the data, but what if reindexing is not feasible and the fields were not indexed in the first place? What we want a more granular resolution? The histogram facet is designed to work on numerals and allows for specification of bucket size at query time.

By using the key_script and value_script attributes in the histogram facet, it allows us to extract proper numerals at query time:

{
"query": {
    "match": {"name": "92-Blindernveien 5"}
},
"facets" : {
        "histo1" : {
            "histogram" : {
                "key_script" : "doc['timestamp'].date.dayOfWeek * 100 + doc['timestamp'].date.hourOfDay",
                "value_script" : "doc['bicycles'].value"
            }
        }
    }

}

And of course, this also allows us to tune the bucket size at query time:

{
"query": {
    "match": {"name": "92-Blindernveien 5"}
},
"facets" : {
        "histo1" : {
            "histogram" : {
                "key_script" : "doc['timestamp'].date.dayOfWeek * 24 * 60 + doc['timestamp'].date.hourOfDay * 60 + doc['timestamp'].date.minuteOfHour",
                "value_script" : "doc['bicycles'].value",
                "interval" : 20
            }
        }
    }
}

You might have noticed that I omitted the curl command in the histogram examples. This has nothing to do with the histogram facet in particular, but the fact they use single quotes (’) in the scripts which would have to be escaped. To run such queries I recommend saving them to a file and using the following command:

curl http://.foundcluster.com:9200/oslo/bikerack/_search?pretty=true -XPOST -d @query.json

Conclusion

In this article we have seen the flexibility of Elasticsearch as a data analysis tool. Go ahead and create a small script and start shoving in some JSON documents, then take it from there. You will probably soon find the need to do some mapping and tweaking of your indexer, and yes, Elasticsearch lets you do that. If there are breaking changes in your mappings, simply create a new index and when it looks good you can create a script to reindex the documents from the old index. The facets are not as flexible as traditional SQL, but they sure are fast, and once you get your head around them they can actually deliver a lot.

Next: City Bikes Part Two - Reindexing and Query Optimization with Filters