February 20, 2013

Managing Relations Inside Elasticsearch

Data in the real world is rarely simple - often times it is a jumble of interlocking relations.

How do you represent relational data in Elasticsearch? There are a few mechanisms that can be used to provide relation support. Each has its pros and cons, making them useful for different situations.

Inner Objects

The simplest mechanism is called "inner objects". These are JSON objects embedded inside your parent document:

{
  "name":"Zach",
  "car":{
    "make":"Saturn",
    "model":"SL"
  }
}

Simple, right? The "car" field is just another JSON object, with the inner object having two properties ("make" and "model"). This inner object mapping will work as long as you have a one-to-one relationship between the root object and the inner object. E.g. every person has at most one "car".

But what if Zach owns two cars, and we add another person (Bob) who owns just one car?

{
  "name" : "Zach",
  "car" : [
    {
      "make" : "Saturn",
      "model" : "SL"
    },
    {
      "make" : "Subaru",
      "model" : "Imprezza"
    }
  ]
}
{
  "name" : "Bob",
  "car" : [
    {
      "make" : "Saturn",
      "model" : "Imprezza"
    }
  ]
}

Ignoring the fact that Saturn never made an Imprezza car, what happens when we try to search for it? Logically, only Bob has a "Saturn Imprezza", so we should be able to do a query like:

`query: car.make=Saturn AND car.model=Imprezza`

Right? Well, no, that doesn't work. If you perform that query, you'll receive both documents as the result. What happens is that Elasticsearch internally flattens inner objects into a single object. So Zach's entry actually looks like this:

{
  "name" : "Zach",
  "car.make" : ["Saturn", "Subaru"],
  "car.model" : ["SL", "Imprezza"]
}

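"Saturn" appears somewhere in `car.make` and "Imprezza" appears somewhere in `car.model`, which explains why Zach was returned as a result: the flattened fields no longer record which make belonged to which model. Elasticsearch is fundamentally flat, so internally the documents are represented as flattened fields. Hmm.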
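To make the flattening concrete, here is a rough Python sketch of the idea. It is not how Elasticsearch is implemented: the `flatten` and `matches_all` helpers are made up for illustration, and text analysis (lowercasing, tokenizing) is ignored entirely.

```python
def flatten(doc, prefix=""):
    """Collapse inner objects into dotted field names holding lists of values."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Recurse into the inner object and merge its values by field name.
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat


def matches_all(flat_doc, conditions):
    """True when every field contains its wanted value, in any position."""
    return all(value in flat_doc.get(field, []) for field, value in conditions.items())


zach = {"name": "Zach", "car": [{"make": "Saturn", "model": "SL"},
                                {"make": "Subaru", "model": "Imprezza"}]}
bob = {"name": "Bob", "car": [{"make": "Saturn", "model": "Imprezza"}]}

query = {"car.make": "Saturn", "car.model": "Imprezza"}
hits = [d["name"] for d in (zach, bob) if matches_all(flatten(d), query)]

print(flatten(zach))
print(hits)  # ['Zach', 'Bob']
```

Once the values are pooled per field name, there is nothing left to tell the matcher that "Saturn" and "Imprezza" came from different cars.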

Nested

As an alternative to inner objects, Elasticsearch provides the concept of "nested types". Nested documents look identical to inner objects at the document level, but provide the functionality we were missing above (as well as some limitations).

Example nested document:

{
  "name" : "Zach",
  "car" : [
    {
      "make" : "Saturn",
      "model" : "SL"
    },
    {
      "make" : "Subaru",
      "model" : "Imprezza"
    }
  ]
}

At the mapping level, nested types must be explicitly declared (unlike inner objects, which are automatically detected):

{
  "person":{
    "properties":{
      "name" : {
        "type" : "string"
      },
      "car":{
        "type" : "nested"
      }
    }
  }
}

The problem with inner objects was that each embedded JSON object is not treated as an individual component of the document. Instead, it is merged with the other inner objects that share the same property names.

This is not the case with nested documents. Each nested doc remains independent, and you can perform a query like `car.make=Saturn AND car.model=Imprezza` without a problem.
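As a sketch of what that looks like in practice, the snippet below builds a request body for the `nested` query and pairs it with a toy matcher that mimics the per-object behaviour. The `nested` wrapper with `path` and `query` is the documented shape; the `bool`/`match` clauses inside it are just one reasonable choice, and `nested_match` is a made-up helper, not part of any client library.

```python
import json

# Request body for a nested query against the "car" path.
nested_query = {
    "query": {
        "nested": {
            "path": "car",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"car.make": "Saturn"}},
                        {"match": {"car.model": "Imprezza"}},
                    ]
                }
            },
        }
    }
}


def nested_match(doc, path, conditions):
    """Toy model: all conditions must hold within one and the same nested object."""
    return any(
        all(obj.get(field) == value for field, value in conditions.items())
        for obj in doc.get(path, [])
    )


zach = {"name": "Zach", "car": [{"make": "Saturn", "model": "SL"},
                                {"make": "Subaru", "model": "Imprezza"}]}
bob = {"name": "Bob", "car": [{"make": "Saturn", "model": "Imprezza"}]}

wanted = {"make": "Saturn", "model": "Imprezza"}
hits = [d["name"] for d in (zach, bob) if nested_match(d, "car", wanted)]

print(json.dumps(nested_query, indent=2))
print(hits)  # ['Bob']
```

The difference from the flattened case is the `any(all(...))` shape: each car is checked on its own, so Zach's Saturn SL and Subaru Imprezza can no longer combine into a false match.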

Elasticsearch is still fundamentally flat, but it manages the nested relation internally to give the appearance of a nested hierarchy. When you create a nested document, Elasticsearch actually indexes two separate documents (root object and nested object), then relates the two internally. Both docs are stored in the same Lucene block on the same shard, so read performance is still very fast.

This arrangement does come with some disadvantages. Most obviously, you can only access these nested documents using a special `nested` query. Another big disadvantage comes when you need to update the document, either the root or any of the nested objects.

Since the docs are all stored in the same Lucene block, and Lucene never allows random write access to its segments, updating one field in the nested doc will force a reindex of the entire document.

This includes the root and any other nested objects, even if they were not modified. Internally, ES will mark the old document as deleted, update the field and then reindex everything into a new Lucene block. If your data changes often, nested documents can have a non-negligible overhead associated with reindexing.
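The cost is easier to see with a toy model. The sketch below is not Lucene: `ToyIndex` is a hypothetical append-only store invented for this illustration, and the order in which it writes nested docs and the root is an arbitrary choice. It only shows that a one-field change rewrites every document in the block.

```python
class ToyIndex:
    """Append-only store: documents are never edited in place, only replaced."""

    def __init__(self):
        self.docs = []        # every document ever written (root and nested)
        self.deleted = set()  # positions of documents marked as deleted
        self.writes = 0       # how many documents have been written in total

    def index_block(self, root_id, root, nested):
        """Write the nested docs and their root as one contiguous block."""
        for body in nested + [root]:
            self.docs.append((root_id, dict(body)))
            self.writes += 1

    def update_nested(self, root_id, position, field, value):
        """Change one field of one nested doc by rewriting the whole block."""
        live = [i for i, (rid, _) in enumerate(self.docs)
                if rid == root_id and i not in self.deleted]
        bodies = [dict(self.docs[i][1]) for i in live]
        self.deleted.update(live)        # mark the old block as deleted
        bodies[position][field] = value  # apply the single-field change
        self.index_block(root_id, bodies[-1], bodies[:-1])


index = ToyIndex()
index.index_block("zach", {"name": "Zach"},
                  [{"make": "Saturn", "model": "SL"},
                   {"make": "Subaru", "model": "Imprezza"}])

writes_before = index.writes
index.update_nested("zach", 0, "model", "Ion")  # "Ion" is just an example value
rewritten = index.writes - writes_before

print(rewritten, len(index.deleted))  # 3 3
```

One changed field, three documents rewritten and three old ones left behind as deleted. With hundreds of nested objects per root, that ratio gets uncomfortable quickly.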

Lastly, it is not possible to "cross reference" between nested documents. One nested doc cannot "see" another nested doc's properties. For example, you are not able to filter on "A.name" but facet on "B.age". You can get around this by using `include_in_root`, which effectively copies the nested docs into the root, but this gets you back to the problems of inner objects.

Parent/Child

The last method that Elasticsearch provides is Parent/Child types. This scheme is a looser coupling than nested, and gives you a set of slightly more powerful queries. Let's look at an example where a single person has multiple homes (in different states). Your parent has a mapping as usual, perhaps:

{
  "mappings":{
    "person":{
      "properties":{
        "name":{
          "type":"string"
        }
      }
    }
  }
}

While your children have their own mapping outside the parent, with a special `_parent` property set:

{
  "homes":{
    "_parent":{
      "type" : "person"
    },
    "properties":{
      "state" : {
        "type" : "string"
      }
    }
  }
}

The `_parent` field tells Elasticsearch that the "homes" type is a child of the "person" type. Adding documents to this scheme is very easy. The parent doc is indexed as normal:

$ curl -XPUT localhost:9200/test/person/zach/ -d'
{
  "name" : "Zach"
}'

Indexing child documents is almost like normal, except you need to specify which parent the child belongs to in the `parent` query parameter ('zach' in this case, which is the ID that we used in the above document):

$ curl -XPOST localhost:9200/test/homes?parent=zach -d'
{
  "state" : "Ohio"
}'
$ curl -XPOST localhost:9200/test/homes?parent=zach -d'
{
  "state" : "South Carolina"
}'

Both of these documents are now associated with the 'zach' parent document, which allows you to use special queries such as:

Has Parent Query, which works on parent documents and returns children
Has Child Query, which works on child documents and returns parents
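Here is roughly what those two queries look like as request bodies, alongside a toy in-memory version of their semantics. The `type` and `parent_type` keys are how these queries were documented for Elasticsearch releases of this era, so check them against the version you run. The `has_child` and `has_parent` Python functions are made up for this sketch, and storing the parent ID under a `_parent` key is a simplification.

```python
import json

# Return parents (people) that have at least one child (home) in Ohio.
has_child_query = {
    "query": {
        "has_child": {
            "type": "homes",
            "query": {"match": {"state": "Ohio"}},
        }
    }
}

# Return children (homes) whose parent (person) is named Zach.
has_parent_query = {
    "query": {
        "has_parent": {
            "parent_type": "person",
            "query": {"match": {"name": "Zach"}},
        }
    }
}

# Toy data mirroring the documents indexed above.
people = {"zach": {"name": "Zach"}}
homes = [
    {"_parent": "zach", "state": "Ohio"},
    {"_parent": "zach", "state": "South Carolina"},
]


def has_child(children, parents, predicate):
    """Test the children, return the parents of those that match."""
    parent_ids = sorted({c["_parent"] for c in children if predicate(c)})
    return [parents[pid] for pid in parent_ids]


def has_parent(children, parents, predicate):
    """Test the parents, return the children of those that match."""
    return [c for c in children if predicate(parents[c["_parent"]])]


ohio_owners = has_child(homes, people, lambda home: home["state"] == "Ohio")
zach_homes = has_parent(homes, people, lambda person: person["name"] == "Zach")

print(json.dumps(has_child_query, indent=2))
print(json.dumps(has_parent_query, indent=2))
print(ohio_owners)                       # [{'name': 'Zach'}]
print([h["state"] for h in zach_homes])  # ['Ohio', 'South Carolina']
```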

You can also query the parent or child types individually, since they are first-class types and will respond to normal queries as usual (you just can't use the relationship values).

The big problem with nested documents was their storage: everything is stored in the same Lucene block. Parent/Child removes this limitation by separating the two documents and only loosely coupling them. There are some pros and cons to this. The loose coupling means you are more free to update/delete child docs, since they have no effect on the parent or other children.

The downside is that Parent/Child is slightly less performant than Nested. The child docs are routed to the same shard as the parent, so they will still benefit from shard-level caches and memory filtering. But they aren't quite as fast as nested since they are not colocated in the same Lucene block. There is also a bit more memory overhead, since Elasticsearch needs to keep an in-memory "join table", which manages the relations.
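The "same shard" part falls out of routing: the parent ID is used as the routing value for the child. A minimal sketch of the idea follows, assuming a hypothetical five-shard index and using CRC32 purely as a stand-in hash. Elasticsearch uses its own hash function, so the actual shard numbers will differ.

```python
import zlib

NUM_SHARDS = 5  # hypothetical number of primary shards


def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Pick a shard from a routing value (stand-in hash, not Elasticsearch's own)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def route(doc_id, parent=None):
    """Children are routed by their parent's ID, everything else by its own ID."""
    return shard_for(parent if parent is not None else doc_id)


parent_shard = route("zach")
# The child IDs below are made up; the routing only depends on the parent.
child_shards = [route(child_id, parent="zach") for child_id in ("home-ohio", "home-sc")]

print(parent_shard, child_shards)
```

Whatever the child's own ID is, it lands wherever "zach" lands, which is what lets the join happen locally on one shard.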

Lastly, you'll run into situations where sorting or scoring are, frankly, very difficult. For example, it is impossible to tell which child documents matched your `has_child` filter, just that one of the docs of the returned parent matched the criteria. This can be frustrating depending on your use-case.

Denormalization

Sometimes the best option is to simply denormalize your data where appropriate. The relational facilities that Elasticsearch provides are great for certain scenarios...but were never meant to provide the robust relational features that you expect from an RDBMS.

At its heart, Elasticsearch is a flat hierarchy and trying to force relational data into it can be very challenging. Sometimes the best solution is to judiciously choose which data to denormalize, and where a second query to retrieve children is acceptable. Denormalization gives you arguably the most power and flexibility.

Of course, this comes with the burden of administrative overhead. You get to manage relations, and you get to perform the required queries/filters to associate the various types. Yay!
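For a feel of what "managing relations yourself" means, here is a small in-memory sketch of both approaches: copying parent data into each child, and running a second lookup by ID. The field names (`owner_id`, `owner_name`) and Bob's home are invented for the example; nothing here talks to Elasticsearch.

```python
people = {"zach": {"name": "Zach"}, "bob": {"name": "Bob"}}
homes = [
    {"owner_id": "zach", "state": "Ohio"},
    {"owner_id": "zach", "state": "South Carolina"},
    {"owner_id": "bob", "state": "Ohio"},  # hypothetical extra document
]


def denormalize(homes, people):
    """Copy the owner's name into every home so one flat query can use both."""
    return [dict(home, owner_name=people[home["owner_id"]]["name"]) for home in homes]


def owners_in_state(flat_homes, state):
    """Single pass over denormalized docs: no join needed at query time."""
    return sorted({h["owner_name"] for h in flat_homes if h["state"] == state})


def homes_of(homes, owner_id):
    """The 'second query': fetch children using an ID found by a first query."""
    return [h["state"] for h in homes if h["owner_id"] == owner_id]


flat_homes = denormalize(homes, people)

print(owners_in_state(flat_homes, "Ohio"))  # ['Bob', 'Zach']
print(homes_of(homes, "zach"))              # ['Ohio', 'South Carolina']
```

The catch is the one described above: if Zach changes his name, every denormalized home has to be rewritten, and it is your code that has to remember to do it.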

Conclusion and Recap

This turned into a long, wordy article, so here is a bulleted recap:

Inner Object

Easy, fast, performant
Only applicable when one-to-one relationships are maintained
No need for special queries

Nested

Nested docs are stored in the same Lucene block as each other, which helps read/query performance. Reading a nested doc is faster than the equivalent parent/child.
Updating a single field in a nested document (parent or nested children) forces ES to reindex the entire nested document. This can be very expensive for large nested docs
"Cross referencing" nested documents is impossible
Best suited for data that does not change frequently

Parent/Child

Children are stored separately from the parent, but are routed to the same shard. So parent/child is slightly less performant on read/query than nested
Parent/child mappings have a bit of extra memory overhead, since ES maintains a "join" list in memory
Updating a child doc does not affect the parent or any other children, which can potentially save a lot of indexing on large docs
Sorting/scoring can be difficult with Parent/Child since the Has Child/Has Parent operations can be opaque at times

Denormalization

You get to manage all the relations yourself!
Most flexible, most administrative overhead
May be more or less performant depending on your setup