Tech Topics

Customizing Your Document Routing

In this article, we are going to take a look at Elasticsearch’s Routing feature. We’ll discuss what routing is and why you might consider using a non-default routing scheme.

What is Routing?

All of your data lives in a primary shard, somewhere in the cluster. You may have five shards or five hundred, but any particular document is only located in one of them. Routing is the process of determining which shard that document will reside in.

Because Elasticsearch tries hard to make defaults work for 90% of users, routing is handled automatically. For most users, it doesn’t matter where a document is stored.

The default routing scheme hashes the ID of a document and uses that to find a shard. This includes both user-provided IDs and randomly generated IDs picked by Elasticsearch. Default routing gives an even distribution of documents across the entire set of shards – you won’t have any “hotspots” where documents tend to cluster on one shard or another.

The need for custom routing

Random routing works well most of the time, but there are scenarios where domain knowledge of your data can lead to better performance with a custom routing solution. Imagine a single index with 20 shards (overallocated to support future growth).

What happens when a search request is executed on the cluster?

  1. Search request hits a node
  2. The node broadcasts this request to every shard in the index (either a primary or replica shard)
  3. Each shard performs the search query and responds with results
  4. Results are merged on the gateway node, sorted and returned to the user

Elasticsearch has no idea where to look for your document. All the docs were randomly distributed around your cluster…so Elasticsearch has no choice but to broadcasts the request to all 20 shards. This is a non-negligible overhead and can potentially impact performance.

Wouldn’t it be nice if we could tell Elasticsearch which shard the document lived in? Then you would only have to search one shard to find the document(s) that you need.

This is exactly what custom routing does.

Instead of blindly broadcasting to all shards, you tell Elasticsearch “Hey! Search for the data on this shard! It’s all there, I promise!”. For example, you could route documents based on their user_id. Or their zip or postcode. Or whatever is commonly searched/filtered in your application.

Routing ensures that all documents with the same routing value will locate to the same shard, eliminating the need to broadcast searches. This design pattern is often called the “User Data Flow”, popularized from Shay’s talk at Berlin Buzzwords.

Custom Routing is Fast

This point is so important, so central to custom routing, that it deserves its own section. Custom routing ensures that only one shard is queried.

This has the potentially to increase performance noticeably, if your problem fits into the niche that custom routing serves.

It doesn’t matter if you have 20 or 100 shards in your index, custom routing ensures that only the shards holding your data are queried. Under the right data organization, this can be the difference between a cluster that is struggling and one that doesn’t break a sweat.

Setting up your custom routing

To begin using custom routing, you need to tell Elasticsearch that you want certain types to use your routing instead of default. This is done by setting the mapping for a particular type:

$ curl -XPUT 'http://localhost:9200/store/order/_mapping' -d '
{
   "order":{
      "_routing":{
         "required":true
      }
   }
}'
Note: setting “required” to true is not explicitly required to use routing, but it is a good practice. If set to true, Elasticsearch will raise errors when you attempt certain operations without specifying a routing value. This will save you a lot of heartache.

This mapping says that every document in the “order” type must specify a custom routing. If you try to index a document without specify a routing, the operation will fail with an error. Indexing a document with custom routing is very easy – you just specify the routing in the URI:

$ curl -XPOST 'http://localhost:9200/store/order?routing=user123' -d '
{
   "productName":"sample",
   "customerID":"user123"
}

Extracting routing from the document

Sometimes it is necessary to derive the routing value from the document itself. Elasticsearch can automatically extract the routing information if you tweak the mapping just slightly and add a path parameter:

<del>$ curl -XPUT 'http://localhost:9200/store/order/_mapping' -d '
{
   "order":{
      "_routing":{
         "required":true,
         "path":"customerID"
      }
   }
}'
</del>

Now when you index, you can omit the routing parameter on the URI:

<del>$ curl -XPOST 'http://localhost:9200/store/order' -d '
{
   "productName":"sample",
   "customerID":"user123"
}'
</del>

Note: even though extraction is convenient, it is less performant than specifying the routing in the URI. The gateway node must load the document, extract the routing value, then serialize the document to the final shard/node. If you specify the routing in the URI, Elasticsearch can immediately serialize the doc to the appropriate shard without any overhead of loading the doc.

WARNING: extracting custom routing from the document is no longer supported; it was removed in Elasticsearch v2.0.  The ability to extract routing from the document itself can lead to strange edge-cases and potential trouble (e.g. if the field is multi-valued, which value do you use for routing?)  If you are <v2.0, it is not recommended to use extraction.

Searching with Routing

Great! You have routing setup! So how do you start searching with routing? It is easy. Simply append the appropriate routing parameter to the URI:

$ curl -XGET 'http://localhost:9200/store/order/_search?routing=user123' -d '
{
   "query":{
      "filtered":{
         "query":{
            "match_all":{}
         },
         "filter":{
            "term":{
               "userID":"user123"
            }
         }
      }
   }
}'

This query will find all the orders of user “user123″. Notice that in this particular example, a filtered query is essential.

If we instead just used the match_all:

$ curl -XGET 'http://localhost:9200/store/order/_search?routing=user123' -d '
{
   "query":{
      "match_all":{}
   }
}'

What you are actually asking Elasticsearch is “Match all documents that reside on the shard which ‘user123′ routes to”.

This shard almost certainly has documents other than those belonging to user123. Routing ensures that documents of a particular routing value all go to the same shard…but that doesn’t mean that other documents aren’t routed to the shard too.

So make sure you include filters where appropriate. :)

Searching multiple routing values

When a document is indexed, you may only specify a single routing value. It doesn’t make sense for a document to be routed to multiple shards!

But when searching, you can specify more than one routing value to search on. Imagine an ACL system where routing is based on a UserGroup. A user can be a member of several groups, so when searching you want to search all documents that the user has access to.

Each UserGroup is sandboxed to a single shard and with multiple routing values you can search just the shards that are relevant:

$ curl -XGET 'http://localhost:9200/forum/posts?routing=Admin,Moderator' -d '{...}'

That will search both the “Admin” and the “Moderator” shard, while ignoring every other UserGroup.

Considerations and complications

There are a several gotchas that you should be aware of when implementing a custom routing solution. 

Parent/Child/Grandchild schemes

Parent/Child mappings ensure that all children are routed to the same shard as the parent for performance reasons. Internally, Elasticsearch is setting the child’s routing value equal to the ID of the parent, ensuring that everyone colocates to the same shard.

However, if you add a third tier – grandchildren – parent/child mapping fails. Look at these sample documents:

curl -XPUT localhost:9200/parentchild/product/Product001 -d '{...}'
curl -XPUT localhost:9200/parentchild/vendors/VendorABC?parent=Product001 -d '{...}'
curl -XPUT localhost:9200/parentchild/vendordetails/LocationXYZ?parent=VendorABC -d '{...}'

Fairly straightforward nested relationships. Products have children called “Vendors”, while those have their own children called “VendorDetails”. So why doesn’t this work? Let’s look at the derived routing values:

Doc ID Parent Routing Value
Product001 - Product001
VendorABC Product001 Product001
LocationXYZ VendorABC VendorABC

As you can see, docs will automatically use the ID of their parent for routing. VendorABC uses Product001 (fine so far), but LocationXYZ uses VendorABC (bad). This means data is not being colocated correctly.

The solution is simple: tell the grandchild (and any subsequent great-grandchildren) that it’s routing value should be the ID of the grandparent. Internally, Elasticsearch will give preference to the routing parameter over the parent parameter:

curl -XPUT localhost:9200/parentchild/product/Product001 -d '{...}'
curl -XPUT localhost:9200/parentchild/vendors/VendorABC?parent=Product001 -d '{...}'
curl -XPUT localhost:9200/parentchild/vendordetails/LocationXYZ?parent=VendorABC&routing=Product001 -d '{...}'
Doc ID Parent Routing Value
Product001 - Product001
VendorABC Product001 Product001
LocationXYZ VendorABC Product001

Hotspots

When Elasticsearch manages the routing, it ensures that distribution is fairly uniform across all your shards. However, once you start implementing your own custom schemes, it is entirely possible that this uniformity is lost. Say you are routing by userID. Most of your users are small and have a handful of documents…but occasionally you get a monster user that has millions.

Instead of being distributed uniformly around the cluster, custom routing will ensure that all the documents of MonsterUser go to a single shard. That’s good for performance – searches will execute quickly.

But what happens if MonsterUser2 and MonsterUser3 also get allocated to the same shard? Now you have three big users located on a single shard while the rest of your shards are only lightly loaded.

Not a great situation. These types of scenarios can and do happen. The best defense against these “data hotspots” is to manually identify large (or otherwise performance-hungry) users and split them into their own index. You can then setup an alias to make the separation transparent to your application.

Now you have the best of both worlds: custom routing to keep most users sandboxed to a single shard, while big users are split into their own index and set of shards. Cool!

You can check-out any time you like, but you can never leave!

Once you specify custom routing when indexing a document, you must keep specifying it whenever you want to update it or when you want to delete it. You've taken some control from Elasticsearch and there's no easy way to give this control back if you decide you want to stop using custom routing at some point. The only way is to reindex your data (without specifying custom routing).

Conclusion

Custom routing is a powerful feature that can boost performance in specific situations. While default routing will suffice for most people, sometimes you just need more control over the placement of documents.