Shared Indexedit

We can use a large shared index for the many smaller forums by indexing the forum identifier in a field and using it as a filter:

PUT /forums
{
  "settings": {
    "number_of_shards": 10 
  },
  "mappings": {
    "post": {
      "properties": {
        "forum_id": { 
          "type":  "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

PUT /forums/post/1
{
  "forum_id": "baking", 
  "title":    "Easy recipe for ginger nuts",
  ...
}

Create an index large enough to hold thousands of smaller forums.

Each post must include a forum_id to identify which forum it belongs to.

We can use the forum_id as a filter to search within a single forum. The filter will exclude most of the documents in the index (those from other forums), and caching will ensure that responses are fast:

GET /forums/post/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": "ginger nuts"
        }
      },
      "filter": {
        "term": {
          "forum_id": {
            "baking"
          }
        }
      }
    }
  }
}

This approach works, but we can do better. The posts from a single forum would fit easily onto one shard, but currently they are scattered across all ten shards in the index. This means that every search request has to be forwarded to a primary or replica of all ten shards. What would be ideal is to ensure that all the posts from a single forum are stored on the same shard.

In Routing a Document to a Shard, we explained that a document is allocated to a particular shard by using this formula:

shard = hash(routing) % number_of_primary_shards

The routing value defaults to the document’s _id, but we can override that and provide our own custom routing value, such as forum_id. All documents with the same routing value will be stored on the same shard:

PUT /forums/post/1?routing=baking 
{
  "forum_id": "baking", 
  "title":    "Easy recipe for ginger nuts",
  ...
}

Using forum_id as the routing value ensures that all posts from the same forum are stored on the same shard.

When we search for posts in a particular forum, we can pass the same routing value to ensure that the search request is run on only the single shard that holds our documents:

GET /forums/post/_search?routing=baking 
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": "ginger nuts"
        }
      },
      "filter": {
        "term": { 
          "forum_id": {
            "baking"
          }
        }
      }
    }
  }
}

The query is run on only the shard that corresponds to this routing value.

We still need the filtering query, as a single shard can hold posts from many forums.

Multiple forums can be queried by passing a comma-separated list of routing values, and including each forum_id in a terms query:

GET /forums/post/_search?routing=baking,cooking,recipes
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": "ginger nuts"
        }
      },
      "filter": {
        "terms": {
          "forum_id": {
            [ "baking", "cooking", "recipes" ]
          }
        }
      }
    }
  }
}

While this approach is technically efficient, it looks a bit clumsy because of the need to specify routing values and terms queries on every query or indexing request. Index aliases to the rescue!