Tech Topics

All About Elasticsearch Filter BitSets

WARNING: This article contains outdated information. We no longer recommend taking its advice.

When building queries in Elasticsearch, you’ll often find yourself composing sets of filters. Say you need to filter users that are:

  • Gender: Male
  • Age: 23-26
  • Language: English

Generally you should place these filters inside of a Boolean Filter.

But wait…isn’t there an And Filter? What about the Or Filter and Not Filters? Are they simply an alternative syntax to the Bool filter?

The short answer is: No, they are very distinct. More importantly, they can have a big impact on your query performance.

BitSets

The key to understanding the difference will require a detour into BitSets. Fundamentally, a BitSet is just an array of bits that represent state. A position in the bit array is either a 1 or a 0.

Filters don’t score documents – they simply include or exclude. If a document matches a filter, it is represented with a one in the BitSet; otherwise a zero. This means that Elasticsearch can store an entire segment’s filter state (“who matches this particular filter?”) in a single, compact BitSet.

The first time Elasticsearch executes a filter, it parses Lucene segment data structures to determine what matches your filter. Instead of throwing away this information, it caches it inside a BitSet.

The next time the same filter is executed, Elasticsearch can reference the compact BitSet instead of the Lucene segments. This has huge performance benefits.

How do I love thee BitSets? Let me count the ways…

It is hard to overstate how fast bitwise operations are.

Bitwise operations are such a basic unit of computation that a CPU has dedicated bitwise circuitry. Performing an in-memory bitwise AND is orders of magnitude faster than parsing Lucene data structures and performing the intersection manually.

BitSets play nicely with other filter BitSets too. Have three filters? Elasticsearch will do a bitwise AND on all three bitsets to derive your final matching set of documents.

Even better, BitSets are cached independent of the query itself. A complex query could have a dozen filters, but each of those filter BitSets are independant and entirely reusable in other contexts. This allows Elasticsearch to be very efficient with filter reuse.

Furthermore, since BitSets are stored on a per-segment basis, Elasticsearch can do some cool performance tricks. Lucene segments are immutable – once written to disk, they will never change.

If a particular filter doesn’t match any documents in a segment (the BitSet is all zeros), in the future Elasticsearch can ignore that entire BitSet when performing it’s filtering operation.

Similarly, as new segments are added, cached filter BitSets do not need to be invalidated. If you are indexing new documents into a MySQL table, for example, the B-Tree index is constantly updating due to the nature of B-Trees (they are sorted data structures).

With Elasticsearch filter caches, only newly created segments need to build filter BitSet; old BitSets can be reused without modification.

Boolean vs. And/Or/Not

“Ok, we get it. BitSets are cool! Why does this concern me?” you may be asking yourself.

It matters because the Bool filter utilizes BitSets while the And/Or/Not filters do not. If you put a Terms Filter inside of an And…no BitSet will be used, even though it exists.

Why?

And/Or/Not work on a doc-by-doc basis. They first load whatever data is needed into the Field Data memory pool then iterate over the docs, excluding as they go. BitSets are not used and there is no composability from cached filters. Elasticsearch is simply scanning down the list of documents and checking each one individually

If you have multiple Filters, And/Or/Not will “short-circuit” the operation, only passing the matching docs to the next filter.

This reduces the amount of work that each subsequent filter needs to perform. For this reason, your “heaviest” filter should always be placed last – typically Geo filters since they can perform some heavy computations to determine distance.

When to use And/Or/Not

It may seem that Bool filters are superior in every regard. Is there a time to use And/Or/Not?

And/Or/Not are more performant when you are working with filters that do not return a bitset. These are operations that must iterate over every document anyway. For example, a custom script is not BitSet-able because it performs computation on every document.

In these cases, And/Or/Not is a better choice than the Bool. The list of Non-Bitset filters is very small, so it is easy to remember:

  • Geo* filters
  • Scripts
  • Numeric_range

That’s it! Every other filter should be placed inside of a Bool.

Combining Bool with And/Or/Not

You’ll often run into situations where you need several BitSet filters and several Non-BitSet filters. In these situations, combine Bool with And/Or/Not where appropriate. Always wrap the entire composition with an And/Or/Not. So for example, if you have:

  • Gender: Male
  • Age: 23-26
  • Language: English
  • Custom Script
  • Geo

Your filter list may look something like this:

{
  "and" : [
    {
      "bool" : {
        "must" : [
          { "term" : {} },
          { "range" : {} },
          { "term" : {} }
        ]
      }
    },
    { "custom_script" : {} },
    { "geo_distance" : {} }
  ]
}

Conclusion

Filters enable you to find documents that you want, but also boost query performance by offloading simple exclusion tasks to fast BitSet operations. When composing filters, make sure you spend a few extra minutes organizing them to use the appropriate class of aggregating filter:

  • Geo, Script or Numeric_range filter: Use And/Or/Not Filters
  • Everything else: Use Bool Filter

Some additional sources: