February 11, 2014

Troubleshooting Elasticsearch searches, for Beginners

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

We'll look at common problems when getting started with Elasticsearch.

Introduction

Elasticsearch's recent popularity is in large part due to its ease of use. It's fairly simple to get going quickly with Elasticsearch, which can be deceptive. Here at Found we've noticed some common pitfalls new Elasticsearch users encounter. Consider this article necessary reading for the new Elasticsearch user: if you don't know these basic techniques, take the time to familiarize yourself with them now. You'll save yourself a lot of distress.

Specifically, this article will focus on text transformation, more properly known as text analysis, which is where we see a lot of people get tripped up. If you have used other databases, the fact that all data is transformed before getting indexed can take some getting used to. Additionally, “schema free” means different things for different systems, and it is often confused with Elasticsearch's “schema flexible” design.

Why Isn't My Search Returning What I Expect?

Many of the questions I've come across from developers on either StackOverflow or elsewhere fall into this category. Somehow, you can't get your result even if you search for exactly what appears in your document. Not even with some wildcards added, which should really get it to work. Yet, things that don't even occur in your documents show up. Why?

This is typically related to text being “analyzed” (a type of transformation) into index terms that are different from what you search for. For example, if you have the text "Text-analyzer", the standard analyzer will produce the terms "text" and "analyzer" for indexing. That's what can be searched for. The term "Text-analyzer" is never added to the index. Anything that tries to look for that exact term in the index will not find anything.

Consequently, it's important that the way in which the text is analyzed when searching is compatible with how the text was transformed during indexing. In the previous simple example, this means that terms must be correctly tokenized and lowercased. Some query types perform text analysis automatically, while most do not. None of the filters perform any text processing.

Specifically, the match and query_string family of queries process text. That is why a match query for "Text-analyzer" will return results, while a term query or filter for exactly the same will not. The match query uses the same analyzer and is looking up "text" and "analyzer" in the index, while a term query for "Text-analyzer" looks for exactly that, which does not exist.
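The difference between the two query types can be sketched with a toy inverted index. This is not how Elasticsearch is implemented: the analyze function below is a simplified stand-in for the standard analyzer (split on non-alphanumeric characters, then lowercase), and build_index, term_query and match_query are hypothetical helpers made up for this illustration.

```python
import re

def analyze(text):
    """Toy stand-in for the standard analyzer: split on non-alphanumerics, lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

def build_index(docs):
    """Map each index term to the set of ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def term_query(index, value):
    """Look up the value exactly as given, without any analysis."""
    return index.get(value, set())

def match_query(index, text):
    """Analyze the input first, then OR together the lookups of the resulting terms."""
    hits = set()
    for term in analyze(text):
        hits |= index.get(term, set())
    return hits

docs = {1: "Text-analyzer", 2: "Something else"}
index = build_index(docs)

print(sorted(index))                        # the terms that were actually indexed
print(term_query(index, "Text-analyzer"))   # the exact string is not an index term
print(match_query(index, "Text-analyzer"))  # the analyzed terms are found
```

The term lookup comes back empty because only "text" and "analyzer" exist as terms, while the match lookup analyzes its input the same way the document was analyzed and finds document 1.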

Many users try to fix this by adding wildcards around their search input, akin to the SQL WHERE column ILIKE '%query%' which slows SQL databases to a crawl. This only guarantees that searches will be slow, not that you find what you want! For example, a wildcard search for *Text-analyzer* would have to look through all the terms in the index, most likely not finding it.
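To see why the wildcard approach is both slow and unhelpful, here is a toy model in which a pattern with a leading wildcard is checked against every term in the index. The real implementation in Lucene is more sophisticated, so treat this only as an illustration of the cost; wildcard_scan and the term list are hypothetical.

```python
from fnmatch import fnmatchcase

def wildcard_scan(terms, pattern):
    """Check every term against the pattern; return (matches, number of terms examined)."""
    examined = 0
    matches = []
    for term in terms:
        examined += 1
        if fnmatchcase(term, pattern):  # case-sensitive, like a lookup of raw terms
            matches.append(term)
    return matches, examined

# Terms produced by analyzing "Text-analyzer" and "Something else".
indexed_terms = ["analyzer", "else", "something", "text"]

matches, examined = wildcard_scan(indexed_terms, "*Text-analyzer*")
print(matches, examined)
```

Every term had to be examined, and none matched, because no single index term contains the string "Text-analyzer".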

Problems are also frequently encountered due to the standard analyzer performing stop word removal, i.e. removing words like "it", "no", "be" and so on. This can be very annoying when you index e.g. country codes. standard is currently the default analyzer. While this will change in Elasticsearch 1.0, relying on the default analyzer is rarely what you want. While Elasticsearch can do (and does) a good job of guessing types for non-string values, it can't possibly know how to properly treat your text. Is it a name, a tag or prose?
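The country code problem can be illustrated with a small simulation. The stop word set below is what we believe to be Lucene's default English list, reproduced from memory, so treat it as an assumption; analyze_with_stop_words is a hypothetical, simplified stand-in for an analyzer that lowercases and then drops stop words.

```python
import re

# Assumed default English stop words (reproduced from memory; verify before relying on it).
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will", "with",
}

def analyze_with_stop_words(text):
    """Tokenize, lowercase, and drop stop words."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    return [t for t in tokens if t not in ENGLISH_STOP_WORDS]

country_codes = ["NO", "IT", "BE", "SE", "DK"]
surviving = {code: analyze_with_stop_words(code) for code in country_codes}

for code, terms in surviving.items():
    print(code, terms)
```

In this sketch Norway, Italy and Belgium leave no terms behind at all, so no query could ever find them, while "SE" and "DK" survive as "se" and "dk".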

On the flip side, sometimes, when the mappings are wrong, you'll find that documents that should not match, in fact do. For example, if the date "2014-01-02" is indexed as a string with the standard analyzer, a match query for 2014-02-01 (a completely different date) will match, because the query is for the terms "2014", "02" and "01" as an OR.
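The date example can be reproduced with the same kind of toy tokenizer. The analyze function is again a simplified stand-in for the standard analyzer, assumed here to split on the hyphens as described above.

```python
import re

def analyze(text):
    """Toy stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

indexed_terms = set(analyze("2014-01-02"))  # what ends up in the index
query_terms = analyze("2014-02-01")         # what a match query looks for

or_match = any(term in indexed_terms for term in query_terms)
and_match = all(term in indexed_terms for term in query_terms)
exact_match = "2014-01-02" == "2014-02-01"  # comparison on the untokenized value

print(sorted(indexed_terms), query_terms)
print(or_match, and_match, exact_match)
```

Note that in this model even requiring all terms would not help, since both dates produce the same three terms. Only when the value is kept as a single unit, for example by mapping the field as a date, can the two be told apart.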

Meticulous Mappings

As shown, incorrectly processed text is the root cause of many of the most common problems. It is in turn caused by incorrect or incomplete mappings, exacerbated by relying on the “schema free”-ness of dynamic mapping.

Changing mappings is a lot of work that can take considerable planning to do in production. This leads to many questions ending with “… without changing the mapping”. Adjusting searches to deal with suboptimal mappings is a slippery slope, leading to overcomplicated queries and poor performance. Don't do that.

There is no such thing as a schema free database that can deal very well with textual search. We need to index according to how we are going to search, instead of trying to construct searches according to how things are stored.

There are several good resources from which you can learn about text processing and mapping with Elasticsearch. While you do not need to understand mapping to quickly get data into Elasticsearch and do basic searches, it is very important to get a grasp of at least the basics of text analysis and mapping:

An Introduction to Elasticsearch Mapping covers how mappings work and are specified, while A Data Exploration Workflow for Mappings covers how to experiment with them.
All About Analyzers, Part One provides a good foundation on how text is processed, step by step.

Finally, Clinton Gormley's article on changing mappings with zero downtime is well worth a read.

Key/Value Woes

Another category of problems we see is caused by developers trying to use Elasticsearch as a generic key-value store, again relying on the “schema free”-ness. If the keys are actually based on a value, the mapping Elasticsearch derives from your data will grow without bound.

For example, let's assume you have documents of the following form. This could be an application where users can define their own questionnaires. Here, "favorite_character" and "best_quote" are keys provided by the creator of the questionnaire; the keys below answers can be anything specified by a user of the application.

# First document
answered_by: Some guy
questionaire_id: 123
answers:
    favorite_character: Tyrion Lannister
    best_quote: A mind needs books as a sword needs a whetstone, if it is to keep its edge.
---
# Second document
answered_by: Someone else
questionaire_id: 42
answers:
    # Completely different keys here
    search_experience: 8
    familiar_buzzwords: [nosql, sql, devops, big data]

Initially, this can seem like a good idea, letting your users easily create custom forms and searches without having a rigid schema.

The problem, however, is that every key (e.g. "answers.favorite_character", "answers.familiar_buzzwords", …) will end up in the index's mappings. This isn't a problem when the amount of data is small, but it will quickly become one as the mapping grows. A few thousand questionnaires with custom schemas later, you'll start noticing that there's a memory cost to every field, and that mappings are included in the cluster state, which is continuously distributed between nodes.
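The growth is easy to observe by collecting the field paths that a dynamic mapping would need for a stream of documents. The sketch below is only an approximation of that bookkeeping: field_paths is a hypothetical helper, and the documents are trimmed-down versions of the two examples above.

```python
def field_paths(doc, prefix=""):
    """Return the dotted path of every leaf field in a (possibly nested) document."""
    paths = set()
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths |= field_paths(value, prefix=path + ".")
        else:
            paths.add(path)
    return paths

docs = [
    {"answered_by": "Some guy",
     "answers": {"favorite_character": "Tyrion Lannister",
                 "best_quote": "A mind needs books ..."}},
    {"answered_by": "Someone else",
     "answers": {"search_experience": 8,
                 "familiar_buzzwords": ["nosql", "sql", "devops", "big data"]}},
]

mapping_fields = set()
sizes = []
for doc in docs:
    mapping_fields |= field_paths(doc)
    sizes.append(len(mapping_fields))  # number of fields after each document

print(sizes)
print(sorted(mapping_fields))
```

Each questionnaire with new keys adds new fields, so the set only ever grows with the number of distinct keys users invent.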

If you have keys that change based on your values, you almost always want to restructure your documents so the keys are fixed. You probably want to learn about nested documents. They are outside the scope of this article, but the above example could be rewritten to the following when using nested documents. The mapping will then only have properties for "answers.key", "answers.keyword_value" and "answers.text_value" and will not grow as you add more documents. You can also easily customize mappings so that "keyword_value" can be indexed as not_analyzed and thus facetable, while "text_value" uses a custom analyzer.

answered_by: Some guy
answers:
    - key: favorite_character
      keyword_value: Tyrion Lannister
    - key: best_quote
      text_value: A mind needs books as a sword needs a whetstone, if it is to keep its edge.
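Restructuring existing documents into this shape is mechanical. Here is a minimal sketch, assuming the application knows which keys hold free text: to_nested_answers and its text_keys parameter are hypothetical and not part of any Elasticsearch API.

```python
def to_nested_answers(answers, text_keys=frozenset()):
    """Turn {key: value} answers into a list of objects with fixed field names.

    Keys listed in text_keys are stored under "text_value"; everything else
    goes under "keyword_value". How to tell them apart is an application decision.
    """
    nested = []
    for key, value in answers.items():
        value_field = "text_value" if key in text_keys else "keyword_value"
        nested.append({"key": key, value_field: value})
    return nested

original = {
    "answered_by": "Some guy",
    "answers": {
        "favorite_character": "Tyrion Lannister",
        "best_quote": "A mind needs books as a sword needs a whetstone, "
                      "if it is to keep its edge.",
    },
}

restructured = {
    "answered_by": original["answered_by"],
    "answers": to_nested_answers(original["answers"], text_keys={"best_quote"}),
}
print(restructured)
```

Whatever keys users come up with, the only field names the documents contain under answers are "key", "keyword_value" and "text_value".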

Here are some good resources on using nested documents or parent/child-relations with Elasticsearch:

Managing relations inside elasticsearch by Zachary Tong
Document relations with Elasticsearch (video) by Martijn van Groningen at Berlin Buzzwords 2013

Why Isn't the Document I Expect the Top Hit?

Relevancy and scoring are complex topics which have been the subject of much research. A document's score is a combination of textual similarity (which is what information retrieval texts largely cover), and metadata based scores (or “signals”) influenced by things like a document's view count, likes, and location in the information structure. Combining, tweaking, and maintaining these models is an art in itself, and something we will cover extensively in future articles.

Inspecting the number crunching and seeing what terms and properties influence a document's score can be done by setting explain to true on your search object. Unfortunately, there are no tools available at this point to make the output easy to digest. (There is an explain visualizer for Solr which can be worth having a look at to better grasp the concepts, which are the same.)
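As a sketch, a search body with explain enabled could be built as follows. The query text and the field names title and lyrics are placeholders chosen for illustration, and sending the body to a cluster is left out since it depends on your client and index.

```python
import json

# Search body with per-hit score explanations turned on.
search_body = {
    "explain": True,
    "query": {
        "multi_match": {
            "query": "never",               # placeholder search text
            "fields": ["title^3", "lyrics"],  # boost hits in the title by 3
        }
    },
}

# Serialized form, ready to be sent as the body of a search request.
print(json.dumps(search_body, indent=2))
```

With explain set, each hit in the response should carry an explanation tree describing how its score was computed.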

Below is an example of a somewhat simplified explain tree that shows how scores are calculated for a simple multi_match query on title^3 and lyrics, i.e. hits in the title are boosted. These trees quickly get quite large, with queries that match on many fields and combine various signals using e.g. function score queries. Nevertheless, they are very useful to quickly spot which factors dominate the score calculation. In this example, as you would expect, the match in the title field contributes considerably more to the total score than the hit in lyrics.

value: 0.16166611
description: 'sum of:'
details:
- value: 0.12735893
  description: 'weight(title:never^3.0 in 0) [PerFieldSimilarity], result of:'
  details:
  - value: 0.12735893
    description: 'score(doc=0,freq=1.0 = termFreq=1.0), product of:'
    details:
    - value: 0.94868326
      description: 'queryWeight, product of:'
      details:
      - description: boost
        value: 3.0
      - description: idf(docFreq=1, maxDocs=1)
        value: 0.30685282
      - description: queryNorm
        value: 1.0305519
    - ... # Omitted
- value: 0.034307186
  description: 'weight(lyrics:never in 0) [PerFieldSimilarity], result of:'
  details:
    # ...
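Since the raw output is hard to digest, a few lines of code can summarize which top-level clauses dominate. The sketch below assumes the explanation has been parsed into nested dictionaries with value, description and details keys, as in the tree above; contributions is a hypothetical helper, and the tree is trimmed to its top two levels.

```python
explanation = {
    "value": 0.16166611,
    "description": "sum of:",
    "details": [
        {"value": 0.12735893,
         "description": "weight(title:never^3.0 in 0) [PerFieldSimilarity], result of:"},
        {"value": 0.034307186,
         "description": "weight(lyrics:never in 0) [PerFieldSimilarity], result of:"},
    ],
}

def contributions(explanation):
    """Return (description, value, share of total) per top-level detail, largest first."""
    total = explanation["value"]
    rows = [
        (detail["description"], detail["value"], detail["value"] / total)
        for detail in explanation.get("details", [])
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

rows = contributions(explanation)
for description, value, share in rows:
    print(f"{share:6.1%}  {value:.8f}  {description}")
```

Here the title clause accounts for roughly four fifths of the total score, which matches what the boost is meant to achieve.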

For more information on what the different factors like e.g. idf mean, Similarity in Elasticsearch is a good read.
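The numbers in the tree can be checked by hand. The sketch below assumes the idf formula of Lucene's classic TF-IDF similarity, 1 + ln(maxDocs / (docFreq + 1)); that formula is an assumption on our part, although it does reproduce the values shown above. The queryNorm value is simply copied from the tree.

```python
import math

def classic_idf(doc_freq, max_docs):
    """Inverse document frequency as assumed for the classic TF-IDF similarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

boost = 3.0
query_norm = 1.0305519                      # copied from the explain tree
idf = classic_idf(doc_freq=1, max_docs=1)

# queryWeight is described as the product of boost, idf and queryNorm.
query_weight = boost * idf * query_norm

print(round(idf, 6), round(query_weight, 6))
```

Both values agree with the tree to within floating point rounding: about 0.306853 for idf and about 0.948683 for queryWeight.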

Debugging Searches

Searches quickly become big and composed of many different queries and filters. It is a good idea to start by verifying your assumptions about the innermost queries and filters, and then work your way upwards. Make sure you also include some documents that should not match when making a sample set.

As you gain more experience with Elasticsearch, you'll get a good feel for what types of searches/queries/filters/facets will stress your CPU, I/O or memory, and whether they will cause your mappings to explode as described above. But even then, you should always subject your new configuration to realistic amounts of data in a controlled environment as well.

Summary

We've briefly mentioned a few topics to motivate learning more about them. As with most interesting problems, there are no simple recipes for finding a solution.

Here are some important questions to consider whenever you encounter problems like those mentioned here:

What terms is my query or filter actually looking for in the index?
What terms are actually indexed?
Is this query or filter type analyzing the text?
Are the inner queries or filters acting like I assume they are?
Do my queries/filters AND or OR terms by default?
What parts of my query contribute most of the score? (Use Explain!)
Am I in control of what the mapping will look like, or do I rely on Elasticsearch?
Can I handle growing amounts of data with this combination of mappings and searches?

If you have other tips that have helped you debug a tricky search, please contribute in the discussion below!