Index and search analysis

Text analysis occurs at two times:

Index time
When a document is indexed, any text field values are analyzed.

Search time
When running a full-text search on a text field, the query string (the text the user is searching for) is analyzed. Search time is also called query time.

The analyzer, or set of analysis rules, used at each time is called the index analyzer or search analyzer respectively.

How the index and search analyzer work together

In most cases, the same analyzer should be used at index and search time. This ensures the values and query strings for a field are changed into the same form of tokens. In turn, this ensures the tokens match as expected during a search.

Example

A document is indexed with the following value in a text field:

The QUICK brown foxes jumped over the dog!

The index analyzer for the field converts the value into tokens and normalizes them. In this case, each of the tokens represents a word:

[ quick, brown, fox, jump, over, dog ]

These tokens are then indexed.

Later, a user searches the same text field for:

"Quick fox"

The user expects this search to match the sentence indexed earlier, "The QUICK brown foxes jumped over the dog!".

However, the query string does not contain the exact words used in the document’s original text:

  • quick vs QUICK
  • fox vs foxes

To account for this, the query string is analyzed using the same analyzer. This analyzer produces the following tokens:

[ quick, fox ]
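The two analysis steps above can be sketched with a toy analyzer. This is only an illustration: the `analyze` function, the stop-word set, and the stem table are hypothetical stand-ins invented for this example, not Elasticsearch's actual tokenizer, stop-word list, or stemmer.

```python
import re

# Toy stand-ins (assumptions for illustration only), not Elasticsearch's
# real stop-word list or stemming algorithm.
STOP_WORDS = {"the", "a", "an"}
STEMS = {"foxes": "fox", "jumped": "jump"}


def analyze(text):
    """Lowercase, split into words, drop stop words, apply the toy stems."""
    words = re.findall(r"[a-z]+", text.lower())
    return [STEMS.get(w, w) for w in words if w not in STOP_WORDS]


# Index time: analyze the field value.
print(analyze("The QUICK brown foxes jumped over the dog!"))
# ['quick', 'brown', 'fox', 'jump', 'over', 'dog']

# Search time: analyze the query string with the same function.
print(analyze("Quick fox"))
# ['quick', 'fox']
```

Because one function handles both inputs, differences in case and word form are removed on both sides before any comparison happens.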

To execute the search, Elasticsearch compares these query string tokens to the tokens indexed in the text field.

Token    Query string    text field
quick    X               X
brown    -               X
fox      X               X
jump     -               X
over     -               X
dog      -               X

Because the field value and query string were analyzed in the same way, they created similar tokens. The tokens quick and fox are exact matches. This means the search matches the document containing "The QUICK brown foxes jumped over the dog!", just as the user expects.
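The comparison can be reduced to a few lines of Python. This is a simplified model under stated assumptions: the token lists are copied from the example, and a document is treated as a match when at least one query token is indexed. Relevance scoring and query operators are ignored.

```python
# Token lists copied from the example above.
indexed_tokens = ["quick", "brown", "fox", "jump", "over", "dog"]
query_tokens = ["quick", "fox"]

# Query tokens that also appear among the indexed tokens.
shared = [t for t in query_tokens if t in indexed_tokens]
print(shared)  # ['quick', 'fox']

# Simplified matching rule (an assumption of this sketch):
# the document matches if any query token is indexed.
is_match = len(shared) > 0
print(is_match)  # True
```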

When to use a different search analyzer

While less common, it sometimes makes sense to use different analyzers at index and search time. To enable this, Elasticsearch allows you to specify a separate search analyzer.

Generally, a separate search analyzer should only be specified when using the same form of tokens for field values and query strings would create unexpected or irrelevant search matches.
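As a rough sketch of what such a configuration might look like, the index definition below is built as a Python dictionary and printed as JSON. In a field mapping, the `analyzer` parameter sets the index analyzer (and, when nothing else is specified, the search analyzer too), while `search_analyzer` overrides the analyzer used for query strings. The field name `word` and the names `prefix_index_analyzer` and `prefix_tokenizer` are hypothetical, as are the `min_gram` and `max_gram` values. The edge n-gram setup is one plausible choice for the prefix-matching scenario in the example that follows.

```python
import json

# Hypothetical index definition: edge n-grams at index time,
# the built-in `standard` analyzer at search time.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "prefix_index_analyzer": {  # hypothetical name
                    "type": "custom",
                    "tokenizer": "prefix_tokenizer",
                    "filter": ["lowercase"],
                }
            },
            "tokenizer": {
                "prefix_tokenizer": {  # hypothetical name
                    "type": "edge_ngram",
                    "min_gram": 1,   # assumed value
                    "max_gram": 20,  # assumed value
                    "token_chars": ["letter"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "word": {  # hypothetical field name
                "type": "text",
                "analyzer": "prefix_index_analyzer",  # used at index time
                "search_analyzer": "standard",        # used at search time
            }
        }
    },
}

# The JSON form is what would be sent as the body of a create-index request.
print(json.dumps(index_body, indent=2))
```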

Example

Elasticsearch is used to create a search engine that matches only words that start with a provided prefix. For instance, a search for tr should return tram or trope—but never taxi or bat.

A document is added to the search engine’s index; this document contains one such word in a text field:

"Apple"

The index analyzer for the field converts the value into tokens and normalizes them. In this case, each of the tokens represents a potential prefix for the word:

[ a, ap, app, appl, apple ]

These tokens are then indexed.
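The prefix tokens can be reproduced with a short function. This is a minimal sketch: `edge_ngrams` and its `min_gram` and `max_gram` parameters are hypothetical names chosen for this example, and the function only approximates what a prefix-producing index analyzer does.

```python
def edge_ngrams(word, min_gram=1, max_gram=20):
    """Return the lowercased prefixes of `word`, from min_gram to max_gram characters."""
    word = word.lower()
    longest = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("Apple"))  # ['a', 'ap', 'app', 'appl', 'apple']
```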

Later, a user searches the same text field for:

"appli"

The user expects this search to match only words that start with appli, such as appliance or application. The search should not match apple.

However, if the index analyzer is used to analyze this query string, it would produce the following tokens:

[ a, ap, app, appl, appli ]

When Elasticsearch compares these query string tokens to the ones indexed for apple, it finds several matches.

Token    appli    apple
a        X        X
ap       X        X
app      X        X
appl     X        X
appli    X        -

This means the search would erroneously match apple. Not only that, it would match any word starting with a.

To fix this, you can specify a different search analyzer for query strings used on the text field.

In this case, you could specify a search analyzer that produces a single token rather than a set of prefixes:

[ appli ]

This query string token would only match tokens for words that start with appli, which better aligns with the user’s search expectations.
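The difference between the two search-time choices can be simulated in a few lines. This is a toy model, not Elasticsearch itself: the function names and the list of indexed words are hypothetical, and a word counts as a match when it shares at least one token with the query.

```python
def edge_ngrams(word):
    """Toy index analyzer: every lowercased prefix of the word."""
    word = word.lower()
    return [word[:n] for n in range(1, len(word) + 1)]


def single_token(query):
    """Toy search analyzer: the whole lowercased query as one token."""
    return [query.lower()]


# Hypothetical indexed words, each mapped to its index-time prefix tokens.
indexed = {w: set(edge_ngrams(w)) for w in ["apple", "appliance", "avocado", "bat"]}


def matching_words(query_tokens):
    """Words sharing at least one token with the query (simplified matching)."""
    return sorted(w for w, tokens in indexed.items() if tokens & set(query_tokens))


# Index analyzer reused at search time: too many matches.
print(matching_words(edge_ngrams("appli")))   # ['apple', 'appliance', 'avocado']

# Separate single-token search analyzer: only true prefix matches.
print(matching_words(single_token("appli")))  # ['appliance']
```

With the prefix analyzer applied to the query, the one-letter token `a` alone is enough to match every indexed word starting with "a". The single-token analyzer removes those accidental overlaps.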