Edge n-gram tokenizer | Elasticsearch Guide [7.5]

IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.


Edge n-gram tokenizer

The edge_ngram tokenizer first breaks text into words whenever it encounters a character from a specified list, then emits N-grams of each word, with the start of every N-gram anchored to the beginning of the word.
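
As a rough illustration of what "anchored to the beginning of the word" means, the following Python sketch emits the prefixes of a single word. It only mimics the behaviour described above and is not the Elasticsearch implementation; the function name edge_ngrams is hypothetical.

```python
def edge_ngrams(word, min_gram=1, max_gram=2):
    """Return the prefixes of `word` with lengths from min_gram to max_gram.

    Hypothetical helper for illustration only; it is not part of any
    Elasticsearch client or API.
    """
    # A prefix cannot be longer than the word itself.
    longest = min(max_gram, len(word))
    # Every gram starts at index 0, which is what makes it an "edge" N-gram.
    return [word[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("Quick", min_gram=2, max_gram=4))  # ['Qu', 'Qui', 'Quic']
```

Because every slice starts at index 0, a gram such as "uic" from the middle of the word is never produced.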

Edge N-grams are useful for search-as-you-type queries.

When you need search-as-you-type for text that has a widely known order, such as movie or song titles, the completion suggester is a much more efficient choice than edge N-grams. Edge N-grams have the advantage when you need to autocomplete words that can appear in any order.

Example output

With the default settings, the edge_ngram tokenizer treats the initial text as a single token and produces N-grams with minimum length 1 and maximum length 2:

POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "Quick Fox"
}

The above request produces the following terms:

[ Q, Qu ]

These default gram lengths are almost entirely useless. You need to configure the edge_ngram tokenizer before using it.
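
One possible configuration is sketched below as a Python dict that serializes to the JSON body of an index-creation request. The names my_analyzer and my_tokenizer are hypothetical, and the gram lengths (2 and 10) and character classes are illustrative choices, not recommendations; the parameters themselves are described under Configuration.

```python
import json

# Index settings defining a custom edge_ngram tokenizer and an analyzer
# that uses it. "my_analyzer" and "my_tokenizer" are made-up names.
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {"tokenizer": "my_tokenizer"}
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 2,   # illustrative value
                    "max_gram": 10,  # illustrative value
                    "token_chars": ["letter", "digit"],
                }
            },
        }
    }
}

# The serialized string is what would be sent as the request body.
body = json.dumps(settings, indent=2)
print(body)
```

With token_chars set to letter and digit, the text is split on everything else (whitespace, punctuation, symbols), so each word gets its own set of grams instead of the whole input being treated as one token.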

Configuration

The edge_ngram tokenizer accepts the following parameters:

min_gram

Minimum length of a gram, in characters. Defaults to 1.

max_gram

Maximum length of a gram, in characters. Defaults to 2.

See Limitations of the max_gram parameter.

token_chars

Character classes that should be included in a token. Elasticsearch will split on characters that don’t belong to the classes specified. Defaults to [] (keep all characters).

Character classes may be any of the following:

letter — for example a, b, ï or 京
digit — for example 3 or 7
whitespace — for example " " or "\n"
punctuation — for example ! or "
symbol — for example $ or √
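
To make the effect of token_chars concrete, here is a minimal Python sketch that classifies characters and splits text on any character outside the chosen classes. The mapping from Unicode general categories to the five classes is an approximation assumed for illustration and may not match Elasticsearch exactly; the names char_class and split_on_token_chars are hypothetical.

```python
import unicodedata


def char_class(ch):
    """Approximate the character class of a single character.

    Assumption: classes are derived from Unicode general categories.
    """
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return "letter"
    if category == "Nd":
        return "digit"
    if ch.isspace():
        return "whitespace"
    if category.startswith("P"):
        return "punctuation"
    if category.startswith("S"):
        return "symbol"
    return None


def split_on_token_chars(text, token_chars):
    """Split text wherever a character falls outside the given classes."""
    # An empty list keeps all characters, so the text stays one token.
    if not token_chars:
        return [text] if text else []
    tokens, current = [], []
    for ch in text:
        if char_class(ch) in token_chars:
            current.append(ch)
        elif current:
            # A character outside the classes ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(split_on_token_chars("2 Quick Foxes.", ["letter", "digit"]))  # ['2', 'Quick', 'Foxes']
print(split_on_token_chars("2 Quick Foxes.", []))  # ['2 Quick Foxes.']
```

The second call shows why the default settings treat the whole input as a single token: with no classes specified, nothing counts as a separator.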
