The analyze API performs analysis on a text string and returns the resulting tokens.
Generating an excessive number of tokens can cause a node to run out of memory.
The index.analyze.max_token_count setting limits the number of tokens that can be produced.
If more tokens than that limit are generated, an error occurs.
The _analyze endpoint without a specified index always uses 10,000 as its limit.
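For example, the limit can be adjusted per index by setting index.analyze.max_token_count in the index settings. The sketch below assumes a hypothetical index named my-index-000001 and an illustrative limit of 20000; adapt both to your own deployment:

PUT /my-index-000001
{
  "settings": {
    "index.analyze.max_token_count": 20000
  }
}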
## Required authorization
- Index privileges: index
Path parameters
- index (string): Index used to derive the analyzer. If specified, the analyzer or field parameter overrides this value. If no index is specified or the index does not have a default analyzer, the analyze API uses the standard analyzer.
Query parameters
- index (string): Index used to derive the analyzer. If specified, the analyzer or field parameter overrides this value. If no index is specified or the index does not have a default analyzer, the analyze API uses the standard analyzer.
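As a sketch of how the index parameter is used, the following request derives the analyzer from a field mapping in a specific index. The index name my-index-000001, the field name title, and the sample text are illustrative, not part of the API:

GET /my-index-000001/_analyze
{
  "field": "title",
  "text": "Quick Brown Foxes"
}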
Body
- analyzer (string): The name of the analyzer that should be applied to the provided text. This could be a built-in analyzer, or an analyzer that has been configured in the index.
- attributes (array of strings): Array of token attributes used to filter the output of the explain parameter.
- char_filter (array): Array of character filters used to preprocess characters before the tokenizer.
- explain (boolean): If true, the response includes token attributes and additional details.
- field (string | array of strings): Path to a field or array of paths. Some APIs support wildcards in the path to select multiple fields.
- filter (array): Array of token filters to apply after the tokenizer. Each element is either a token filter name (TokenFilter, a string) or one of the following token filter objects: ApostropheTokenFilter, ArabicNormalizationTokenFilter, AsciiFoldingTokenFilter, CjkBigramTokenFilter, CjkWidthTokenFilter, ClassicTokenFilter, CommonGramsTokenFilter, ConditionTokenFilter, DecimalDigitTokenFilter, DelimitedPayloadTokenFilter, EdgeNGramTokenFilter, ElisionTokenFilter, FingerprintTokenFilter, FlattenGraphTokenFilter, GermanNormalizationTokenFilter, HindiNormalizationTokenFilter, HunspellTokenFilter, HyphenationDecompounderTokenFilter, IndicNormalizationTokenFilter, KeepTypesTokenFilter, KeepWordsTokenFilter, KeywordMarkerTokenFilter, KeywordRepeatTokenFilter, KStemTokenFilter, LengthTokenFilter, LimitTokenCountTokenFilter, LowercaseTokenFilter, MinHashTokenFilter, MultiplexerTokenFilter, NGramTokenFilter, NoriPartOfSpeechTokenFilter, PatternCaptureTokenFilter, PatternReplaceTokenFilter, PersianNormalizationTokenFilter, PorterStemTokenFilter, PredicateTokenFilter, RemoveDuplicatesTokenFilter, ReverseTokenFilter, ScandinavianFoldingTokenFilter, ScandinavianNormalizationTokenFilter, SerbianNormalizationTokenFilter, ShingleTokenFilter, SnowballTokenFilter, SoraniNormalizationTokenFilter, StemmerOverrideTokenFilter, StemmerTokenFilter, StopTokenFilter, SynonymGraphTokenFilter, SynonymTokenFilter, TrimTokenFilter, TruncateTokenFilter, UniqueTokenFilter, UppercaseTokenFilter, WordDelimiterGraphTokenFilter, WordDelimiterTokenFilter, JaStopTokenFilter, KuromojiStemmerTokenFilter, KuromojiReadingFormTokenFilter, KuromojiPartOfSpeechTokenFilter, IcuCollationTokenFilter, IcuFoldingTokenFilter, IcuNormalizationTokenFilter, IcuTransformTokenFilter, PhoneticTokenFilter, DictionaryDecompounderTokenFilter.
- normalizer (string): Normalizer to use to convert text into a single token.
- tokenizer (string | object): Either a tokenizer name (Tokenizer, a string) or one of the following tokenizer objects: CharGroupTokenizer, ClassicTokenizer, EdgeNGramTokenizer, KeywordTokenizer, LetterTokenizer, LowercaseTokenizer, NGramTokenizer, PathHierarchyTokenizer, PatternTokenizer, SimplePatternTokenizer, SimplePatternSplitTokenizer, StandardTokenizer, ThaiTokenizer, UaxEmailUrlTokenizer, WhitespaceTokenizer, IcuTokenizer, KuromojiTokenizer, NoriTokenizer.
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym_graph",
      "synonyms": ["pc => personal computer", "computer, pc, laptop"]
    }
  ],
  "text": "Check how PC synonyms work"
}
curl \
  --request POST 'http://api.example.com/{index}/_analyze' \
  --header "Content-Type: application/json" \
  --data '{
    "tokenizer": "standard",
    "filter": [
      "lowercase",
      {
        "type": "synonym_graph",
        "synonyms": ["pc => personal computer", "computer, pc, laptop"]
      }
    ],
    "text": "Check how PC synonyms work"
  }'
{
  "analyzer": "standard",
  "text": "this is a test"
}
{
  "analyzer": "standard",
  "text": [
    "this is a test",
    "the second text"
  ]
}
{
  "tokenizer": "keyword",
  "filter": [
    "lowercase"
  ],
  "char_filter": [
    "html_strip"
  ],
  "text": "this is a <b>test</b>"
}
{
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    {
      "type": "stop",
      "stopwords": [
        "a",
        "is",
        "this"
      ]
    }
  ],
  "text": "this is a test"
}
{
  "field": "obj1.field1",
  "text": "this is a test"
}
{
  "normalizer": "my_normalizer",
  "text": "BaR"
}
{
  "tokenizer": "standard",
  "filter": [
    "snowball"
  ],
  "text": "detailed output",
  "explain": true,
  "attributes": [
    "keyword"
  ]
}
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "standard",
      "tokens": [
        {
          "token": "detailed",
          "start_offset": 0,
          "end_offset": 8,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "output",
          "start_offset": 9,
          "end_offset": 15,
          "type": "<ALPHANUM>",
          "position": 1
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "snowball",
        "tokens": [
          {
            "token": "detail",
            "start_offset": 0,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 0,
            "keyword": false
          },
          {
            "token": "output",
            "start_offset": 9,
            "end_offset": 15,
            "type": "<ALPHANUM>",
            "position": 1,
            "keyword": false
          }
        ]
      }
    ]
  }
}