---
title: Tokenizer reference
description: A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For instance,...
url: https://www.elastic.co/docs/reference/text-analysis/tokenizer-reference
products:
  - Elasticsearch
---

# Tokenizer reference
<admonition title="Difference between Elasticsearch tokenization and neural tokenization">
  Elasticsearch's tokenization process produces linguistic tokens, optimized for search and retrieval. This differs from neural tokenization in the context of machine learning and natural language processing. Neural tokenizers translate strings into smaller, subword tokens, which are encoded into vectors for consumption by neural networks. Elasticsearch does not have built-in neural tokenizers.
</admonition>

A *tokenizer* receives a stream of characters, breaks it up into individual *tokens* (usually individual words), and outputs a stream of *tokens*. For instance, a [`whitespace`](https://www.elastic.co/docs/reference/text-analysis/analysis-whitespace-tokenizer) tokenizer breaks text into tokens whenever it sees any whitespace. It would convert the text `"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.
The tokenizer is also responsible for recording the following:
- Order or *position* of each term (used for phrase and word proximity queries).
- Start and end *character offsets* of the original word that the term represents (used for highlighting search snippets).
- *Token type*, a classification of each term produced, such as `<ALPHANUM>`, `<HANGUL>`, or `<NUM>`. Simpler analyzers only produce the `word` token type.
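
As a rough illustration of the metadata listed above, the following Python sketch mimics a whitespace-style tokenizer and records each term's position, character offsets, and token type. It is not Elasticsearch's implementation; the `Token` class and `whitespace_tokenize` function are hypothetical names used only for this example.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical container for the metadata a tokenizer records."""
    term: str
    position: int      # order of the term in the stream
    start_offset: int  # index of the term's first character in the input
    end_offset: int    # index one past the term's last character
    type: str          # token type; this sketch always uses "word"


def whitespace_tokenize(text: str) -> List[Token]:
    """Split on runs of whitespace, keeping position and offsets."""
    return [
        Token(match.group(), position, match.start(), match.end(), "word")
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


tokens = whitespace_tokenize("Quick brown fox!")
for token in tokens:
    print(token)
```

The offsets index into the original string, so `text[start_offset:end_offset]` recovers the source text of each term, which is what makes highlighting possible.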

Elasticsearch has a number of built-in tokenizers that can be used to build [custom analyzers](https://www.elastic.co/docs/manage-data/data-store/text-analysis/create-custom-analyzer).

## Word Oriented Tokenizers

The following tokenizers are usually used for tokenizing full text into individual words:
<definitions>
  <definition term="Standard Tokenizer">
    The `standard` tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages.
  </definition>
  <definition term="Letter Tokenizer">
    The `letter` tokenizer divides text into terms whenever it encounters a character which is not a letter.
  </definition>
  <definition term="Lowercase Tokenizer">
    The `lowercase` tokenizer, like the `letter` tokenizer, divides text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms.
  </definition>
  <definition term="Whitespace Tokenizer">
    The `whitespace` tokenizer divides text into terms whenever it encounters any whitespace character.
  </definition>
  <definition term="UAX URL Email Tokenizer">
    The `uax_url_email` tokenizer is like the `standard` tokenizer except that it recognizes URLs and email addresses as single tokens.
  </definition>
  <definition term="Classic Tokenizer">
    The `classic` tokenizer is a grammar-based tokenizer for the English language.
  </definition>
  <definition term="Thai Tokenizer">
    The `thai` tokenizer segments Thai text into words.
  </definition>
</definitions>
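
To make the differences between the simpler splitting rules concrete, here is a minimal Python sketch that approximates the `letter`, `lowercase`, and `whitespace` behaviors described above. The function names are hypothetical, and the regular expression only approximates "letter" characters, so results may differ from Elasticsearch on unusual Unicode input.

```python
import re
from typing import List


def letter_tokenize(text: str) -> List[str]:
    """Emit runs of letters; any non-letter character ends a term.

    [^\\W\\d_] approximates "a letter": word characters minus digits
    and the underscore.
    """
    return re.findall(r"[^\W\d_]+", text)


def lowercase_tokenize(text: str) -> List[str]:
    """Same splitting rule as letter_tokenize, then lowercase each term."""
    return [term.lower() for term in letter_tokenize(text)]


def split_on_whitespace(text: str) -> List[str]:
    """Split only on whitespace, leaving digits and punctuation in place."""
    return text.split()


sample = "The 2 QUICK Brown-Foxes"
print(letter_tokenize(sample))      # ['The', 'QUICK', 'Brown', 'Foxes']
print(lowercase_tokenize(sample))   # ['the', 'quick', 'brown', 'foxes']
print(split_on_whitespace(sample))  # ['The', '2', 'QUICK', 'Brown-Foxes']
```

Note how the letter-based rules drop the digit `2` and split on the hyphen, while the whitespace rule keeps both.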


## Partial Word Tokenizers

These tokenizers break up text or words into small fragments, for partial word matching:
<definitions>
  <definition term="N-Gram Tokenizer">
    The `ngram` tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word: a sliding window of contiguous characters, e.g. with a gram length of 2, `quick` → `[qu, ui, ic, ck]`.
  </definition>
  <definition term="Edge N-Gram Tokenizer">
    The `edge_ngram` tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word which are anchored to the start of the word, e.g. `quick` → `[q, qu, qui, quic, quick]`.
  </definition>
</definitions>
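
The two windowing schemes above can be sketched in a few lines of Python. This is only an illustration of the idea applied to a single word: the function names and the `min_gram` / `max_gram` arguments here are parameters of this sketch, and the step that first splits text into words is omitted.

```python
from typing import List


def ngrams(word: str, min_gram: int, max_gram: int) -> List[str]:
    """Slide a window over the word, emitting every gram whose length
    is between min_gram and max_gram (inclusive)."""
    return [
        word[start:start + size]
        for start in range(len(word))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(word)
    ]


def edge_ngrams(word: str, min_gram: int, max_gram: int) -> List[str]:
    """Emit only the grams anchored to the start of the word."""
    longest = min(max_gram, len(word))
    return [word[:size] for size in range(min_gram, longest + 1)]


print(ngrams("quick", 2, 2))       # ['qu', 'ui', 'ic', 'ck']
print(edge_ngrams("quick", 1, 5))  # ['q', 'qu', 'qui', 'quic', 'quick']
```

Edge n-grams suit prefix matching such as search-as-you-type, because every emitted term is a prefix of the original word.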


## Structured Text Tokenizers

The following tokenizers are usually used with structured text like identifiers, email addresses, zip codes, and paths, rather than with full text:
<definitions>
  <definition term="Keyword Tokenizer">
    The `keyword` tokenizer is a no-op tokenizer that accepts whatever text it is given and outputs the exact same text as a single term. It can be combined with token filters like [`lowercase`](https://www.elastic.co/docs/reference/text-analysis/analysis-lowercase-tokenfilter) to normalize the analyzed terms.
  </definition>
  <definition term="Pattern Tokenizer">
    The `pattern` tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms.
  </definition>
  <definition term="Simple Pattern Tokenizer">
    The `simple_pattern` tokenizer uses a regular expression to capture matching text as terms. It uses a restricted subset of regular expression features and is generally faster than the `pattern` tokenizer.
  </definition>
  <definition term="Char Group Tokenizer">
    The `char_group` tokenizer is configurable through sets of characters to split on, which is usually less expensive than running regular expressions.
  </definition>
  <definition term="Simple Pattern Split Tokenizer">
    The `simple_pattern_split` tokenizer uses the same restricted regular expression subset as the `simple_pattern` tokenizer, but splits the input at matches rather than returning the matches as terms.
  </definition>
  <definition term="Path Tokenizer">
    The `path_hierarchy` tokenizer takes a hierarchical value like a filesystem path, splits on the path separator, and emits a term for each component in the tree, e.g. `/foo/bar/baz` → `[/foo, /foo/bar, /foo/bar/baz]`.
  </definition>
</definitions>
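
For intuition, the following Python sketch imitates three of the structured-text behaviors described above: a no-op keyword tokenizer, splitting on a set of characters, and path hierarchy expansion. The function names and signatures are hypothetical, and edge cases (such as trailing delimiters) may be handled differently by Elasticsearch.

```python
from typing import Iterable, List


def keyword_tokenize(text: str) -> List[str]:
    """No-op: return the whole input as a single term."""
    return [text]


def char_group_tokenize(text: str, split_chars: Iterable[str]) -> List[str]:
    """Split whenever a character from split_chars is seen; no regex needed."""
    separators = set(split_chars)
    terms: List[str] = []
    current: List[str] = []
    for ch in text:
        if ch in separators:
            if current:  # close the term in progress, skip empty terms
                terms.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        terms.append("".join(current))
    return terms


def path_hierarchy_tokenize(path: str, delimiter: str = "/") -> List[str]:
    """Emit one term per level of the hierarchy, each a prefix of the path."""
    if not path:
        return []
    terms: List[str] = []
    end = path.find(delimiter, 1)  # start at 1 to skip a leading delimiter
    while end != -1:
        terms.append(path[:end])
        end = path.find(delimiter, end + 1)
    terms.append(path)
    return terms


print(keyword_tokenize("New York"))                      # ['New York']
print(char_group_tokenize("10001-2345 user_id", "- _"))  # ['10001', '2345', 'user', 'id']
print(path_hierarchy_tokenize("/foo/bar/baz"))           # ['/foo', '/foo/bar', '/foo/bar/baz']
```

Because every path term is a prefix of the full path, a document indexed this way can be matched by any of its ancestor directories.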