This documentation contains work-in-progress information for future Elastic Stack and Cloud releases. Use the version selector to view supported release docs. It also contains some Elastic Cloud serverless information. Check out our serverless docs for more details.
Character group tokenizer
editCharacter group tokenizeredit
The char_group
tokenizer breaks text into terms whenever it encounters a
character which is in a defined set. It is mostly useful for cases where a simple
custom tokenization is desired, and the overhead of use of the pattern
tokenizer
is not acceptable.
Configurationedit
The char_group
tokenizer accepts one parameter:
|
A list containing a list of characters to tokenize the string on. Whenever a character
from this list is encountered, a new token is started. This accepts either single
characters like e.g. |
|
The maximum token length. If a token is seen that exceeds this length then
it is split at |
Example outputedit
response = client.indices.analyze( body: { tokenizer: { type: 'char_group', tokenize_on_chars: [ 'whitespace', '-', "\n" ] }, text: 'The QUICK brown-fox' } ) puts response
POST _analyze { "tokenizer": { "type": "char_group", "tokenize_on_chars": [ "whitespace", "-", "\n" ] }, "text": "The QUICK brown-fox" }
returns
{ "tokens": [ { "token": "The", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 }, { "token": "QUICK", "start_offset": 4, "end_offset": 9, "type": "word", "position": 1 }, { "token": "brown", "start_offset": 10, "end_offset": 15, "type": "word", "position": 2 }, { "token": "fox", "start_offset": 16, "end_offset": 19, "type": "word", "position": 3 } ] }