A tokenizer of type pattern that can flexibly separate text into terms via a regular expression. Accepts the following settings:
pattern: The regular expression pattern. Defaults to \W+.
flags: The regular expression flags.
group: Which group to extract into tokens. Defaults to -1 (split).
IMPORTANT: The regular expression should match the token separators, not the tokens themselves.
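To make the difference between matching separators and matching tokens concrete, here is a minimal Python sketch that emulates the two modes of the group setting with the standard re module. It is not the tokenizer's actual implementation: pattern_tokenize is a hypothetical helper, dropping empty tokens is an assumption of this sketch, and Python's regex dialect differs from Java's in some details (for example, how \W treats non-ASCII characters).

```python
import re


def pattern_tokenize(text, pattern=r"\W+", group=-1):
    """Hypothetical helper emulating split mode (group=-1) and group mode (group>=0)."""
    regex = re.compile(pattern)
    if group < 0:
        # Split mode: the pattern matches the separators, and the tokens
        # are the stretches of text between consecutive matches.
        tokens, last = [], 0
        for match in regex.finditer(text):
            tokens.append(text[last:match.start()])
            last = match.end()
        tokens.append(text[last:])
    else:
        # Group mode: the pattern matches the tokens themselves, and the
        # selected capture group (0 = the whole match) becomes the token.
        tokens = [match.group(group) for match in regex.finditer(text)]
    # Assumption of this sketch: empty or missing tokens are dropped.
    return [token for token in tokens if token]


# Split mode with the default pattern: non-word characters are separators.
print(pattern_tokenize("The foo_bar_size's default is 5."))
# ['The', 'foo_bar_size', 's', 'default', 'is', '5']

# Group mode: group 0 keeps the quote marks, group 1 strips them.
print(pattern_tokenize("aaa 'bbb' 'ccc'", r"'([^']+)'", group=0))
# ["'bbb'", "'ccc'"]
print(pattern_tokenize("aaa 'bbb' 'ccc'", r"'([^']+)'", group=1))
# ['bbb', 'ccc']
```

In split mode the unquoted aaa would survive as part of a token, whereas in group mode it is discarded because the pattern never matches it.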
Setting group to -1 (the default) is equivalent to "split". Using group >= 0 selects the matching group as the token. For example, if you have:
pattern = '([^']+)'
group   = 0
input   = aaa 'bbb' 'ccc'
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: