Word Delimiter Token Filter

A token filter of type word_delimiter that splits words into subwords and performs optional transformations on subword groups. Words are split into subwords according to the following rules:

  • split on intra-word delimiters (by default, all non-alphanumeric characters): "Wi-Fi" → "Wi", "Fi"
  • split on case transitions: "PowerShot" → "Power", "Shot"
  • split on letter-number transitions: "SD500" → "SD", "500"
  • leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, dude" → "hello", "there", "dude"
  • a trailing "'s" is removed from each subword: "O’Neil’s" → "O", "Neil"
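
The rules above can be approximated in a few lines of Python. This is only a rough sketch for ASCII input, not the filter's actual implementation: the function name split_subwords and the regular expressions are assumptions made for illustration, and non-ASCII letters are simply treated as delimiters here.

```python
import re

# A trailing possessive "'s" (straight or curly apostrophe) at the end of a subword.
_POSSESSIVE = re.compile(r"['\u2019][sS](?![A-Za-z0-9])")
# Any run of non-alphanumeric characters acts as an intra-word delimiter.
_DELIMITERS = re.compile(r"[^A-Za-z0-9]+")
# A digit run, an uppercase run with an optional lowercase tail, or a lowercase run.
# This breaks on lower-to-upper case changes and on letter/number transitions.
_PARTS = re.compile(r"[0-9]+|[A-Z]+[a-z]*|[a-z]+")


def split_subwords(text):
    """Approximate the default splitting rules (hypothetical helper, ASCII only)."""
    text = _POSSESSIVE.sub("", text)
    parts = []
    for chunk in _DELIMITERS.split(text):
        # Empty chunks (from leading or trailing delimiters) yield no parts.
        parts.extend(_PARTS.findall(chunk))
    return parts


for example in ["Wi-Fi", "PowerShot", "SD500", "//hello---there, dude", "O\u2019Neil\u2019s"]:
    print(example, "->", split_subwords(example))
```

Running the sketch on the examples from the list reproduces the subwords shown there, which makes it a convenient way to reason about what a given input is likely to produce before testing it against a real index.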

Parameters include:

generate_word_parts: If true, causes parts of words to be generated: "PowerShot" ⇒ "Power" "Shot". Defaults to true.
generate_number_parts: If true, causes number subwords to be generated: "500-42" ⇒ "500" "42". Defaults to true.
catenate_words: If true, causes maximum runs of word parts to be catenated: "wi-fi" ⇒ "wifi". Defaults to false.
catenate_numbers: If true, causes maximum runs of number parts to be catenated: "500-42" ⇒ "50042". Defaults to false.
catenate_all: If true, causes all subword parts to be catenated: "wi-fi-4000" ⇒ "wifi4000". Defaults to false.
split_on_case_change: If true, causes "PowerShot" to be split into two tokens ("Power-Shot" is split into two parts regardless). Defaults to true.
preserve_original: If true, includes the original word alongside its subwords: "500-42" ⇒ "500-42" "500" "42". Defaults to false.
split_on_numerics: If true, causes "j2se" to be split into three tokens: "j" "2" "se". Defaults to true.
stem_english_possessive: If true, causes a trailing "'s" to be removed from each subword: "O’Neil’s" ⇒ "O", "Neil". Defaults to true.
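
As an illustration of how these parameters fit together, the following Python sketch builds a filter definition from the defaults listed above and serialises it to JSON. The keys are the filter's parameter names in the same order as the descriptions above; the helper word_delimiter_filter is hypothetical and exists only for this example.

```python
import json

# Defaults as stated in the parameter descriptions above.
DEFAULTS = {
    "generate_word_parts": True,
    "generate_number_parts": True,
    "catenate_words": False,
    "catenate_numbers": False,
    "catenate_all": False,
    "split_on_case_change": True,
    "preserve_original": False,
    "split_on_numerics": True,
    "stem_english_possessive": True,
}


def word_delimiter_filter(**overrides):
    """Return a filter definition: defaults plus overrides (hypothetical helper)."""
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise ValueError(f"unknown word_delimiter parameters: {unknown}")
    return {"type": "word_delimiter", **DEFAULTS, **overrides}


# Example: also emit "wifi" for "wi-fi" and keep the original token.
definition = word_delimiter_filter(catenate_words=True, preserve_original=True)
print(json.dumps(definition, indent=2))
```

Spelling out every default is not required in a real filter definition; it is done here only so that the effective configuration is visible in one place.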

Advanced settings include:

protected_words: A list of words protected from being delimited. Either specify an array, or set protected_words_path, which resolves to a file containing the protected words (one per line). The path automatically resolves to a config/-based location if the file exists there.
type_table: A custom type mapping table, for example (when configured using type_table_path):
    # Map the $, %, '.', and ',' characters to DIGIT
    # This might be useful for financial data.
    $ => DIGIT
    % => DIGIT
    . => DIGIT
    \\u002C => DIGIT

    # in some cases you might not want to split on ZWJ
    # this also tests the case where we need a bigger byte[]
    # see http://en.wikipedia.org/wiki/Zero-width_joiner
    \\u200D => ALPHANUM
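
A filter definition that uses these advanced settings might look like the following Python sketch, which prints the JSON body. The protected words and the file name analysis/word_delimiter_types.txt are hypothetical values chosen for this example; the file is assumed to hold a mapping table like the one above.

```python
import json

advanced_filter = {
    "type": "word_delimiter",
    # Inline array form of the protected-words list (example values).
    "protected_words": ["Wi-Fi", "PowerShot"],
    # Hypothetical file name, assumed to sit under the config/ directory.
    "type_table_path": "analysis/word_delimiter_types.txt",
}

print(json.dumps(advanced_filter, indent=2))
```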

Using a tokenizer like the standard tokenizer may interfere with the catenate_* and preserve_original parameters, as the original string may already have lost punctuation during tokenization. Instead, you may want to use the whitespace tokenizer.
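
A minimal sketch of such a setup, written as a Python dictionary that serialises to an index settings body, is shown below. The names my_word_delimiter and my_analyzer are hypothetical; the point is only that the custom analyzer pairs the whitespace tokenizer with the word_delimiter filter, so punctuation is still present when the filter runs.

```python
import json

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Hypothetical filter name.
                "my_word_delimiter": {
                    "type": "word_delimiter",
                    "catenate_all": True,
                    "preserve_original": True,
                }
            },
            "analyzer": {
                # Hypothetical analyzer name.
                "my_analyzer": {
                    "type": "custom",
                    # Splits on whitespace only, so "wi-fi-4000" reaches the
                    # filter with its hyphens intact.
                    "tokenizer": "whitespace",
                    "filter": ["my_word_delimiter"],
                }
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```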