This Week in Elasticsearch and Apache Lucene - 2017-02-20

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

At last! It's possible to search for multitoken synonyms in #Elasticsearch and #Lucene, and get the correct hits: https://t.co/Qh5xF2TKSs

— elastic (@elastic) February 14, 2017

Query_string to generate phrase queries for multi-term synonyms

Query parsers are getting a new option that allows them to produce phrase queries when multi-term synonyms are encountered. For instance, say your analysis chain configures ny and new york as synonyms and a user searches for ny city. The produced query will now be ((ny OR "new york") AND city) instead of ((ny OR (new AND york)) AND city). This matters because the query no longer matches documents where new and york merely appear somewhere in the text, such as one about something new that happened at York City FC.
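To make the difference concrete, here is a minimal sketch that evaluates both query shapes against a few sample documents. It is a toy model, not Lucene or Elasticsearch code: the whitespace tokenizer and the helper names (`tokens`, `has_term`, `has_phrase`, `old_query`, `new_query`) are assumptions made up for this illustration.

```python
def tokens(text):
    """Lowercase and split on whitespace; a stand-in for a real analysis chain."""
    return text.lower().split()


def has_term(doc, term):
    """True if the single term occurs anywhere in the document tokens."""
    return term in doc


def has_phrase(doc, phrase):
    """True if the words of `phrase` occur consecutively in the document tokens."""
    n = len(phrase)
    return any(doc[i:i + n] == phrase for i in range(len(doc) - n + 1))


def old_query(doc):
    # ((ny OR (new AND york)) AND city)
    synonym = has_term(doc, "ny") or (has_term(doc, "new") and has_term(doc, "york"))
    return synonym and has_term(doc, "city")


def new_query(doc):
    # ((ny OR "new york") AND city)
    synonym = has_term(doc, "ny") or has_phrase(doc, ["new", "york"])
    return synonym and has_term(doc, "city")


if __name__ == "__main__":
    samples = [
        "Best pizza in New York City",
        "NY city guide",
        "Something new happened at York City FC",
    ]
    for text in samples:
        doc = tokens(text)
        print(f"{text!r}: old={old_query(doc)} new={new_query(doc)}")
```

In this sketch only the last sample is treated differently: the conjunction of new and york accepts it, while the phrase form rejects it because the two words are not adjacent.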

Safe regex tokenizer

Regular expressions are powerful for tokenizing text, either by identifying which sequences of characters define a token, or by reversing that and identifying the characters that should split tokens. Lucene's PatternTokenizer offers this functionality, using the JDK's built-in regular expressions (java.util.regex.*). Unfortunately, since the JDK's implementation uses non-deterministic finite state automata (NFAs), it is vulnerable to nasty adversarial cases that can take exponentially long to run on certain text, and our users have hit this. So for Lucene 6.5.0 we've added a new SimplePatternTokenizer, using Lucene's regular expressions and deterministic finite state automata (DFAs) instead. This means overly complex regular expressions will fail to determinize and will be detected up front, instead of later on with an unlucky document text. Once the DFA is successfully compiled, tokenization is extremely fast. Unfortunately, since Lucene's DFAs do not support capture groups, we can't yet offer a similar version for PatternCaptureGroupTokenFilter. Fortunately, there is a nice paper describing tagged NFAs that looks like a relatively straightforward approach to make capture groups work with Lucene: patches welcome!
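The gap between the two approaches can be illustrated with a small step-counting sketch. This is a toy model under stated assumptions, not the JDK's or Lucene's actual implementation: `backtracking_steps` is a hypothetical hand-written backtracking matcher for the classic adversarial pattern `(a+)+b`, and `dfa_steps` walks a hand-built DFA for the same language (`a+b`). Both check whether a prefix of the text matches.

```python
def backtracking_steps(text):
    """Match (a+)+b at the start of `text` by naive backtracking.

    Returns (matched, number of recursive calls made).
    """
    steps = 0

    def match_groups(pos):
        # Try one or more (a+) groups starting at pos, followed by 'b'.
        nonlocal steps
        steps += 1
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Greedy first: take the longest run of a's, then back off one char at a time.
        for stop in range(end, pos, -1):
            # After this group, either the pattern finishes with 'b' ...
            if stop < len(text) and text[stop] == "b":
                return True
            # ... or another (a+) group follows.
            if match_groups(stop):
                return True
        return False

    matched = match_groups(0)
    return matched, steps


# Hand-built DFA for a+b: state -> {character: next state}; state 2 accepts.
DFA = {0: {"a": 1}, 1: {"a": 1, "b": 2}}
ACCEPT = {2}


def dfa_steps(text):
    """Match a+b at the start of `text` with one table lookup per character.

    Returns (matched, number of characters consumed).
    """
    state, steps = 0, 0
    for ch in text:
        steps += 1
        state = DFA.get(state, {}).get(ch)
        if state is None:
            return False, steps
        if state in ACCEPT:
            return True, steps
    return False, steps


if __name__ == "__main__":
    for n in (4, 8, 12, 16):
        text = "a" * n  # no trailing 'b', so every split of the run must fail
        print(n, backtracking_steps(text)[1], dfa_steps(text)[1])
```

On a run of n a's with no trailing b, the backtracking model makes 2^n calls (1024 for n = 10) before giving up, while the DFA inspects each character exactly once. Real regex engines apply optimizations this sketch ignores, so the numbers show the shape of the problem rather than measured behavior.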

Changes in 5.3:

Changes in 5.x:

Changes in master:

Upcoming changes:

Apache Lucene

Watch This Space

Stay tuned to this blog, where we'll share more on the whole Elastic ecosystem, including news, learning resources and cool use cases!