This Week in Elasticsearch and Apache Lucene - 2017-02-20
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Query parsers are getting a new option that allows them to produce phrase queries when multi-term synonyms are encountered. For instance, say your analysis chain configures ny and new york as synonyms and a user searches for ny city; the produced query will now be ((ny OR "new york") AND city) instead of ((ny OR (new AND york)) AND city). This is interesting because it means the query would no longer match documents talking about something new that happened at the York City FC, for instance.
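To make the difference between the two query shapes concrete, here is a minimal plain-Java sketch. It does not use Lucene's query parser: the class name, the whitespace tokenizer and the matching helpers are all hypothetical stand-ins, and the two queries are hard-coded to mirror the example above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: hand-rolled matching, not Lucene's query classes.
public class SynonymQueryDemo {

    // Naive tokenizer (an assumption for this sketch): lowercase, split on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    // True if the document contains the single term anywhere.
    static boolean hasTerm(List<String> tokens, String term) {
        return tokens.contains(term);
    }

    // True if the terms occur adjacently and in order, i.e. a phrase match.
    static boolean hasPhrase(List<String> tokens, String... phrase) {
        return Collections.indexOfSubList(tokens, Arrays.asList(phrase)) >= 0;
    }

    // Old shape: ((ny OR (new AND york)) AND city)
    static boolean matchesConjunctionForm(List<String> tokens) {
        return (hasTerm(tokens, "ny") || (hasTerm(tokens, "new") && hasTerm(tokens, "york")))
                && hasTerm(tokens, "city");
    }

    // New shape: ((ny OR "new york") AND city)
    static boolean matchesPhraseForm(List<String> tokens) {
        return (hasTerm(tokens, "ny") || hasPhrase(tokens, "new", "york"))
                && hasTerm(tokens, "city");
    }

    public static void main(String[] args) {
        List<String> doc = tokenize("something new happened at the York City FC");
        System.out.println("conjunction form matches: " + matchesConjunctionForm(doc));
        System.out.println("phrase form matches: " + matchesPhraseForm(doc));
    }
}
```

In this sketch the York City FC document satisfies the conjunction form, because new, york and city all occur somewhere in it, but not the phrase form, because new and york are not adjacent.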
Safe regex tokenizer
Regular expressions are powerful for tokenizing text, either by identifying which sequences of characters define a token, or by reversing that and identifying the characters that should split tokens. Lucene's PatternTokenizer offers this functionality using the JDK's built-in regular expressions (java.util.regex.*). Unfortunately, since the JDK uses non-deterministic finite state automata (NFAs), these expressions are vulnerable to nasty adversarial cases that can take exponentially long to run on certain text, and our users have hit this. So for Lucene 6.5.0 we've added a new SimplePatternTokenizer, using Lucene's regular expressions and deterministic finite state automata (DFAs) instead. This means overly complex regular expressions will fail to determinize and will be detected up front, instead of later on with an unlucky document text; once the DFA is successfully compiled, tokenization is extremely fast. Unfortunately, since Lucene's DFAs do not support capture groups, we can't yet offer a similar version for PatternCaptureGroupTokenFilter. Fortunately, there is this nice paper describing tagged NFAs that looks like a relatively straightforward approach to make capture groups work with Lucene: patches welcome!
Changes in 5.3:
- The cgroups stats feature (for measuring stats per container) had a regex bug which could prevent a node from starting up.
- Bad Java versions now result in a much more useful exception than before.
- Elasticsearch and all plugins now generate a NOTICE file at build time containing licenses from all dependencies.
- Search-scroll requests can accept a plain text body.
- Fields with default similarity in 2.x were upgraded to use classic similarity, even when the default similarity had been overridden.
- Malformed HTTP requests are now caught and handled early instead of Netty's default behaviour: forwarding to the
- The undocumented include.pattern parameter in aggs was still being used in Kibana, so support has been re-added with deprecation.
- The Location: header returned by PUT requests was not properly encoded, which resulted in an exception if (e.g.) a doc ID contained a space.
- Work on the Java High Level REST client continues with the index API, plus parsing of suggester entry responses, completion suggestion option, and bulk responses.
- HEAD requests should return a content-length header that accurately reflects the length of the omitted response.
- Field collapsing now takes advantage of the more flexible search phases to add its own collapse phase instead of relying on the fetch phase.
- IndexOrDocValuesQuery will automatically choose the most efficient way to execute a range query, based on the relative costs of the queries involved in a conjunction.
- Nested queries should avoid adding unnecessary filters when possible.
- The content-type header must be present and valid for all HTTP requests with a body.
- Some file systems are so large that they can overflow a
- Synonyms should be parsed with the same analysis chain as other tokens.
- Batched reduction of search and agg results could allow the removal of the 1,000 shard soft limit.
- The Lucene 5.5.4 bug fix release is out, fixing two memory leak issues and a number of other accumulated bugs
- If we deprecated index time boosts then length normalization factors, taking one byte per field X document by default, could be more accurate
- The near-real-time document suggester should allow for optionally filtering out duplicates, requiring tricky changes to the classes that enumerate top scoring paths through an FST
- PatternReplaceCharFilterFactory now implements the MultiTermAware marker interface so prefix and wildcard queries can work with it
- newCrossesQuery to find all documents intersecting a strict subset of the query range
- CommonGramsQueryFilter, which removes unigram tokens when bigram tokens are present, no longer works at query time because it creates a disconnected graph, which confuses our new query parser logic for handling token stream graphs
- Our maven integration needs some improvement to handle inter-module dependencies in Solr
- Should we try to initialize ArrayList with their expected size in general?
- ComplexPhraseQueryParser does not know how to handle
- IndexSearcher should take advantage of sorted indices by default, but it requires some API changes like not returning a total hit count nor maximum score for queries by default
- Forbidden APIs is upgraded to 2.3
- Building massive boolean queries is still slow
- OneMergeWrappingMergePolicy lets you change each merge the merge policy chose before it's executed
- Managing our GnuPG keys used for signing release bits is challenging
- Now that ToParentBlockJoinCollector is gone we can again try to make it easier to get the per-hit matching scorers in
- It is far too easy to accidentally create an index that messes up block joins, but such problems should be detected at index time, not with further query-time checks
- SimplePatternSplitTokenizer uses Lucene's fast determinized automaton implementation to locate tokens, but requires that the regular expression can be compactly determinized (not all can!)
- It's vital that you use the correct random instance in tests!
- Lucene has a number of ancient release artifacts that we should prune
- When index files go missing we should throw a NoSuchFileException, but this led to some fun test failures in our exception stress tests
- MemoryIndex is being modernized: it now respects omitNorms, directly implements the new doc values iterators and properly implements per-field postings payloads
- Term filters are so cheap that we should never cache them, leaving space for more complex queries, and we should also cache compound filters earlier than their sub-clauses to improve cache efficiency
- Block join queries should not try to track their original un-rewritten forms
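The split tokenizer item above relies on determinized automata, where each input character triggers exactly one state transition and nothing is ever re-scanned. As a rough illustration only, here is a hand-written two-state DFA in plain Java that splits on the separator pattern [ ,]+. The class name, the states and the pattern are assumptions made for this sketch; Lucene's own automaton classes are not used.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hand-built DFA, not Lucene's SimplePatternSplitTokenizer.
public class DfaSplitSketch {

    private static final int OUTSIDE = 0; // not inside a separator run
    private static final int INSIDE = 1;  // inside a run of separator characters

    // Transition function for the separator pattern "[ ,]+".
    // For a pattern this simple, the next state depends only on the input character.
    static int step(int state, char c) {
        return (c == ' ' || c == ',') ? INSIDE : OUTSIDE;
    }

    // Single left-to-right pass: one transition per character, no backtracking.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int state = OUTSIDE;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            state = step(state, c);
            if (state == INSIDE) {
                // A separator ends the current token, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("new york, ny"));
    }
}
```

Because the loop visits each character once, the running time grows linearly with the text length regardless of its content, which is the property that makes the DFA approach safe against adversarial input.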
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!