Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
At last! It's possible to search for multitoken synonyms in #Elasticsearch and #Lucene, and get the correct hits: https://t.co/Qh5xF2TKSs
— elastic (@elastic) February 14, 2017
Query_string to generate phrase queries for multi-term synonyms
Query parsers are getting a new option that allows them to produce phrase queries when multi-term synonyms are encountered. For instance, say your analysis chain configures ny and new york as synonyms, and a user searches for ny city. The produced query will now be ((ny OR "new york") AND city) instead of ((ny OR (new AND york)) AND city). This is interesting as it means the query will no longer match documents talking about, say, something new that happened at York City FC.
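For the curious, here is a minimal sketch of what this looks like at the Lucene level, assuming a SynonymGraphFilter-based analysis chain; the setter name matches recent Lucene releases, and the field name body and the whitespace analyzer are just illustrative choices:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.CharsRefBuilder;
import org.apache.lucene.util.QueryBuilder;

public class MultiTermSynonymSketch {
  public static void main(String[] args) throws Exception {
    // Register "ny" -> "new york" (a multi-term synonym); the final
    // 'true' keeps the original token alongside the synonym.
    SynonymMap.Builder builder = new SynonymMap.Builder(true);
    CharsRef newYork = SynonymMap.Builder.join(
        new String[] {"new", "york"}, new CharsRefBuilder());
    builder.add(new CharsRef("ny"), newYork, true);
    SynonymMap map = builder.build();

    // SynonymGraphFilter emits a proper token graph (unlike the older
    // SynonymFilter), so the query builder can see the multi-term synonym.
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream sink = new SynonymGraphFilter(source, map, true);
        return new TokenStreamComponents(source, sink);
      }
    };

    QueryBuilder qb = new QueryBuilder(analyzer);
    // The new option: phrase queries for multi-term synonyms.
    qb.setAutoGenerateMultiTermSynonymsPhraseQuery(true);
    // Produces ((ny OR "new york") AND city) rather than treating
    // "new" and "york" as independent terms.
    Query query = qb.createBooleanQuery("body", "ny city", BooleanClause.Occur.MUST);
    System.out.println(query);
  }
}
```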
Safe regex tokenizer
Regular expressions are powerful for tokenizing text, either by identifying which sequences of characters make up a token, or by identifying the characters that should split tokens. Lucene's PatternTokenizer offers this functionality, using the JDK's builtin regular expressions (java.util.regex.*). Unfortunately, since the JDK implementation uses non-deterministic finite state automata (NFAs), it is vulnerable to nasty adversarial cases that can take exponentially long to run on certain text, and our users have hit this. So for Lucene 6.5.0 we've added a new SimplePatternTokenizer, using Lucene's regular expressions and deterministic finite state automata (DFAs) instead. This means overly complex regular expressions will fail to determinize and will be detected up front, instead of later on with an unlucky document, and once the DFA is successfully compiled, tokenization is extremely fast. Unfortunately, since Lucene's DFAs do not support capture groups, we can't yet offer a similar version of PatternCaptureGroupTokenFilter. Fortunately, there is this nice paper describing tagged NFAs that looks like a relatively straightforward approach to making capture groups work with Lucene: patches welcome!
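Here is a minimal sketch of the new tokenizer in action; the pattern and the sample text are our own. Note the pattern describes the tokens themselves, in Lucene's RegExp syntax rather than java.util.regex:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.SimplePatternTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SimplePatternSketch {
  public static void main(String[] args) throws Exception {
    // The pattern is compiled to a DFA up front, so there is no
    // backtracking at tokenization time.
    Tokenizer tokenizer = new SimplePatternTokenizer("[a-zA-Z0-9]+");
    tokenizer.setReader(new StringReader("foo bar-baz 42"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term); // foo, bar, baz, 42
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```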
Changes in 5.3:
- The cgroups stats feature (for measuring stats per container) had a regex bug which could prevent a node from starting up.
- Bad Java versions now result in a much more useful exception than before.
- Elasticsearch and all plugins now generate a NOTICE file at build time containing licenses from all dependencies.
- Search-scroll requests can accept a plain text body.
- Fields with default similarity in 2.x were upgraded to use classic similarity, even when the default similarity had been overridden.
- Malformed HTTP requests are now caught and handled early instead of Netty's default behaviour of forwarding to the /bad-request endpoint.
- The undocumented include.pattern parameter in aggs was still being used in Kibana, so support has been re-added with deprecation.
- The Location: header returned by PUT requests was not properly encoded, which resulted in an exception if (e.g.) a doc ID contained a space.
- Work on the Java High Level REST client continues with the index API, plus parsing of suggester entry responses, completion suggestion option, and bulk responses.
- HEAD requests should return a content-length header that accurately reflects the length of the omitted response.
- Field collapsing now takes advantage of the more flexible search phases to add its own collapse phase instead of relying on the fetch phase.
- The IndexOrDocValuesQuery will automatically choose the most efficient way to execute a range query, based on the relative costs of the queries involved in a conjunction (see the sketch after this list).
- Nested queries should avoid adding unnecessary filters when possible.
- The content-type header must be present and valid for all HTTP requests with a body.
- Some file systems are so large that they can overflow a long value.
- Synonyms should be parsed with the same analysis chain as other tokens.
- Batched reduction of search and agg results could allow the removal of the 1,000 shard soft limit.
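On the IndexOrDocValuesQuery item above, here is a minimal sketch of how such a query is put together. The field name and bounds are illustrative, and the doc-values helper is named newSlowRangeQuery in recent Lucene releases:

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

public class RangeQuerySketch {
  static Query priceRange(long min, long max) {
    // Points are fast when the range itself matches few documents; doc
    // values are cheaper when another clause in the conjunction is far
    // more selective. IndexOrDocValuesQuery picks between the two based
    // on the estimated cost of each.
    Query points = LongPoint.newRangeQuery("price", min, max);
    Query docValues = SortedNumericDocValuesField.newSlowRangeQuery("price", min, max);
    return new IndexOrDocValuesQuery(points, docValues);
  }
}
```

This assumes the field was indexed both as a LongPoint and as a SortedNumericDocValuesField, which is what makes the choice between the two executions possible.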
Apache Lucene
- The Lucene 5.5.4 bug fix release is out, fixing two memory leak issues and a number of other accumulated bugs
- If we deprecated index-time boosts, then length normalization factors, which take one byte per field per document by default, could be more accurate
- The near-real-time document suggester should allow for optionally filtering out duplicates, requiring tricky changes to the classes that enumerate top scoring paths through an FST
- PatternReplaceCharFilterFactory now implements the MultiTermAware marker interface so prefix and wildcard queries can work with it
- RangeField gets a newCrossesQuery to find all documents intersecting a strict subset of the query range (see the sketch after this list)
- CommonGramsQueryFilter, which removes unigram tokens when bigram tokens are present, no longer works at query time because it creates a disconnected graph which confuses our new query parser logic for handling token stream graphs
- Our Maven integration needs some improvement to handle inter-module dependencies in Solr
- Should we try to initialize ArrayLists with their expected size in general?
- The ComplexPhraseQueryParser does not know how to handle SynonymQuery
- IndexSearcher should take advantage of sorted indices by default, but it requires some API changes like not returning a total hit count nor a maximum score for queries by default
- Forbidden APIs is upgraded to 2.3
- Building massive boolean queries is still slow
- OneMergeWrappingMergePolicy lets you change each merge the merge policy chose before it's executed
- Managing our GnuPG keys used for signing release bits is challenging
- Now that ToParentBlockJoinCollector is gone we can again try to make it easier to get the per-hit matching scorers in DisjunctionScorer
- It is far too easy to accidentally create an index that messes up block joins, but such problems should be detected at index time, not with further query-time checks
- SimplePatternTokenizer and SimplePatternSplitTokenizer use Lucene's fast determinized automaton implementation to locate tokens, but require that the regular expression can be compactly determinized (not all can!)
- It's vital that you use the correct random instance in tests!
- Lucene has a number of ancient release artifacts that we should prune
- When index files go missing we should throw a CorruptIndexException, not a NoSuchFileException, but this led to some fun test failures in our exception stress tests
- MemoryIndex is being modernized: it now respects omitNorms, directly implements the new doc values iterators, and properly implements per-field postings payloads (see the sketch after this list)
- Term filters are so cheap that we should never cache them, leaving space for more complex queries, and we should also cache compound filters earlier than their sub-clauses to improve cache efficiency
- Block join queries should not try to track their original un-rewritten forms
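On the RangeField item above, here is a minimal sketch using the LongRange flavour of range fields; the field name and values are illustrative, and the factory method names may differ slightly by version:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongRange;
import org.apache.lucene.search.Query;

public class RangeCrossesSketch {
  public static void main(String[] args) {
    // Index a document holding the range [3, 8] in a one-dimensional field.
    Document doc = new Document();
    doc.add(new LongRange("duration", new long[] {3L}, new long[] {8L}));

    // Match documents whose stored range crosses the query range [5, 10],
    // i.e. intersects it without being fully contained in it.
    Query crosses = LongRange.newCrossesQuery(
        "duration", new long[] {5L}, new long[] {10L});
    System.out.println(crosses);
  }
}
```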
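And on the MemoryIndex item, a minimal sketch of what the class is for; the field, text and query are our own:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.TermQuery;

public class MemoryIndexSketch {
  public static void main(String[] args) {
    // MemoryIndex holds a single document entirely in RAM, which makes it
    // handy for percolation-style "does this query match this one
    // document?" checks without touching a real index.
    MemoryIndex index = new MemoryIndex();
    index.addField("body", "the quick brown fox", new StandardAnalyzer());
    float score = index.search(new TermQuery(new Term("body", "fox")));
    System.out.println(score > 0 ? "match" : "no match");
  }
}
```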
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!