20 February 2017

This Week in Elasticsearch and Apache Lucene - 2017-02-20

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

At last! It's possible to search for multitoken synonyms in #Elasticsearch and #Lucene, and get the correct hits: https://t.co/Qh5xF2TKSs
— elastic (@elastic) February 14, 2017

Query_string to generate phrase queries for multi-term synonyms

Query parsers are getting a new option that allows them to produce phrase queries when multi-term synonyms are encountered. For instance, say your analysis chain configures ny and new york as synonyms and a user searches for ny city. The produced query will now be ((ny OR "new york") AND city) instead of ((ny OR (new AND york)) AND city). This matters because the query no longer matches documents that talk about, for instance, something new that happened at the York City FC.
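To make the difference concrete, here is a minimal sketch in plain Python that evaluates both query shapes against a few toy documents. It is not how Lucene or Elasticsearch execute queries: the whitespace tokenizer and the helper names (tokens, has_term, has_phrase, old_query, new_query) are assumptions made up for illustration.

```python
def tokens(text):
    # Toy analysis chain: lowercase and split on whitespace.
    return text.lower().split()


def has_term(doc, term):
    # True if the single term occurs anywhere in the document.
    return term in tokens(doc)


def has_phrase(doc, phrase):
    # True if the phrase's words occur adjacent and in order.
    words = tokens(phrase)
    toks = tokens(doc)
    return any(
        toks[i:i + len(words)] == words
        for i in range(len(toks) - len(words) + 1)
    )


def old_query(doc):
    # (ny OR (new AND york)) AND city
    synonym = has_term(doc, "ny") or (has_term(doc, "new") and has_term(doc, "york"))
    return synonym and has_term(doc, "city")


def new_query(doc):
    # (ny OR "new york") AND city
    synonym = has_term(doc, "ny") or has_phrase(doc, "new york")
    return synonym and has_term(doc, "city")


docs = [
    "something new happened at york city fc",
    "new york city marathon",
    "ny city guide",
]
for doc in docs:
    print(doc, "| old:", old_query(doc), "| new:", new_query(doc))
```

Only the first document changes: the old form matches it because new and york both occur somewhere in the text, while the phrase form requires them to be adjacent.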

Safe regex tokenizer

Regular expressions are powerful for tokenizing text, either by identifying which sequences of characters define a token or, in reverse, by identifying the characters that should split tokens. Lucene's PatternTokenizer offers this functionality, using the JDK's built-in regular expressions (java.util.regex.*). Unfortunately, because the JDK uses non-deterministic finite state automata (NFAs), its regular expressions are vulnerable to nasty adversarial cases that can take exponentially long to run on certain text, and our users have hit this. So for Lucene 6.5.0 we've added a new SimplePatternTokenizer, using Lucene's regular expressions and deterministic finite state automata (DFAs) instead. This means overly complex regular expressions will fail to determinize and will be detected up front, instead of later on with an unlucky document text. Once the DFA is successfully compiled, tokenization is extremely fast. Unfortunately, since Lucene's DFAs do not support capture groups, we can't yet offer a similar version for PatternCaptureGroupTokenFilter. Fortunately, there is this nice paper describing tagged NFAs that looks like a relatively straightforward approach to make capture groups work with Lucene: patches welcome!
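The idea of failing up front can be illustrated with a toy subset construction that stops once the DFA grows past a state budget. This is a sketch in plain Python, not Lucene's automaton code: the names build_nfa, determinize, max_states and TooComplexToDeterminize are hypothetical, and the alphabet is limited to a and b. The NFA used is the textbook one for (a|b)*a(a|b){n}, which has n + 2 states but needs 2^(n+1) DFA states.

```python
from collections import deque


class TooComplexToDeterminize(Exception):
    """Hypothetical error for this sketch: the DFA would exceed the state budget."""


def build_nfa(n):
    """NFA for (a|b)*a(a|b){n}: maps state -> symbol -> set of next states."""
    transitions = {0: {"a": {0, 1}, "b": {0}}}
    for state in range(1, n + 1):
        transitions[state] = {"a": {state + 1}, "b": {state + 1}}
    transitions[n + 1] = {}  # accepting state, no outgoing edges
    return transitions


def determinize(transitions, start=0, max_states=1000):
    """Subset construction that gives up once max_states DFA states exist."""
    start_set = frozenset([start])
    dfa = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        dfa[current] = {}
        for symbol in "ab":
            # The DFA target is the union of all NFA moves on this symbol.
            target = frozenset(
                nxt
                for state in current
                for nxt in transitions[state].get(symbol, ())
            )
            if not target:
                continue
            if target not in seen:
                if len(seen) >= max_states:
                    raise TooComplexToDeterminize(
                        "more than %d DFA states needed" % max_states
                    )
                seen.add(target)
                queue.append(target)
            dfa[current][symbol] = target
    return dfa


print(len(determinize(build_nfa(3))))  # 16 DFA states for a 5-state NFA

try:
    determinize(build_nfa(20), max_states=1000)
except TooComplexToDeterminize as exc:
    print("rejected at compile time:", exc)
```

The expensive pattern is rejected while the automaton is being built, before any document text is seen, which is the trade-off described above.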

Changes in 5.3:

The cgroups stats feature (for measuring stats per container) had a regex bug which could prevent a node from starting up.
Bad Java versions now result in a much more useful exception than before.
Elasticsearch and all plugins now generate a NOTICE file at build time containing licenses from all dependencies.
Search-scroll requests can accept a plain text body.
Fields with default similarity in 2.x were upgraded to use classic similarity, even when the default similarity had been overridden.
Malformed HTTP requests are now caught and handled early instead of Netty's default behaviour: forwarding to the /bad-request end point.
The undocumented include.pattern parameter in aggs was still being used in Kibana, so support has been re-added with deprecation.
The Location: header returned by PUT requests was not properly encoded, which resulted in an exception if (eg) a doc ID contained a space.

Changes in 5.x:

Work on the Java High Level REST client continues with the index API, plus parsing of suggester entry responses, completion suggestion option, and bulk responses.
HEAD requests should return a content-length header that accurately reflects the length of the omitted response.
Field collapsing now takes advantage of the more flexible search phases to add its own collapse phase instead of relying on the fetch phase.
The IndexOrDocValuesQuery will automatically choose the most efficient way to execute a range query, based on the relative costs of the queries involved in a conjunction.
Nested queries should avoid adding unnecessary filters when possible.
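The cost-based choice mentioned for IndexOrDocValuesQuery can be sketched with toy data structures. This is only an illustration under assumptions: the function range_in_conjunction, the sorted list standing in for the index, the dict standing in for doc values, and the simple "compare the two costs" rule are all made up here, and Lucene's real heuristics may differ.

```python
import bisect


def range_in_conjunction(lead_docs, sorted_pairs, doc_values, low, high):
    """Evaluate `lead AND value in [low, high]` and report the strategy used.

    lead_docs    -- set of doc ids matched by the other clause
    sorted_pairs -- sorted list of (value, doc_id), a stand-in for the index
    doc_values   -- dict doc_id -> value, a stand-in for doc values
    """
    # Locating the range boundaries in the sorted structure is cheap.
    start = bisect.bisect_left(sorted_pairs, (low, -1))
    end = bisect.bisect_right(sorted_pairs, (high, float("inf")))
    range_cost = end - start      # docs the range alone would visit
    lead_cost = len(lead_docs)    # docs the other clause would visit

    if range_cost <= lead_cost:
        # The range is selective: enumerate it from the index, then intersect.
        range_docs = {doc for _, doc in sorted_pairs[start:end]}
        return "index", range_docs & lead_docs

    # The range is broad: only verify the few lead docs against doc values.
    matches = {doc for doc in lead_docs if low <= doc_values[doc] <= high}
    return "doc_values", matches


doc_values = {doc: doc * 10 for doc in range(100)}
sorted_pairs = sorted((value, doc) for doc, value in doc_values.items())
lead = {3, 50, 97}

print(range_in_conjunction(lead, sorted_pairs, doc_values, 0, 990))
print(range_in_conjunction(lead, sorted_pairs, doc_values, 25, 35))
```

Both strategies return the same matches; the point of choosing between them is to avoid enumerating a huge range when only a handful of candidate documents need checking.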

Changes in master:

The content-type header must be present and valid for all HTTP requests with a body.
Some file systems are so large that they can overflow a long value.

Upcoming changes:

Synonyms should be parsed with the same analysis chain as other tokens.
Batched reduction of search and agg results could allow the removal of the 1,000 shard soft limit.

Apache Lucene

The Lucene 5.5.4 bug fix release is out, fixing two memory leak issues and a number of other accumulated bugs
If we deprecated index-time boosts, then length normalization factors, which take one byte per field per document by default, could be more accurate
The near-real-time document suggester should allow for optionally filtering out duplicates, requiring tricky changes to the classes that enumerate top scoring paths through an FST
PatternReplaceCharFilterFactory now implements the MultiTermAware marker interface so prefix and wildcard queries can work with it
RangeField gets a newCrossesQuery to find all documents intersecting a strict subset of the query range
CommonGramsQueryFilter, which removes unigram tokens when bigram tokens are present, no longer works at query time because it creates a disconnected graph which confuses our new query parser logic to handle token stream graphs
Our maven integration needs some improvement to handle inter-module dependencies in Solr
Should we try to initialize ArrayList with their expected size in general?
The ComplexPhraseQueryParser does not know how to handle SynonymQuery
IndexSearcher should take advantage of sorted indices by default, but it requires some API changes like not returning a total hit count nor maximum score for queries by default
Forbidden APIs is upgraded to 2.3
Building massive boolean queries is still slow
OneMergeWrappingMergePolicy lets you modify each merge that the wrapped merge policy chose before it is executed
Managing our GnuPG keys used for signing release bits is challenging
Now that ToParentBlockJoinCollector is gone we can again try to make it easier to get the per-hit matching scorers in DisjunctionScorer
It is far too easy to accidentally create an index that messes up block joins, but such problems should be detected at index time, not with further query-time checks
SimplePatternTokenizer and SimplePatternSplitTokenizer use Lucene's fast determinized automaton implementation to locate tokens, but require that the regular expression can be compactly determinized (not all can!)
It's vital that you use the correct random instance in tests!
Lucene has a number of ancient release artifacts that we should prune
When index files go missing we should throw a CorruptIndexException, not a NoSuchFileException, but this led to some fun test failures in our exception stress tests
MemoryIndex is being modernized: it now respects omitNorms, directly implements the new doc values iterators and properly implements per-field postings payloads
Term filters are so cheap that we should never cache them, leaving space for more complex queries, and we should also cache compound filters earlier than their sub-clauses to improve cache efficiency
Block join queries should not try to track their original un-rewritten forms

Watch This Space

Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!
