14 février 2017

Multi-Token Synonyms and Graph Queries in Elasticsearch

Long ago I described how the tokens Lucene produces during text analysis really describe a graph which is a good fit for the complex structure of the natural languages we humans speak and write. For example, when you apply the synonym:

  domain name system → dns

while analyzing this text:

  domain name system is fragile

the resulting token graph from Lucene, as encoded using PositionIncrementAttribute and PositionLengthAttribute attributes, looks like this:

However, once you add this document to a Lucene index, some of the graph structure is lost because Lucene ignores the PositionLengthAttribute, which states where a given token should end, and effectively flattens your token graph to this:

Of course, this makes some queries buggy. For example, the phrase query "dns is fragile" will fail to match this document yet it should. Similarly, the phrase query "dns name system" should not match this document, but will!

Worse, if your synonym had inserted more than one token, such as reversing the above collapsing rule into an expanding one:

  dns → domain name system

and then analyzed this text:

  dns is fragile

then even the tokens produced by SynonymFilter would already be flattened, before indexing!

This has been a longstanding, complex, glaring and embarrassing limitation of synonyms with Lucene, resulting in multiple Jira issues, long discussions on Lucene's users list and multiple complex blog posts describing hairy, partial workarounds.

The Solution

Finally, as of Lucene 6.4.0, included in Elasticsearch 5.2.0, we have a clean solution, as long as you apply synonyms at search-time instead of index-time.

The first big change is the addition of SynonymGraphFilter, replacing the now-deprecated SynonymFilter. This new synonym filter produces a correct graph in all cases (except for any exciting bugs!), whether your inserted synonym is a single token or multiple tokens. Along with this we've also added a new FlattenGraphFilter, to squash any graph token stream for indexing. If for some reason you really need exactly the same behavior as the old SynonymFilter then you should create a SynonymGraphFilter followed by a FlattenGraphFilter, but just remember that FlattenGraphFilter necessarily throws something (the full graph structure) away!

The second set of vital improvements is to the query parsers. First, the classic query parser had to stop pre-splitting incoming query text at whitespace, fixed (after years of controversy!) in Lucene 6.2.0. Be sure to call setSplitOnWhitespace(false) since the whitespace splitting is still enabled by default to preserve backwards compatibility. This change empowers the query-time analyzer to see multiple tokens as a single string instead of seeing each token separately. These simplifications to the complex logic in QueryBuilder are also an important precursor.

The third query parser fix is to detect when the query-time analyzer produced a graph, and create accurate queries as a result, also first added in Lucene 6.4.0. The query parser (specifically the QueryBuilder base class) now watches the PositionLengthAttribute and computes all paths through the graph when any token has a value greater than 1.

In Elasticsearch, the match_query, query_string and simple_query_string queries will all correctly handle a token graph.

Multiple Paths

These fixes, combined, enable you to apply multi-token synonyms at search-time and achieve accurate results. For example, if the query is:

  dns is fragile

with no quotes, and your SynonymGraphFilter expands dns to also search for domain name system, then the query parser will first build up the whole token graph after analyzing your query, and then enumerate the separate paths through the graph:

  dns is fragile
  domain name system is fragile

and will then separately analyze those two strings to produce sub-queries, taking their union using SHOULD clauses in a BooleanQuery.

If instead the original query had double quotes, then this will be the union of two PhraseQuery, yielding accurate hits!

A Tricky Optimization

For Lucene 6.5.0, this important QueryBuilder optimization will analyze the query's token graph for articulation points, or cut vertices, in order to create a more efficient BooleanQuery directly, avoiding possibly dangerous combinatoric explosion of all paths as the graph gets larger.

While this is mostly just an optimization, it does raise some tricky questions about how graph queries should be scored, how boolean operator should apply to the expanded synonyms, and how settings like minShouldMatch should work.

For example, if the user did not put quotes around the query, but a multi-token synonym is inserted, should that synonym use a phrase query or just the default query parser operator? Hopefully real-world experiences with graph queries over time will shed some light on these questions.

More graph filters

Besides synonyms, Lucene has other token filters that should produce graphs!

In the next (6.5.0) Lucene release, WordDelimiterFilter will be deprecated and replaced with WordDelimiterGraphFilter resulting in a correct token graph for input tokens like PowerShot2000-15:

Lucene's JapaneseTokenizer, based on the Kuromoji Japanese morphological analyzer, already produces a token graph to express a whole word at the same time as possible sub-words. Numerous other token filters, such as shingles, ngrams and de-compounding token filters, should produce a graph output, but they have not been fixed yet. Patches welcome!

Limitations

To make multi-token synonyms work correctly you must apply your synonyms at query time, not index-time, since a Lucene index cannot store a token graph. Search-time synonyms necessarily require more work (CPU and IO) than index-time synonyms, since more terms must be visited to answer the query, but the index will be smaller. Search-time synonyms are also more flexible, since you don't need to re-index when you change your synonyms, which is important for very large indices.

Another challenge is that SynonymGraphFilter, while producing correct graphs, cannot consume a graph, which means you cannot for example use WordDelimiterGraphFilter or JapaneseTokenizer followed by SynonymGraphFilter and expect synonyms to match the incoming graph fragments.

This is quite tricky to fix given the current TokenStream APIs, and there is a compelling experimental branch to explore a simpler API for creating graph token streams. While the synonym filter on that branch can already consume a graph, there is still plenty of work to be done before the change can be merged (patches welcome!).