Long ago I described how the tokens Lucene produces during text analysis really describe a graph which is a good fit for the complex structure of the natural languages we humans speak and write. For example, when you apply the synonym:
domain name system → dns
while analyzing this text:
domain name system is fragile
However, once you add this document to a Lucene index, some of the graph structure is lost because Lucene ignores the
PositionLengthAttribute, which states where a given token should end, and effectively flattens your token graph to this:
Of course, this makes some queries buggy. For example, the phrase query
"dns is fragile" will fail to match this document yet it should. Similarly, the phrase query
"dns name system" should not match this document, but will!
Worse, if your synonym had inserted more than one token, such as reversing the above collapsing rule into an expanding one:
dns → domain name system
and then analyzed this text:
dns is fragile
then even the tokens produced by
SynonymFilter would already be flattened, before indexing!
This has been a longstanding, complex, glaring and embarrassing limitation of synonyms with Lucene, resulting in multiple Jira issues, long discussions on Lucene's users list and multiple complex blog posts describing hairy, partial workarounds.
The first big change is the addition of
SynonymGraphFilter, replacing the now-deprecated
SynonymFilter. This new synonym filter produces a correct graph in all cases (except for any exciting bugs!), whether your inserted synonym is a single token or multiple tokens. Along with this we've also added a new
FlattenGraphFilter, to squash any graph token stream for indexing. If for some reason you really need exactly the same behavior as the old
SynonymFilter then you should create a
SynonymGraphFilter followed by a
FlattenGraphFilter, but just remember that
FlattenGraphFilter necessarily throws something (the full graph structure) away!
The second set of vital improvements is to the query parsers. First, the classic query parser had to stop pre-splitting incoming query text at whitespace, fixed (after years of controversy!) in Lucene 6.2.0. Be sure to call
setSplitOnWhitespace(false) since the whitespace splitting is still enabled by default to preserve backwards compatibility. This change empowers the query-time analyzer to see multiple tokens as a single string instead of seeing each token separately. These simplifications to the complex logic in
QueryBuilder are also an important precursor.
The third query parser fix is to detect when the query-time analyzer produced a graph, and create accurate queries as a result, also first added in Lucene 6.4.0. The query parser (specifically the
QueryBuilder base class) now watches the
PositionLengthAttribute and computes all paths through the graph when any token has a value greater than 1.
In Elasticsearch, the
simple_query_string queries will all correctly handle a token graph.
These fixes, combined, enable you to apply multi-token synonyms at search-time and achieve accurate results. For example, if the query is:
dns is fragile
with no quotes, and your
dns to also search for
domain name system, then the query parser will first build up the whole token graph after analyzing your query, and then enumerate the separate paths through the graph:
dns is fragile domain name system is fragile
and will then separately analyze those two strings to produce sub-queries, taking their union using
SHOULD clauses in a
If instead the original query had double quotes, then this will be the union of two
PhraseQuery, yielding accurate hits!
A Tricky Optimization
For Lucene 6.5.0, this important
QueryBuilder optimization will analyze the query's token graph for articulation points, or cut vertices, in order to create a more efficient
BooleanQuery directly, avoiding possibly dangerous combinatoric explosion of all paths as the graph gets larger.
While this is mostly just an optimization, it does raise some tricky questions about how graph queries should be scored, how boolean operator should apply to the expanded synonyms, and how settings like
minShouldMatch should work.
For example, if the user did not put quotes around the query, but a multi-token synonym is inserted, should that synonym use a phrase query or just the default query parser operator? Hopefully real-world experiences with graph queries over time will shed some light on these questions.
More graph filters
Besides synonyms, Lucene has other token filters that should produce graphs!
In the next (6.5.0) Lucene release,
WordDelimiterFilter will be deprecated and replaced with
WordDelimiterGraphFilter resulting in a correct token graph for input tokens like
JapaneseTokenizer, based on the Kuromoji Japanese morphological analyzer, already produces a token graph to express a whole word at the same time as possible sub-words. Numerous other token filters, such as shingles, ngrams and de-compounding token filters, should produce a graph output, but they have not been fixed yet. Patches welcome!
To make multi-token synonyms work correctly you must apply your synonyms at query time, not index-time, since a Lucene index cannot store a token graph. Search-time synonyms necessarily require more work (CPU and IO) than index-time synonyms, since more terms must be visited to answer the query, but the index will be smaller. Search-time synonyms are also more flexible, since you don't need to re-index when you change your synonyms, which is important for very large indices.
Another challenge is that
SynonymGraphFilter, while producing correct graphs, cannot consume a graph, which means you cannot for example use
JapaneseTokenizer followed by
SynonymGraphFilter and expect synonyms to match the incoming graph fragments.
This is quite tricky to fix given the current
TokenStream APIs, and there is a compelling experimental branch to explore a simpler API for creating graph token streams. While the synonym filter on that branch can already consume a graph, there is still plenty of work to be done before the change can be merged (patches welcome!).