This Week in Elasticsearch and Apache Lucene - 2017-01-23
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Cathay Pacific improves incident management with Elasticsearch https://t.co/PbtqBL45UV via @ComputerWorldHK
— Nancy Klahn (@MamaKlahn) January 23, 2017
Elasticsearch Core
Adjacency matrix aggregation
We have a new `adjacency_matrix` aggregation, which lets you analyze the co-occurrence of filters: given a list of terms, for example, it can tell you how often each pair of those terms occurs together. It was built to improve the graph functionality, so that users can better dive into how different nodes in a graph are connected. For instance, nested under a `date_histogram`, it could help analyze how fraudulent bank accounts have exchanged money over time. It is likely, however, that users will find exciting use cases for this aggregation outside the context of graph.
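As a rough sketch, a request using the new aggregation might look like the following Python snippet, which simply builds the JSON request body; the index field and filter names (`accounts`, `acct_a`, and so on) are hypothetical:

```python
import json

# Hypothetical example: count how often pairs of accounts appear together
# in the same transaction documents. The field name "accounts" and the
# filter keys "acct_a"/"acct_b"/"acct_c" are illustrative only.
request_body = {
    "size": 0,
    "aggs": {
        "interactions": {
            "adjacency_matrix": {
                "filters": {
                    "acct_a": {"term": {"accounts": "A"}},
                    "acct_b": {"term": {"accounts": "B"}},
                    "acct_c": {"term": {"accounts": "C"}},
                }
            }
        }
    },
}

print(json.dumps(request_body, indent=2))
```

The response then contains one bucket per filter plus one per non-empty intersection of filters (e.g. `acct_a&acct_b`), which is what makes the pairwise co-occurrence counts directly available.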
Improved performance of numeric range queries
When you ask Lucene to filter a query with a numeric range, its first step is to build a bitset marking all documents accepted by the range filter, by visiting the dimensional points (BKD tree). This can result in unexpectedly poor performance when the range accepts many documents but the other parts of the query are restrictive. With this change, quietly pushed this past week for the future Lucene 6.5.0 release, Lucene is now smarter: it checks up front the expected cost of enumerating all hits for the range versus the expected cost of the other query clauses, and if the range is more costly, it instead uses the other clauses to enumerate candidate hits and then uses doc values, rather than dimensional points, to check whether each hit falls within the range. For queries that combine restrictive clauses with non-restrictive ranges this can be an enormous speedup. All that is required is that you index your range fields with both doc values and points, and then use `IndexOrDocValuesQuery`, which just graduated to Lucene's core module, to express the range at search time.
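In spirit, the decision works like the following Python sketch. This is illustrative only: the real implementation operates on BKD point trees and doc-values iterators, and every name here is made up.

```python
def search(range_lo, range_hi, values, other_clause_docs):
    """Conjunction of a numeric range filter with another, possibly more
    restrictive, clause.

    values: dict of doc_id -> numeric value (a stand-in for doc values /
            points; in Lucene these are separate on-disk structures).
    other_clause_docs: sorted doc ids matching the other clause.
    Returns doc ids matching both.
    """
    # Estimate the cost of each strategy up front.
    range_cost = sum(1 for v in values.values() if range_lo <= v <= range_hi)
    other_cost = len(other_clause_docs)

    if range_cost <= other_cost:
        # Range is cheap: enumerate its matches (points-style) and intersect.
        range_docs = {d for d, v in values.items() if range_lo <= v <= range_hi}
        return [d for d in other_clause_docs if d in range_docs]

    # Range is expensive: let the other clause drive iteration and use the
    # per-document value (doc-values-style) only to verify each candidate.
    return [d for d in other_clause_docs if range_lo <= values[d] <= range_hi]


values = {0: 5, 1: 42, 2: 7, 3: 99}
print(search(0, 50, values, [1, 3]))  # -> [1]
```

Both branches return the same result; the point is that when the range matches most of the index, verifying a handful of candidates one by one is far cheaper than materializing the full range bitset.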
Changes in 5.2:
- Upgraded to Lucene 6.4.0.
- Document failures need specific handling in `InternalEngine` to separate them from other exceptions, which should fail the `IndexWriter`. Also replaced `EngineClosedException` with Lucene's `AlreadyClosedException`.
- When a `script_score` function in `function_query` combines `_score` and `weight`, the score was incorrectly returned as `0`.
- Close `InputStream` when receiving cluster state in `PublishClusterStateAction` to avoid leaking memory.
- Index creation and settings update may not return deprecation logging headers. Deprecation headers should also be preserved when present on a transport protocol response.
- Snapshot/restore now checks the `index.latest` blob instead of trying to pick the highest `index-N` blob.
- Shadow replicas are deprecated.
- The `flatten_graph` token filter takes the output of the `synonym_graph` token filter and flattens it for indexing.
- The `stempel` token filter was not thread safe.
- The new cross-cluster search client has landed.
- Custom routing can now target a group of shards instead of just a single shard.
- The Java High-Level REST client continues to expand, now supporting `delete` responses, search-shard-failure responses, search-profile-shard responses, and `update` responses.
- Painless can now encode and decode base64.
- The profile API should return timings in machine-readable format by default, with `?human` for human-readable output.
- Logging now exposes the logs base path, the cluster name, and the node name for more log-file flexibility.
- Certain comma-delimited array settings now accept trailing spaces.
- Guice has been removed from REST handlers.
- The S3 repository plugin now accepts secure settings, so specifying settings via environment variables, system properties, or profile files is now deprecated.
- All booleans are strictly parsed and now accept only `true`, `false`, `"true"`, or `"false"`.
- After three years of existence, aggregations are finally getting unit tests.
- Blocking versions of the TCP transport server, TCP transport client, and HTTP server have been removed.
- CRUD requests which result in version conflicts should still be replicated to ensure a complete sequence ID history.
- Core no longer requires the `accept` `SocketPermission`. Related to this, the REST HTTP client used by (e.g.) reindex is now wrapped in a `doPrivileged` block, as are socket connection operations in plugins.
- A new Azure ARM discovery plugin has been added.
- A content-type header will be required for all HTTP requests with a body.
- The LogLogBeta algorithm for the cardinality aggregation may be more efficient and more accurate than HyperLogLog.
Apache Lucene
- The Lucene 6.4.0 release vote has passed and the bits will soon be set free, despite Maven fighting back!
- Lucene now more carefully picks how to execute possibly costly range filters.
- `IndexOrDocValuesQuery`, which can run the same range filter either with points or with doc values, has graduated from `sandbox` to `core`, but seems to have caused an intermittent `NullPointerException`.
- Can we make it easier to build graph token filters?
- Query parsers should create much more efficient queries when they encounter a graph token stream to avoid combinatoric explosion
- If a range filter will match more than half the index, we now speed up the filtering by inverting the range.
- `WordDelimiterGraphFilter`, replacing `WordDelimiterFilter`, finally works correctly with positional queries at search time.
- `EdgeNGramTokenFilter` deletes incoming token payloads.
- Lucene 6.5.0 will now sort new segments when they are initially written, not on first merge.
- `FieldTerms` doesn't seem a much better name than `Terms` (naming is hard!).
- Suffix arrays can provide fast leading-wildcard searches, but their merits versus an FST on the reversed string are debatable.
- An innocent issue about adding sugar constructors to `TermQuery` quickly led to a proposal to remove `Term` entirely.
- Why does Java let you put more than one class into a source file?
- We are now taking more of javac's warnings seriously.
- `HTMLStripCharFilter` should not include the closing tag in its token offsets.
- Can we fix `CompressingStoredFieldsFormat` to release native memory more aggressively without harming stored-fields retrieval performance too much?
- `Geo3D` hit another curious failure.
- Lucene is not designed to support millions of unique fields.
- Nested `SpanNearQuery`s miss some hits today, but the fix is surprisingly tricky and only addresses some cases.
- The Kuromoji (Japanese) tokenizer fails if the user tries to include `#` in their dictionary.
- Estimation of the number of hits in dimensional points gets faster.
- Some regular expressions may cause a `NullPointerException` in `AutomatonTermsEnum`.
- The very large switch statement in `ASCIIFoldingFilter` causes slow performance.
- Should we make it easier to get the per-hit matching scorers in `DisjunctionScorer`?
- We finally require all terms passed to `TermInSetQuery` to come from the same field.
- Sometimes, on backporting a complex change like index sorting, you find things you should fix back on the original master commit too.
- We now use the JDK's `Arrays.binarySearch` instead of our own implementation in `BaseCharFilter`.
- It's unreasonably difficult to intersect an automaton with the terms from doc values.
- Starting with Lucene 7.0, `IndexWriter` will reject broken offsets.
- `FieldComparatorSource.newComparator` no longer throws `IOException`.
- `WordDelimiterGraphFilter` cannot correct offsets when characters were remapped before tokenization, and it now corrects offsets that try to go backwards.
- The new `FunctionScoreQuery` and `FunctionMatchQuery` use the new `DoubleValuesSource` for scoring and matching.
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!