This Week in Elasticsearch and Apache Lucene - 2017-01-23
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Adjacency matrix aggregation
We have a new adjacency_matrix aggregation, which lets you analyze the co-occurrence of filters: for a given list of terms, it can tell you how often each pair of terms occurs together. It was built to improve the graph functionality so that users can better explore how different nodes in a graph are connected. For instance, nested under a date_histogram, it could help analyze how fraudulent bank accounts have exchanged money over time. It is likely, however, that users will find exciting use cases for this aggregation outside of the graph context.
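As a rough sketch, a request along these lines (the index, field, and filter names here are hypothetical) would produce one bucket per named filter plus one bucket per intersecting pair, such as accountA&accountB:

```json
GET /transactions/_search
{
  "size": 0,
  "aggs": {
    "interactions": {
      "adjacency_matrix": {
        "filters": {
          "accountA": { "terms": { "accounts": ["a"] } },
          "accountB": { "terms": { "accounts": ["b"] } },
          "accountC": { "terms": { "accounts": ["c"] } }
        }
      }
    }
  }
}
```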
Improved performance of numeric range queries
When you ask Lucene to filter a query with a numeric range, its first step is to build a bitset marking all documents accepted by the range filter, by visiting the dimensional points (BKD tree). This can result in unexpectedly poor performance when the range accepts many documents but the other parts of the query are restrictive. With this change, quietly pushed this past week for the future Lucene 6.5.0 release, Lucene is now smarter: it checks up front the expected cost of enumerating all hits for the range versus the expected cost of the other query clauses, and if the range is more costly, it instead uses the other clauses to enumerate candidate hits and, for each hit, uses doc values rather than dimensional points to check whether it falls within the range filter. For queries that combine restrictive clauses with non-restrictive ranges this can be an enormous speedup. All that is required is that you index your range fields with both doc values and points, and then use the IndexOrDocValuesQuery, which just graduated to Lucene's core module, to express the range at search time.
Changes in 5.2:
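A minimal sketch of the dual-indexing setup described above (class names follow later Lucene releases; in particular, newSlowRangeQuery is the name used for the doc-values-backed range query in newer versions and may differ in the 6.5-era API):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

class PriceRange {
  // Index the field twice: once as dimensional points, once as doc values.
  static void addPrice(Document doc, long price) {
    doc.add(new LongPoint("price", price));                   // BKD points
    doc.add(new SortedNumericDocValuesField("price", price)); // doc values
  }

  // Wrap both range representations so Lucene can compare their estimated
  // costs and pick the cheaper execution strategy per query.
  static Query priceRange(long min, long max) {
    Query points = LongPoint.newRangeQuery("price", min, max);
    Query docValues = SortedNumericDocValuesField.newSlowRangeQuery("price", min, max);
    return new IndexOrDocValuesQuery(points, docValues);
  }
}
```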
- Upgraded to Lucene 6.4.0.
- Document failures need specific handling to separate them from other exceptions in InternalEngine, which should fail the IndexWriter. EngineClosedException has also been replaced with Lucene's AlreadyClosedException.
- When a weight, the score was incorrectly returned as
- Close InputStream when receiving cluster state in PublishClusterStateAction to avoid leaking memory.
- Index creation and settings update may not return deprecation logging headers. Deprecation headers should also be preserved when present on a transport protocol response.
- Snapshot/restore now checks the index.latest blob instead of trying to pick the highest numbered one.
- Shadow replicas are deprecated.
- The flatten_graph token filter takes the output of the synonym_graph token filter and flattens it for indexing.
- The stempel token filter was not thread safe.
- The new cross-cluster search client has landed.
- Custom routing can now target a group of shards instead of just a single shard.
- The Java High-Level REST client continues to expand, now supporting delete responses, search-shard-failure responses, and search-profile-shard responses, among others.
- Painless can now encode and decode base64.
- The profile API should return timings in machine-readable format by default, with human-readable output as an opt-in.
- Logging now exposes the logs base-path, the cluster name, and the node-name for more log-file flexibility.
- Certain comma-delimited array settings now accept trailing spaces.
- Guice has been removed from REST handlers.
- The S3 repository plugin now accepts secure settings, so specifying settings via environment variables, system properties, or profile files is now deprecated.
- All booleans are strictly parsed and now accept only true and false.
- After three years of existence, aggregations are finally getting unit tests.
- Blocking versions of the TCP transport server, TCP transport client, and HTTP server have been removed.
- CRUD requests which result in version conflicts should still be replicated to ensure a complete sequence ID history.
- Core no longer requires the accept SocketPermission. Related to this, the REST http client used by (e.g.) reindex is now wrapped in a doPrivileged block, as are socket connection operations in plugins.
- A new Azure ARM discovery plugin.
- A content-type header will be required for all HTTP requests with a body.
- The LogLogBeta algorithm for the cardinality aggregation may be more efficient and more accurate than HyperLogLog.
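To use the graph token filters mentioned above at index time, an analyzer can chain synonym_graph with flatten_graph, since the index itself cannot store a token graph. A sketch of such a setup (index, filter, and analyzer names here are illustrative):

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym_graph",
          "synonyms": ["ny, new york"]
        },
        "my_flatten": { "type": "flatten_graph" }
      },
      "analyzer": {
        "index_time_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms", "my_flatten"]
        }
      }
    }
  }
}
```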
Apache Lucene
- The Lucene 6.4.0 release vote has passed and the bits will soon be set free, despite Maven fighting back!
- Lucene now more carefully picks how to execute possibly costly range filters
- IndexOrDocValuesQuery, which can run the same range filter either with points or with doc values, has graduated from sandbox to core but seems to have caused an intermittent failure
- Can we make it easier to build graph token filters?
- Query parsers should create much more efficient queries when they encounter a graph token stream, to avoid a combinatorial explosion
- If a range filter will match more than half the index we now speed up the filtering by inverting the range
- WordDelimiterFilter finally works correctly with positional queries at search time
- EdgeNGramTokenFilter deletes incoming token payloads
- Lucene 6.5.0 will now sort new segments when they are initially written and not on first merge
- FieldTerms doesn't seem a much better name than Terms (naming is hard!)
- Suffix arrays can provide fast leading-wildcard searches, but the merits vs. an FST on the reversed string are debatable
- An innocent issue about adding sugar constructors to TermQuery quickly led to a removal proposal instead
- Why does Java let you put more than one class into a source file?
- We are now taking more of
- HTMLStripCharFilter should not include the closing tag in its token offsets
- Can we fix CompressingStoredFieldsFormat to release native memory more aggressively without harming stored fields retrieval performance too much?
- Geo3D hit another curious failure
- Lucene is not designed to support millions of unique fields
- SpanNearQueries miss some hits today, but the fix is surprisingly tricky and only addresses some cases
- The Kuromoji (Japanese) tokenizer fails if the user tries to include # in their dictionary
- Estimation of the number of hits in dimensional points gets faster
- Some regular expressions may cause a stack overflow
- The very large switch statement in ASCIIFoldingFilter causes slow performance
- Should we make it easier to get the per-hit matching scorers?
- We finally require all terms in a TermInSetQuery to come from the same field
- Sometimes, on backporting a complex change like index sorting, you find things you should fix back on the original master commit too
- We now use the JDK's Arrays.binarySearch instead of our own implementation
- It's unreasonably difficult to intersect an automaton with the terms from doc values
- Starting with Lucene 7.0, IndexWriter will reject broken offsets
- FieldComparatorSource.newComparator no longer throws IOException
- WordDelimiterGraphFilter cannot correct offsets when characters were remapped before tokenization, and it now corrects offsets that try to go backwards
- The new FunctionMatchQuery uses the new DoubleValuesSource for scoring and matching.
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!