This Week in Elasticsearch and Apache Lucene - 2017-01-23
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Cathay Pacific improves incident management with Elasticsearch https://t.co/PbtqBL45UV via @ComputerWorldHK
— Nancy Klahn (@MamaKlahn) January 23, 2017
Elasticsearch Core
Adjacency matrix aggregation
We have a new `adjacency_matrix` aggregation, which lets you analyze the co-occurrence of filters: given a list of terms, for example, it can tell you how often each pair of those terms occurs together. It was built to improve the graph functionality, so that users can better dive into how different nodes in a graph are connected. For instance, nested under a `date_histogram`, it could help analyze how fraudulent bank accounts have exchanged money over time. It is likely, however, that users will find exciting use cases for this aggregation outside the context of graph.
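As a rough sketch, a request using the new aggregation might look like the following Python snippet, which simply builds the JSON request body; the index field and filter names (`accounts`, `acct_a`, and so on) are hypothetical:

```python
import json

# Hypothetical example: count how often pairs of accounts appear together
# in the same transaction documents. The field name "accounts" and the
# filter keys "acct_a"/"acct_b"/"acct_c" are illustrative only.
request_body = {
    "size": 0,
    "aggs": {
        "interactions": {
            "adjacency_matrix": {
                "filters": {
                    "acct_a": {"term": {"accounts": "A"}},
                    "acct_b": {"term": {"accounts": "B"}},
                    "acct_c": {"term": {"accounts": "C"}},
                }
            }
        }
    },
}

print(json.dumps(request_body, indent=2))
```

The response then contains one bucket per filter plus one per non-empty intersection of filters (e.g. `acct_a&acct_b`), which is what makes the pairwise co-occurrence counts directly available.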
Improved performance of numeric range queries
When you ask Lucene to filter a query with a numeric range, its first step is to build a bitset marking all documents accepted by the range filter, by visiting the dimensional points (BKD tree). This can result in unexpectedly poor performance when the range accepts many documents but the other parts of the query are restrictive. With this change, quietly pushed this past week for the future Lucene 6.5.0 release, Lucene is now smarter: it checks up front the expected cost of enumerating all hits for the range versus the expected cost of the other query clauses, and if the range is more costly, it instead uses the other clauses to enumerate candidate hits and then uses doc values, rather than dimensional points, to check whether each hit falls within the range. For queries that combine restrictive clauses with non-restrictive ranges this can be an enormous speedup. All that is required is that you index your range fields with both doc values and points, and then use `IndexOrDocValuesQuery`, which just graduated to Lucene's core module, to express the range at search time.
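In spirit, the decision works like the following Python sketch. This is illustrative only: the real implementation operates on BKD point trees and doc-values iterators, and every name here is made up.

```python
def search(range_lo, range_hi, values, other_clause_docs):
    """Conjunction of a numeric range filter with another, possibly more
    restrictive, clause.

    values: dict of doc_id -> numeric value (a stand-in for doc values /
            points; in Lucene these are separate on-disk structures).
    other_clause_docs: sorted doc ids matching the other clause.
    Returns doc ids matching both.
    """
    # Estimate the cost of each strategy up front.
    range_cost = sum(1 for v in values.values() if range_lo <= v <= range_hi)
    other_cost = len(other_clause_docs)

    if range_cost <= other_cost:
        # Range is cheap: enumerate its matches (points-style) and intersect.
        range_docs = {d for d, v in values.items() if range_lo <= v <= range_hi}
        return [d for d in other_clause_docs if d in range_docs]

    # Range is expensive: let the other clause drive iteration and use the
    # per-document value (doc-values-style) only to verify each candidate.
    return [d for d in other_clause_docs if range_lo <= values[d] <= range_hi]


values = {0: 5, 1: 42, 2: 7, 3: 99}
print(search(0, 50, values, [1, 3]))  # -> [1]
```

Both branches return the same result; the point is that when the range matches most of the index, verifying a handful of candidates one by one is far cheaper than materializing the full range bitset.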
Changes in 5.2:
- Upgraded to Lucene 6.4.0.
- Document failures need specific handling in `InternalEngine` to separate them from other exceptions, which should fail the `IndexWriter`. Also replaced `EngineClosedException` with Lucene's `AlreadyClosedException`.
- When a `script_score` function in `function_query` combines `_score` and `weight`, the score was incorrectly returned as `0`.
- Close `InputStream` when receiving cluster state in `PublishClusterStateAction` to avoid leaking memory.
- Index creation and settings update may not return deprecation logging headers. Deprecation headers should also be preserved when present on a transport protocol response.
- Snapshot/restore now checks the `index.latest` blob instead of trying to pick the highest `index-N` blob.
- Shadow replicas are deprecated.
- The `flatten_graph` token filter takes the output of the `synonym_graph` token filter and flattens it for indexing.
- The `stempel` token filter was not thread safe.
- The new cross-cluster search client has landed.
- Custom routing can now target a group of shards instead of just a single shard.
- The Java High-Level REST client continues to expand, now supporting `delete` responses, search-shard-failure responses, search-profile-shard responses, and `update` responses.
- Painless can now encode and decode base64.
- The profile API should return timings in machine-readable format by default, with `?human` for human-readable output.
- Logging now exposes the logs base path, the cluster name, and the node name for more log-file flexibility.
- Certain comma-delimited array settings now accept trailing spaces.
- Guice has been removed from REST handlers.
- The S3 repository plugin now accepts secure settings, so specifying settings via environment variables, system properties, or profile files is now deprecated.
- All booleans are strictly parsed and now accept only `true`, `false`, `"true"`, or `"false"`.
- After three years of existence, aggregations are finally getting unit tests.
- Blocking versions of the TCP transport server, TCP transport client, and HTTP server have been removed.
- CRUD requests which result in version conflicts should still be replicated to ensure a complete sequence ID history.
- Core no longer requires the `accept` `SocketPermission`. Related to this, the REST HTTP client used by (e.g.) reindex is now wrapped in a `doPrivileged` block, as are socket connection operations in plugins.
- A new Azure ARM discovery plugin has been added.
- A content-type header will be required for all HTTP requests with a body.
- The LogLogBeta algorithm for the cardinality aggregation may be more efficient and more accurate than HyperLogLog.
Apache Lucene
- The Lucene 6.4.0 release vote has passed and the bits will soon be set free, despite Maven fighting back!
- Lucene now more carefully picks how to execute possibly costly range filters.
- `IndexOrDocValuesQuery`, which can run the same range filter either with points or with doc values, has graduated from `sandbox` to `core`, but seems to have caused an intermittent `NullPointerException`.
- Can we make it easier to build graph token filters?
- Query parsers should create much more efficient queries when they encounter a graph token stream to avoid combinatoric explosion
- If a range filter will match more than half the index, we now speed up the filtering by inverting the range.
- `WordDelimiterGraphFilter`, replacing `WordDelimiterFilter`, finally works correctly with positional queries at search time.
- `EdgeNGramTokenFilter` deletes incoming token payloads.
- Lucene 6.5.0 will now sort new segments when they are initially written, not on first merge.
- `FieldTerms` doesn't seem a much better name than `Terms` (naming is hard!).
- Suffix arrays can provide fast leading-wildcard searches, but their merits versus an FST on the reversed string are debatable.
- An innocent issue about adding sugar constructors to `TermQuery` quickly led to a proposal to remove `Term` entirely.
- Why does Java let you put more than one class into a source file?
- We are now taking more of javac's warnings seriously.
- `HTMLStripCharFilter` should not include the closing tag in its token offsets.
- Can we fix `CompressingStoredFieldsFormat` to release native memory more aggressively without harming stored-fields retrieval performance too much?
- `Geo3D` hit another curious failure.
- Lucene is not designed to support millions of unique fields.
- Nested `SpanNearQuery`s miss some hits today, but the fix is surprisingly tricky and only addresses some cases.
- The Kuromoji (Japanese) tokenizer fails if the user tries to include `#` in their dictionary.
- Estimation of the number of hits in dimensional points gets faster.
- Some regular expressions may cause a `NullPointerException` in `AutomatonTermsEnum`.
- The very large switch statement in `ASCIIFoldingFilter` causes slow performance.
- Should we make it easier to get the per-hit matching scorers in `DisjunctionScorer`?
- We finally require all terms passed to `TermInSetQuery` to come from the same field.
- Sometimes, on backporting a complex change like index sorting, you find things you should fix back on the original master commit too.
- We now use the JDK's `Arrays.binarySearch` instead of our own implementation in `BaseCharFilter`.
- It's unreasonably difficult to intersect an automaton with the terms from doc values.
- Starting with Lucene 7.0, `IndexWriter` will reject broken offsets.
- `FieldComparatorSource.newComparator` no longer throws `IOException`.
- `WordDelimiterGraphFilter` cannot correct offsets when characters were remapped before tokenization, and it now corrects offsets that try to go backwards.
- The new `FunctionScoreQuery` and `FunctionMatchQuery` use the new `DoubleValuesSource` for scoring and matching.
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!