This Week in Elasticsearch and Apache Lucene - 2016-05-09
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
Detecting Geo-Temporal anomalies with #Elasticsearch pipeline aggs Blog post: http://bit.ly/1TKfoMf #gis
— Dave Erickson (@davebenigno) May 4, 2016
Elasticsearch Core
Changes in 2.x:
- Don't try to compute completion stats on a closed reader - can cause the JVM to crash.
- HTTP compression didn't work when CORS was enabled.
- Using a wildcard to specify which fields to highlight will only return string fields.
Changes in master:
- HTTP compression is enabled by default (compression level 3) and compressed HTTP requests can always be accepted regardless of whether compression is enabled or not.
- Lucene expressions gained the doc[field].empty property and support for geo_point fields.
- Painless field and method accesses got a 20% speed boost, and gained support for .value/.values (with a 400% speed boost) and support for geo_point fields.
- Painless now supports single quotes, which makes scripts in JSON much more readable.
- Scripting docs had a big rewrite.
- Bootstrap checks now check that the server JVM is in use.
- Added an escape hatch for those times when you don't have control over system properties that are checked in the bootstrap production checks.
- Aggregations like `top_hits` which need access to the _score in a sub-aggregation can now use `breadth_first`.
- Nodes now exchange a handshake when connecting, to exchange node information, cluster name, and Elasticsearch version. This will allow us to extend APIs that are used during initial cluster recovery without a major version change.
- Binary values like the new `ip` fields can now be used for sorting.
- The REST query string parser now understands semi-colons as separators, as well as ampersands.
- ES_JAVA_OPTS is now passed to elasticsearch-plugin.
- The deprecated `string` field cannot be combined with the new `index:true|false` setting.
- All nodes requests now include any node failures that occurred.
- All uses of Strings#splitStringToArray have been replaced with String#split.
- Reindex throttling requests are now parsed more strictly.
- Ingest now checks for missing processors instead of throwing an NPE, and a bug was fixed when collecting pipeline stats.
- We now test AUTOSENSE source snippets in the docs. This tester is being moved to a plugin to make it available to other projects.
- RPMs no longer specify parent directories like /var/run, to avoid clashing with other packages, and edited config files are preserved on upgrade.
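To illustrate the query string change listed above, here is a minimal sketch of a parser that treats both `&` and `;` as parameter separators. This is not Elasticsearch's actual implementation; the function name `parse_query_string` is hypothetical, and the sketch assumes that later duplicate keys simply overwrite earlier ones.

```python
import re
from urllib.parse import unquote_plus


def parse_query_string(query):
    """Parse a URL query string, accepting '&' or ';' between parameters.

    Hypothetical helper for illustration only; not Elasticsearch code.
    """
    params = {}
    # Split on the raw separators first, so an encoded ';' (%3B) inside a
    # value is not mistaken for a separator.
    for pair in re.split(r"[&;]", query):
        if not pair:
            continue  # ignore empty segments, e.g. a trailing separator
        key, _, value = pair.partition("=")
        params[unquote_plus(key)] = unquote_plus(value)
    return params


print(parse_query_string("pretty=true;size=10&from=0"))
```

Splitting before percent-decoding matters: a value such as `q=a%3Bb` keeps its literal semicolon, because only unencoded separators end a parameter.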
Ongoing changes:
- Work has started on refactoring snapshot/restore, with the first step being to remove the Snapshot class in favour of SnapshotInfo. More tasks listed here.
- The allocation explain API will indicate whether shard stats are still pending.
- Delete-by-query plugin is being moved to the reindex infrastructure.
- Remove the `es.` prefix from settings on the command line.
- Add `exclude_template` parameter to templates to avoid matching hidden indices.
Apache Lucene
- `geo3d` can now run the Russia polygon, with ~11.6 K vertices, producing nearly the same hit count as `LatLonPoint`
- Numerous `geo3d` polygon optimizations, including improvements to the polygon query tree; `geo3d` is now included in Lucene's nightly geo benchmarks going forward
- Optimize `TermsQuery` to use a `boolean` instead of `HashSet` to record whether only one field is queried
- The horrible schema ghost case in Lucene, where some documents in a segment use feature X, but then they were all deleted and merged away, yet the ghosts of feature X remain, continues haunting us
- Index-time sorting should be better supported in Lucene's core
- `XMLQueryParser` makes it hard for subclasses to create span queries
- Collecting filter hits is now faster by making better use of index statistics up front to save work
- A user finally hit one of our particularly satisfying recently added Lucene exceptions, but the exception message was missing some information
- A new patch on this long-standing and controversial issue adds support to encrypt doc values
- Add Logistic Regression support to Lucene's classification module
- We failed to retag `master` issues in our issue tracker after releasing 6.0
- Japanese (Kuromoji) tokenizer should allow filtering tokens from a provided parts-of-speech set
- 800+ new top-level-domains have been created since we last fixed `StandardTokenizer` to detect them, but the JFlex release is taking too long so we will proceed with Plan B
- `QueryParser` should let you sometimes use unescaped internal operator characters
- Doc values could optimize for the sparse use case, if we change their read-time API to be an iterator like postings
- How to grow a bitset as you add hits to it is tricky
- It's not clear our `Arrays.TimSort` is any better than the JDK's builtin `Arrays.sort`
- Can we speed up how to skip fields in compressed space when loading a subset of fields from a document?
- Lucene's default offset gap is sometimes surprising
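The `TermsQuery` item above comes down to a simple idea: to know whether every term targets the same field, a single flag is enough, and there is no need to collect field names into a set. The sketch below shows that idea only; it is not Lucene's code, and the function name `queries_single_field` and the `(field, value)` pair representation are assumptions made for the example.

```python
def queries_single_field(terms):
    """Return True if every (field, value) pair in `terms` uses the same field.

    Illustration only: tracks the answer with one boolean flag rather than
    building a set of all field names seen.
    """
    single_field = True
    first_field = None
    for field, _value in terms:
        if first_field is None:
            first_field = field  # remember the first field we encounter
        elif field != first_field:
            single_field = False  # a second distinct field settles the answer
            break
    return single_field


print(queries_single_field([("tag", "lucene"), ("tag", "search")]))
print(queries_single_field([("tag", "lucene"), ("user", "mike")]))
```

The flag approach avoids allocating and hashing one entry per term, and it can stop at the first mismatch.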
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!