This Week in Elasticsearch and Apache Lucene - 2016-05-17
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
Video of my "Ingest Node: (re)indexing and enriching documents within #elasticsearch" talk given at #DevCon16 is up
— Luca Cavanna (@lucacavanna) May 13, 2016
Elasticsearch Core
Changes in 2.x:
- Reindex throttling has been backported.
- Fixed an NPE in simple_query_string that was triggered by a wildcard on a stopword.
- CORS handling no longer checks for User-Agent.
- CORS permits same-origin requests.
- Fixed CONTAINS relationship in geo_shape query.
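As a rough illustration of what reindex throttling means in practice, here is a minimal Python sketch of one plausible scheme: after each batch, pause long enough that the average rate stays at or below a target. The helper name `throttle_delay` and the formula are assumptions made for this example, not a copy of the Elasticsearch implementation.

```python
def throttle_delay(batch_size, requests_per_second, seconds_spent_writing):
    """Return how many seconds to pause before fetching the next batch.

    Hypothetical helper: a batch of `batch_size` docs earns a time budget of
    batch_size / requests_per_second seconds. Whatever part of that budget the
    write did not use is spent waiting.
    """
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    budget = batch_size / requests_per_second
    # Never return a negative delay when the write was slower than the budget.
    return max(0.0, budget - seconds_spent_writing)
```

With a target of 500 docs per second, a 1000-doc batch that took half a second to write would wait for the rest of its two-second budget, while a batch that took three seconds would not wait at all.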
Changes in master:
- Dots in field names are supported again!
- Shell scripts in Elasticsearch now depend on bash instead of sh.
- The plugin script no longer accepts Java system properties as command line params.
- The fieldstats API now only returns info for fields that exist in the Lucene index.
- Scripting engine settings no longer support the "sandbox" option - they accept only true or false.
- Scripting engines can now register only a single script type name and file extension.
- The reindex and update-by-query APIs now both return a BulkIndexByScrollResponse.
- The Painless scripting language received many low-level performance and usability improvements.
- Elasticsearch min/max heap now defaults to 256MB and 2GB.
- Shard routing is now immutable.
- Fixed a concurrency bug in IndexingMemoryController which could result in miscounts and even OOM.
- Iterables.flatten no longer pre-caches the first iterator.
- Reindex batches default to 1000 docs, instead of 100.
- Added missing setting: `discovery.ec2.tag.project`.
- The cat-fielddata API now returns fields as rows instead of columns.
- The significant terms agg can now be used on fields indexed as points (i.e. date, numeric, ip).
- Dangling indices are no longer imported if a tombstone for the index exists.
- The fingerprint analyzer now dedupes tokens after ASCII-folding.
- The in_flight requests circuit breaker now excludes PingRequest and MasterPingRequest.
- Terms aggs on IP fields return IP addresses as string keys.
- The `fuzziness` parameter now throws an exception when used in multi_match queries of type cross_fields, phrase, or phrase_prefix, instead of being silently ignored.
- Fuzzy, regexp, prefix, and wildcard queries can now be used only on text/keyword fields. Attempting to use them on numeric, date, ip, _id, or _uid fields will throw an exception.
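To see why the order of operations matters for the fingerprint analyzer change above, here is a simplified Python sketch. It is not the Elasticsearch analyzer: `ascii_fold` is a crude stand-in for ASCII folding, and `fingerprint` with its `fold_first` flag is a hypothetical helper that only lowercases, splits on whitespace, sorts and dedupes.

```python
import unicodedata


def ascii_fold(token):
    # Crude stand-in for ASCII folding: decompose accented characters
    # and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint(text, fold_first=True):
    """Build a sorted, deduplicated fingerprint of `text` (illustrative only)."""
    tokens = text.lower().split()
    if fold_first:
        # Fold, then dedupe: tokens that differ only by accents collapse.
        unique = sorted(set(ascii_fold(t) for t in tokens))
    else:
        # Dedupe, then fold: accent variants survive as duplicates.
        unique = [ascii_fold(t) for t in sorted(set(tokens))]
    return " ".join(unique)
```

For the input "Café cafe latte", folding before deduping gives "cafe latte", whereas deduping before folding leaves the repeated token in place and gives "cafe cafe latte".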
Ongoing:
- The profiler is being refactored to make way for profiling more than just queries.
- Snapshot index files should be written atomically and reflect the true contents of the snapshot.
- Deprecation warnings should be returned to HTTP clients as headers.
- Matrix aggregations bring multi-field correlations.
- Work continues on making new point-based IP fields backwards compatible.
- Splitting scroll requests for processing by multiple consumers.
- Block indexing requests until their changes have been refreshed and are visible to search.
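The idea behind splitting a scroll across multiple consumers can be sketched in a few lines of Python. This work is still in progress, so the partitioning rule here (a stable hash of the document id modulo the number of slices) and the names `slice_for` and `split_into_slices` are assumptions for illustration only.

```python
import zlib


def slice_for(doc_id, max_slices):
    # Use a stable hash (CRC32) so every consumer computes the same slice
    # for a given id. Python's built-in hash() is randomized per process.
    return zlib.crc32(doc_id.encode("utf-8")) % max_slices


def split_into_slices(doc_ids, max_slices):
    """Partition doc ids into `max_slices` disjoint lists (hypothetical helper)."""
    slices = [[] for _ in range(max_slices)]
    for doc_id in doc_ids:
        slices[slice_for(doc_id, max_slices)].append(doc_id)
    return slices
```

Because each id maps to exactly one slice, consumers can work through their slices independently and together still see every document once.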
Apache Lucene
- Index-time sorting is now supported directly in Lucene core, but we still need to take advantage of a sorted index at search time by default, and explore sorting during flush as well
- The legacy `SlowCompositeReaderWrapper`, an awful class that inefficiently tries to pretend you have only one segment in your index, can at long last move away from Lucene
- Lucene's classic query parser should let the analyzer handle splitting tokens on whitespace if necessary rather than do that itself
- The legacy spatial module could wrap the new `GeoPointField` as a `SpatialStrategy`
- Maybe we should more aggressively compress the terms dictionary?
- Our `ant clean-jars` task struggles with symbolic links
- `DateRangePrefixTree` lets you control the calendar template
- `JapaneseTokenizer` sometimes unexpectedly throws `ArrayIndexOutOfBoundsException`s
- More improvements to `geo3d` polygon handling
- `Geo3d` needs a doc-values field to enable sort-by-distance
- Upgrading our JFlex-based tokenizers to Unicode 8.0 is tricky
- WebStart's security manager is angry about Lucene's `Constants` class checking system properties
- Why are `equals` and `hashCode` not abstract in the `Query` base class?
- Should `MatchNoDocsQuery` include a reason for its creation?
- `TestMoreLikeThis` is still failing likely due to these recent changes
- A new LSH (locality sensitive hashing) `TokenFilter` and query is an alternative to the standard `MoreLikeThisQuery`
- We have too many confusing versions in Jira
Watch This Space
Stay tuned to this blog, where we'll share more on the whole Elastic ecosystem, including news, learning resources and cool use cases!