This Week in Elasticsearch and Apache Lucene - 2016-07-26
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
Wrote a post about combining #elasticsearch RestClient with #jtwig templates for creating and executing queries: https://t.co/LO8fCj7lqk
— Jettro Coenradie (@jettroCoenradie) July 25, 2016
Elasticsearch Core
Changes in 2.x:
- S3 repository now supports path style access for virtual hosting of buckets.
- S3 and EC2 should allow for different AWS key pairs.
- not_analyzed string fields should reject position_increment_gap.
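For readers unfamiliar with the two S3 addressing styles, here is a minimal sketch of how the resulting URLs differ. The helper names, bucket and key are made up for illustration, and this is not how the repository plugin itself builds requests.

```python
def virtual_hosted_url(endpoint: str, bucket: str, key: str) -> str:
    """Virtual-hosted style: the bucket name is part of the host name."""
    return f"https://{bucket}.{endpoint}/{key}"


def path_style_url(endpoint: str, bucket: str, key: str) -> str:
    """Path style: the bucket name is the first segment of the URL path."""
    return f"https://{endpoint}/{bucket}/{key}"


if __name__ == "__main__":
    # Hypothetical bucket and key, used only to show the two shapes.
    print(virtual_hosted_url("s3.amazonaws.com", "my-snapshots", "index-0"))
    print(path_style_url("s3.amazonaws.com", "my-snapshots", "index-0"))
```

Path style tends to matter for S3-compatible stores that cannot resolve one host name per bucket.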
Changes in master:
- The new scaled_float field type allows e.g. percentages to benefit from compression techniques used for long fields.
- The result cache can be explicitly enabled per-request even when it returns search hits.
- Reindex throttling is now disabled with -1 instead of unlimited.
- Automatically created indices should honour index.mapper.dynamic.
- The analyze API now supports defining custom character filters, token filters, and tokenizers inline.
- Indexing into a relocating primary while replicas are recovering will no longer result in document loss.
- Elasticsearch should reject dynamic templates with unknown match_mapping_type.
- The Java REST client now supports async and blocking requests, and a benchmark compares transport client to HTTP.
- Blocking tasks should not be run on the Scheduler thread.
- Resetting a recovery respects reference counting and locking, closes streams and removes all files.
- The cardinality aggregation now has a fixed default precision which is easier for the user to understand.
- The request circuit breaker now takes aggregation buckets into account.
- Automatically generated node names now persist after the node restarts.
- Scripts used in ingest pipelines preserve the original exception for easier debugging.
- Triggering on_failure should halt any further ingest processing.
- Analyzer aliases now work correctly, but will be removed for 5.0.
- CORS default settings for allow-methods and allow-headers were not being used.
- A Netty4 module is available, but depends on as yet unreleased bug fixes in Netty.
- Nested queries can be used inside nested aggregations.
- Static methods on Store class need to be shard lock aware to avoid race conditions.
- All aggs have been moved to use NamedWritable instead of AggregationStreams.
- Plugins registering queries should use the SearchPlugin interface.
- Mappings introduced a _parent#null field when parent/child was not used.
- TCP transports should map their internal exceptions to those defined by the TCPTransport class.
- Index, update, and create REST requests return a LOCATION header.
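To make the scaled_float idea from the list above concrete, here is a minimal sketch of the encoding: a decimal value is multiplied by a fixed factor and stored as an integer, which is what lets it reuse the compression applied to long fields. The function names are illustrative, and the rounding shown is Python's, not necessarily what Elasticsearch does internally.

```python
def encode(value: float, scaling_factor: int) -> int:
    """Store a decimal as an integer by scaling it up and rounding."""
    return round(value * scaling_factor)


def decode(stored: int, scaling_factor: int) -> float:
    """Recover the (possibly rounded) decimal from the stored integer."""
    return stored / scaling_factor


if __name__ == "__main__":
    # A percentage kept to two decimal places, assuming a factor of 100.
    stored = encode(12.34, 100)
    print(stored, decode(stored, 100))
```

The trade-off is precision: anything finer than 1 / scaling_factor is lost when the value is rounded.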
Ongoing:
- The search relevance framework gains a REST interface and support for reciprocal rank and discounted cumulative gain.
- A command line tool will let you discard the data in a corrupt transaction log so that the data already in the index can be recovered.
- Histograms and date histograms are being split so that the former can bucket on decimal values.
- Elasticsearch should be able to listen to virtual network interfaces.
- The write_consistency setting will be replaced with wait_for_active_shards which better describes the intent.
- The new completion suggester should return matching documents.
- Similarities should be dynamically updatable.
- Reindex from remote should support reading from clusters that require authentication.
- Rally should decouple job scheduling from execution, which will allow support for multiple load generators.
- Configuring network partitions in tests should be easier.
- Shard copies should only be marked as stale after an acknowledged write.
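As background for the two relevance metrics mentioned in the list above, here is a small sketch of their textbook definitions. It is not the search relevance framework's implementation; the function names are illustrative, and the DCG shown uses the simple linear-gain form (another common variant uses 2^gain - 1 as the numerator).

```python
import math
from typing import Sequence


def reciprocal_rank(relevant: Sequence[bool]) -> float:
    """Return 1 / rank of the first relevant hit, or 0.0 if there is none."""
    for position, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(
        gain / math.log2(position + 1)
        for position, gain in enumerate(gains, start=1)
    )


if __name__ == "__main__":
    # Hits listed in ranked order; the third hit is the first relevant one.
    print(reciprocal_rank([False, False, True]))
    # Graded relevance judgements for the top three hits.
    print(dcg([3, 2, 0]))
```

Reciprocal rank only rewards the first good hit, while DCG rewards every graded hit and discounts those further down the list.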
Apache Lucene
- CustomAnalyzer was accidentally switched to use the wrong default attribute factory, but Uwe fixed it
- A new DoubleRangeField will index a multi-dimensional range, such as a day range in a calendar, using dimensional points which RangeFieldQuery can then search by overlapped range; IntRangeField, FloatRangeField and LongRangeField are coming next
- Cached TermQuery no longer seeks the terms dictionary
- Indexing performance tests on the 1.2B-document New York City taxi rides corpus uncovered performance problems when writing dimensional points to disk using a large indexing buffer
- MemoryIndexReader.fields accidentally became 5X slower recently
- Dimensional points were failing to enforce maximum per-dimension byte count correctly
- The flexible query parser will also be fixed to not pre-split on whitespace, letting the analyzer do that instead
- DecimalDigitFilter has problems with digits that use Unicode's non-BMP supplemental characters
- ant 1.9.6 somehow breaks Lucene's generated file formats javadocs link
- Now that coord is gone, BooleanQuery can be optimized to flatten any nested disjunctions, often created by rewritten queries
- Some small javadocs improvements were made to LeafFieldComparator
- The obsoleted ScoringWrapperSpans is gone
- SpanScorer had poor assert messages
- Lucene's builds now run to completion even if some tests fail
- JGit is upgraded to version 4.4.1 in Lucene's build scripts
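The nested-disjunction flattening mentioned in the list above can be pictured with a toy model. This sketch is not Lucene's API: Term and Or are hypothetical stand-ins, and it ignores everything a real BooleanQuery carries besides optional clauses (required or prohibited clauses, minimum-should-match, boosts).

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Term:
    """Hypothetical leaf query matching a single term."""
    text: str


@dataclass(frozen=True)
class Or:
    """Hypothetical pure disjunction: matches if any clause matches."""
    clauses: Tuple["Query", ...]


Query = Union[Term, Or]


def flatten(query: Query) -> Query:
    """Inline nested disjunctions so that Or(a, Or(b, c)) becomes Or(a, b, c)."""
    if isinstance(query, Term):
        return query
    flat = []
    for clause in query.clauses:
        clause = flatten(clause)  # flatten bottom-up first
        if isinstance(clause, Or):
            flat.extend(clause.clauses)  # lift the inner clauses one level
        else:
            flat.append(clause)
    return Or(tuple(flat))


if __name__ == "__main__":
    nested = Or((Term("a"), Or((Term("b"), Or((Term("c"), Term("d")))))))
    print(flatten(nested))
```

A plausible reading of why coord mattered: it made a score depend on how many clauses of each query matched, so lifting clauses between levels would have changed scores.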
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!