This Week in Elasticsearch and Apache Lucene - 2016-08-01
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
The customer asked, "What was that?" We replied, "Oh, that was Elastic." https://t.co/OwY8E4L3eX via @elastic
— Christoph Wurm (@ChristophWurm) July 26, 2016
Elasticsearch Core
Changes in 2.x:
- A multi-match query with wildcard field names which result in no matching fields now returns a no-match query.
- The plain highlighter should ignore parent/child queries.
- Upgraded to Lucene 5.5.2.
- All CRUD requests return an _operation to indicate what action was taken, although this will be renamed to result. The created and found responses have been deprecated, and the Java methods removed.
- The foreach ingest processor did not allow mutating other fields in the document.
- The template query has been deprecated in favour of search templates.
- Reindex-from-remote now supports basic authentication, and doesn't need as many threads as provided by the REST client by default. The _version field is only requested if needed.
- An overflow bug in the JVM was causing disk-free-space on massive drives to be reported as negative.
- Rolled over index names are zero-padded by default, unless a name for the new index is provided.
- The default shard_size used for aggregations was overly aggressive, resulting in more memory use than required.
- Snapshot UUIDs are used to identify associated blobs, making blob deletion safer.
- The count API now accepts an empty search body, just like search.
- forced_refresh should only be included in the response if forced refresh was requested.
- The Java REST client has simpler sniffer initialisation.
- Refactored variable chains in Painless to make the AST much more natural.
- Time values have case-insensitive units except for m (minutes) to avoid confusion with M (months).
- Index wildcards in cluster state requests are now also applied to the routing table.
- The cat-shards request didn't support index patterns either.
- The elasticsearch-translog command line tool allows truncation of corrupted translogs to salvage data in the index.
- The netty4 module now depends on a released version of Netty, rather than our own temporary fork.
- Jars required by the transport client have been renamed to include -client in the artifact ID.
- The get-pipeline request now returns a named hash instead of an array for consistency with other APIs.
- Fixed explanation for function_score queries where no filters match.
- Index, update, and create PUT requests now return the LOCATION header of the new document.
- The _gce_ networking special value is available in the GCE discovery plugin, even when not using gce discovery.
- The EC2 discovery plugin now uses the recommended DefaultAWSCredentialsProviderChain to discover credentials.
- The YAML REST tests are now called ClientYamlTest... to separate them from Java tests which test the REST layer.
- The completion suggester should return documents instead of individual fields.
- Matching documents returned by the search relevance framework should include the _index and _type.
- Elasticsearch should be able to bind to virtual network interfaces.
- Upgrade Jackson to 2.8.0 to fix a bug producing invalid JSON.
- The recoverySource field in ShardRouting provides explicit information about where the shard is recovering from.
- Inlined parameters in scripts cause too frequent script compilations.
- Concurrent Store metadata listing has race conditions with index writes.
- A bug could cause dangling indices to be deleted instead of imported.
- The write_consistency parameter will be replaced with wait_for_active_shards.
- Cancellable threads should not allow Thread#interrupt().
- Regular histograms should be split from date histograms to allow fractional and negative buckets.
- A primary should only be able to fail a replica if it has the current primary term.
- The function score query can use a script to combine scores from other functions.
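The zero-padded rollover naming mentioned in the list above can be illustrated with a small sketch. This is not Elasticsearch code: the helper name next_rollover_name, the assumption that index names end in -<number>, and the six-digit padding width are all assumptions made for illustration, since the post states none of them.

```python
import re


def next_rollover_name(current, explicit_name=None, width=6):
    """Hypothetical helper: name of the index that follows `current`.

    Assumes index names end in "-<number>"; the padding width is an
    illustrative default, not something stated in the post.
    """
    if explicit_name is not None:
        # A caller-supplied name is used as-is, with no padding applied.
        return explicit_name
    match = re.fullmatch(r"(.*-)(\d+)", current)
    if match is None:
        raise ValueError(f"index name {current!r} does not end in -<number>")
    prefix, counter = match.groups()
    # Increment the numeric suffix and left-pad it with zeros.
    return f"{prefix}{int(counter) + 1:0{width}d}"
```

Zero-padding keeps generated names in the same order whether they are sorted as text or by number, which is presumably the motivation for making it the default.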
Apache Lucene
- Writing dimensional points will be substantially faster in 6.2 thanks to two separate changes, showing a 38% overall speedup when indexing 1.2 billion NYC taxi rides and obsoleting this prior change
- Tokenizing the Myanmar language got unexpectedly worse so we've restored the old syllable tokenization
- AssertingPointsFormat had a silly bug preventing it from checking the wrapped points implementation correctly
- MinHashFilter intentionally uses fall-through in its switch statements
- IndexWriter was way too verbose when indexing threads become stalled because writing new segments can't keep up
- The "thin wrapper" Lucene demo server has been moved to its own github project
- Near-real-time replication was missing some public APIs, uncovered when folding it into the Lucene demo server
- MemoryIndexReader.fields is no longer accidentally 5X slower
- Dimensional points were failing to enforce maximum per-dimension byte count correctly
- The new RangeFieldQuery, to index intervals and search by overlapping ranges, had a buggy equals implementation
- Nested span queries somehow broke between 4.10.x and today
- FastVectorHighlighter may have a performance regression since 4.10.x
- Efforts to provide analyzers based on OpenNLP project are progressing after years of dying-on-the-vine
- A test bug in MoreLikeThisTest still remains tricky to fix
- MemoryIndex needs some cleanup, including a builder API to create an immutable instance
- A new GeoBoundingBoxField will index a lat/lon bounding box as a single 4D point, and could also index altitude as a 3rd dimension
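To make the last item more concrete, here is a minimal sketch of how a bounding box can be treated as a single 4D point. It is not the Lucene implementation: the BoxPoint type, the dimension order (min_lat, min_lon, max_lat, max_lon) and both function names are assumptions chosen for illustration, and boxes that cross the dateline are not handled.

```python
from typing import NamedTuple


class BoxPoint(NamedTuple):
    """A bounding box viewed as one point in 4 dimensions (hypothetical layout)."""
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float


def encode_box(min_lat, min_lon, max_lat, max_lon):
    """Validate the corners and pack them into a single 4D point."""
    if min_lat > max_lat or min_lon > max_lon:
        raise ValueError("min corner must not exceed max corner in this sketch")
    return BoxPoint(min_lat, min_lon, max_lat, max_lon)


def intersects(indexed, query):
    """True if the two boxes overlap (touching edges count as overlap).

    Each comparison constrains exactly one dimension of the indexed point,
    so the whole test is a per-dimension range check on a 4D point.
    """
    return (
        indexed.min_lat <= query.max_lat
        and indexed.min_lon <= query.max_lon
        and indexed.max_lat >= query.min_lat
        and indexed.max_lon >= query.min_lon
    )
```

Because every condition bounds a single dimension, an overlap search reduces to a multi-dimensional range query, which is the kind of query dimensional points are built to answer.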
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!