This Week in Elasticsearch and Apache Lucene - 2016-08-29
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
Had an awesome time @elastic #meetup presenting zero-downtime re-indexing of #elasticsearch @signalfx - slide deck: https://t.co/AiTdCAoFNW
— Mahdi Ben Hamida (@mahdouch) August 25, 2016
Elasticsearch Core
Changes in 2.x:
- Added ref counting to SearchContext to avoid unexpected AlreadyClosedExceptions which could rarely lead to a SIGSEGV on mmap'ed directories.
- Non-scoring term queries on the `_all` field were all considered equal when nested inside a `bool`, and when the `_all` field has different per-field boosts.
- Lucene upgraded to 6.2.0.
- Jackson upgraded to 2.8.1.
- Painless is the new default scripting language. Fixed a bug when using `break` in `for` loops.
- Realtime GET is now handled by doing an automated refresh instead of reading from the translog.
- Fsync'ing documents is now performed asynchronously, which delivers a 15-30% speedup on slow disks.
- If disaster strikes, AlreadyClosedExceptions should not be suppressed.
- RAM usage estimation of the LiveVersion map was way too high.
- GET requests no longer support `fields`: `stored_fields` returns stored fields only, while `_source` filtering reads from the source.
- Shards should not be marked as stale just because a node has been shut down. They should only be marked as stale if there is a subsequent write.
- Blank field names are no longer accepted.
- The `_version` field should not be indexed.
- The phase to fetch stored fields can be skipped entirely by setting `stored_fields: _none_`. Especially useful for returning completion suggester results.
- `keyword` fields are now indexed and stored as binary values to avoid an extra UTF8 conversion.
- Numbers do not need to be parsed as strings if they are not included in `_all`.
- Agg profiling did not support `breadth_first` mode correctly.
- ShardRouting now includes RecoverySource which characterises the type of recovery that should be performed.
- Ingest pipelines should not be invalidated on every cluster state update.
- `cluster.routing.allocation.same_shard.host` setting had not been migrated to the new settings infra.
- The script ingest processor should support params, like all other scripts.
- Mapping settings `numeric_detection`, `date_detection`, and `dynamic_date_formats` were not dynamically updatable.
- Cluster stats now report whether netty3 or netty4 is being used.
- The `client-benchmark-noop-api-plugin` makes Elasticsearch do nothing, removing noise from client benchmarks.
- Avoid initialising the logger prematurely.
- Async methods in the REST client now have Async appended to distinguish them from sync methods.
- The `index_boost` query was not being cached because of indeterminate hash ordering.
- The default cluster settings now accurately reflect which scripting engines are enabled.
- Source filtering on docs with source disabled could trigger an NPE.
- Date range queries should be generated in a way that they can be cached efficiently.
- It should be possible to update string mappings on 2.x indices using 5.x syntax.
- An action is being added to update-aliases so that deleting an index and adding an index alias happen as a single atomic step.
- Log4j2 PR will be landing soon.
- Ingest processors should support `ignore_missing` as well as `ignore_failure`.
- RankEval requests gain XContent support with roundtrip testing.
- Should the query cache keep track of a longer history?
- Ingest will gain a JSON processor.
- How should dots in field names be supported by ingest?
- Should feature usage stats have its own API or be included in node stats?
- Geo-points in 5.0 will be backed by LatLon fields, which are much faster.
- Macrobenchmarks in Rally are being integrated with CI. Next up, Cloud benchmarking.
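
Two of the items above change what a request body looks like, so here is a rough sketch of both in Python. It only builds the JSON bodies and does not talk to a cluster. The bullets do not spell out the exact JSON, so treat the `remove_index` action name and the string form `"_none_"` as assumptions based on the syntax as we understand it for 5.x. The index and alias names (`logs_v1`, `logs_v2`, `logs`) and the helper function names are purely hypothetical.

```python
import json


def atomic_alias_swap(old_index, new_index, alias):
    """Body for POST /_aliases: add the alias and delete the old index in one request.

    Assumption: the new update-aliases action is called `remove_index`.
    """
    return {
        "actions": [
            {"add": {"index": new_index, "alias": alias}},
            {"remove_index": {"index": old_index}},
        ]
    }


def search_without_stored_fields(query):
    """Body for a search that skips the stored-fields fetch phase.

    Assumption: `stored_fields: _none_` is written as the string "_none_" in JSON.
    """
    return {"stored_fields": "_none_", "query": query}


if __name__ == "__main__":
    # Hypothetical index and alias names, for illustration only.
    print(json.dumps(atomic_alias_swap("logs_v1", "logs_v2", "logs"), indent=2))
    print(json.dumps(search_without_stored_fields({"match_all": {}}), indent=2))
```

Because both alias actions travel in a single request, readers of the alias should never see a moment where the old index is gone and the new alias is not yet in place, which is the point of making the step atomic.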
Apache Lucene
- Lucene 6.2.0 is released
- Some dead yet scary code deep inside `IndexWriter` is now gone
- `BooleanQuery` now optimizes better when a sub-query occurs more than once
- The release script helper that polls mirrors was rewritten from Perl to a better programming language, also starting with P, that has batteries included
- A few more fun `geo3D` corner case failures are fixed
- More dead code is gone
- Test should use `CannedTokenStream`
- A rare test bug caused a failure because Lucene now refuses to overwrite a file
- `InetAddressPoint`'s javadocs for `newSetQuery` were confusing
- Legacy numeric fields continue to disappear
- Sun's JDK bugs became Oracle's
- The security test policy needs to allow reading of the line docs file many tests use as a realistic documents source
- `ToParentBlockJoinCollector` should be removed
- A new regular expression engine using Memory Occurence Automata may lead to better regular expression queries, but we struggle to understand in which cases
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!