This Week in Elasticsearch and Apache Lucene - 2016-07-18
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
I just published “How we reindexed 36 billions documents in 5 days within the same Elasticsearch cluster” https://t.co/7W20zpBHnq
— Fred de Villamil ✌︎ (@fdevillamil) July 6, 2016
Elasticsearch Core
Changes in 2.x:
- The fast vector highlighter was unable to extract highlight terms from nested queries.
- Netty3 is now a module, allowing us to add a Netty4 module which can also be shipped with Elasticsearch. During testing, Elasticsearch uses a blocking socket-based MockTcpTransport instead of relying on Netty directly.
- A create-index request (and similar) will now wait until an index is writable before returning. No more waiting for yellow.
- Creating a new index will now turn the cluster health yellow instead of red (unless something goes wrong), allowing sysadmins to sleep peacefully.
- Scroll requests have a safeguard on the maximum size, similar to the safeguard that already exists for search.
- The cluster allocation explain API can now report on disk usage.
- The Java REST client supports swapping out the internal HTTP client in order to add async support.
- Plugins extending search will need to implement SearchPlugin instead of onModule(SearchModule).
- Jars for plugins and modules which can be used with the TransportClient will be published to Maven.
- The foreach ingest processor now wraps a single processor only, instead of an array of processors. This makes the UI and error reporting simpler.
- Intentionally ignored errors in ingest pipelines are now reported by the verbose simulate method for easier debugging.
- The cluster-health API can now wait_for_events to ensure that (e.g.) settings updates have been propagated before proceeding.
- NotMasterException is important enough to be its own exception.
- The profile API was incorrectly counting children profile timings twice. Also, it now reports the number of times important methods are called.
- Settings may no longer be specified using the Java properties syntax.
- InnerHits should use rewritten queries to avoid later exceptions.
- The old script and template syntax deprecated in 1.x has been removed.
- All snapshot restore plugins now use a single deleteBlob() method.
- Improved bootstrap logging to explain to the user why bootstrap checks are being enforced.
- Renaming a REST endpoint and deprecating the old endpoint can be done in a single step with registerWithDeprecatedHandler.
- The node.mode and node.local settings for testing have been removed in favour of setting discovery.type and transport.type specifically.
- The Template class has been removed in favour of using the Script class directly with lang mustache. The template query has been moved to the mustache module.
- More aggregations have been migrated to NamedWritable instead of aggregation streams.
- Guice is being removed from Elasticsearch one PR at a time, resulting in a better defined Plugin API.
- Unknown match_mapping_types should be rejected when creating dynamic mappings.
- Creating too many aggregation buckets should trip a circuit breaker.
- The concept of write_consistency is being removed in favour of waiting for a specified number of shards to be available.
- Script access to term statistics was buggy and performed badly. It is being removed in favour of using a custom Similarity.
- The Java REST client will support blocking and async requests.
- Benchmark the new Java REST client vs the TransportClient.
- Indexing requests should not be lost to recovering replicas when the primary shard is relocated.
- Snapshots will use UUIDs instead of names for snapshots and for indices.
- S3 repository plugin will support path style access.
- A CLI tool will allow corrupted translogs to be removed in order to recover data already stored in the index.
- Refactoring variable chains in Painless to make the AST much more natural.
- Nodes will be assigned node names that persist across restarts.
- Netty4 module development is revealing tricky bugs in Netty.
- A new geo field type could combine points and shapes into a single field.
- Macro and micro benchmarks should be run as part of CI.
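To make the bucket circuit breaker item above more concrete, here is a minimal sketch of the general idea: count buckets as they are created and abort the request once a limit is exceeded. This is not Elasticsearch's implementation; the names BucketBreaker, BucketLimitExceeded and terms_buckets, and the limit itself, are hypothetical and chosen only for illustration.

```python
class BucketLimitExceeded(Exception):
    """Raised when a request would create more buckets than allowed."""


class BucketBreaker:
    """Toy circuit breaker (hypothetical): counts buckets, trips past a limit."""

    def __init__(self, max_buckets):
        self.max_buckets = max_buckets
        self.created = 0

    def add_buckets(self, count=1):
        # Check before committing, so a tripped call leaves the count unchanged.
        if self.created + count > self.max_buckets:
            raise BucketLimitExceeded(
                f"would create {self.created + count} buckets, "
                f"limit is {self.max_buckets}"
            )
        self.created += count


def terms_buckets(values, breaker):
    """Group values into one bucket per distinct value, charging the breaker."""
    buckets = {}
    for value in values:
        if value not in buckets:
            breaker.add_buckets(1)  # a new distinct value means a new bucket
            buckets[value] = 0
        buckets[value] += 1
    return buckets
```

The point of tripping early is that the request fails with a clear error as soon as the limit is crossed, rather than continuing to allocate memory for buckets that will never be returned.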
Apache Lucene
- CustomAnalyzer was accidentally switched to use the wrong default attribute factory, but Uwe is fixing it
- Now it's simple to create a Polygon from a GeoJSON string
- Dimensional points now use run-length compression of the leading prefix byte of one of the dimensions, resulting in a nice drop in the index size in the nightly geo benchmarks (annotation R)
- Directory.renameFile was also doing an fsync on every call, which can be very costly if you're renaming many files, which Elasticsearch does when copying a shard
- A problematic assertion inside IndexWriter was obsolete
- We were missing dimensional points in our backwards compatibility test indices!
- Lucene's FSDirectory classes all immediately create the directory, but this is problematic for read-only use cases, so maybe we can make it optional
- Continuing on the modernization of Lucene's scoring, queryNorm has now been removed, and we are considering removing the outgunned ClassicSimilarity entirely
- Fixing TermQuery to delay term lookups until they are truly needed is now simpler and more palatable thanks to recent changes
- Efforts to provide analyzers based on the OpenNLP project are progressing after years of dying on the vine
- MatchNoDocsQuery now includes an optional reason for its creation
- The new Analyzer.normalize method normalizes an input string, obsoleting AnalyzingQueryParser
- FastVectorHighlighter can now handle block join queries, but the test case had one BoostQuery too many
- ant precommit's code style checks now require the Apache license header above the package declaration, with a blank line in between, and IntelliJ's configuration has also been fixed
- The new FilterWeight lets you override specific Weight methods of another delegated Weight instance
- A new RangeField will index a multi-dimensional range, such as a day range in a calendar, using dimensional points
- HunspellStemFilter does not behave like hunspell on the Hungarian text, but this is by design
- A user trying to search Chinese text is struggling with the many changes between Lucene versions 2.3 and 5.0
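The run-length compression of a leading byte mentioned in the list above can be illustrated with a small sketch. It assumes sorted, equal-length byte values and caps each run at 255 so a count would fit in one byte. It is a toy model of the idea, not Lucene's actual on-disk format, and the function names are hypothetical.

```python
def rle_encode_leading_bytes(values):
    """Run-length encode the first byte of each value; keep the rest as-is.

    Assumes `values` is a list of non-empty bytes objects, ideally sorted so
    that equal leading bytes are adjacent.
    """
    runs = []  # list of (leading_byte, run_length) pairs
    for value in values:
        lead = value[0]
        if runs and runs[-1][0] == lead and runs[-1][1] < 255:
            runs[-1] = (lead, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((lead, 1))  # start a new run
    suffixes = [value[1:] for value in values]
    return runs, suffixes


def rle_decode_leading_bytes(runs, suffixes):
    """Rebuild the original values from the runs and the stored suffixes."""
    values = []
    remaining = iter(suffixes)
    for lead, length in runs:
        for _ in range(length):
            values.append(bytes([lead]) + next(remaining))
    return values
```

Because sorted values tend to share their leading byte over long stretches, a handful of (byte, count) pairs can replace one stored byte per value, which is where the space saving comes from.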
Watch This Space
Stay tuned to this blog, where we'll share more on the whole Elastic ecosystem, including news, learning resources and cool use cases!