This Week in Elasticsearch and Apache Lucene - 2016-07-18
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
I just published “How we reindexed 36 billions documents in 5 days within the same Elasticsearch cluster” https://t.co/7W20zpBHnq
— Fred de Villamil ✌︎ (@fdevillamil) July 6, 2016
Elasticsearch Core
Changes in 2.x:
- The fast vector highlighter was unable to extract highlight terms from nested queries.
- Netty3 is now a module, allowing us to add a Netty4 module which can also be shipped with Elasticsearch. During testing, Elasticsearch uses a blocking socket-based MockTcpTransport instead of relying on netty directly.
- A create-index request (and similar) will now wait until an index is writable before returning. No more waiting for yellow.
- Creating a new index will now turn the cluster health yellow instead of red (unless something goes wrong), allowing sysadmins to sleep peacefully.
- Scroll requests have a safeguard on the maximum size, similar to the safeguard that already exists for search.
- The cluster allocation explain API can now report on disk usage.
- The Java REST client supports swapping out the internal HTTP client in order to add async support.
- Plugins extending search will need to implement SearchPlugin instead of onModule(SearchModule).
- Jars for plugins and modules which can be used with the TransportClient will be published to Maven.
- The foreach ingest processor now wraps a single processor only, instead of an array of processors. This makes the UI and error reporting simpler.
- Intentionally ignored errors in ingest pipelines are reported in the verbose simulate method for easier debugging.
- The cluster-health API can now wait_for_events to ensure that (e.g.) settings updates have been propagated before proceeding.
- NotMasterException is important enough to be its own exception.
- The profile API was incorrectly counting children profile timings twice. Also, it now reports the number of times important methods are called.
- Settings may no longer be specified using the Java properties syntax.
- InnerHits should use rewritten queries to avoid later exceptions.
- The old script and template syntax deprecated in 1.x has been removed.
- All snapshot restore plugins now use a single deleteBlob() method.
- Improved bootstrap logging to explain to the user why bootstrap checks are being enforced.
- Renaming a REST endpoint and deprecating the old endpoint can be done in a single step with registerWithDeprecatedHandler.
- node.local settings for testing have been removed in favour of setting
- The Template class has been removed in favour of using the Script class directly with lang
- The template query has been moved to the mustache module.
- More aggregations have been migrated to NamedWritable instead of aggregation streams.
- Guice is being removed from Elasticsearch one PR at a time, resulting in a better defined Plugin API.
- match_mapping_types should be rejected when creating dynamic mappings.
- Creating too many aggregation buckets should trip a circuit breaker.
- The concept of write_consistency is being removed in favour of waiting for a specified number of shards to be available.
- Script access to term statistics was buggy and performed badly. It is being removed in favour of using a custom Similarity.
- The Java REST client will support blocking and async requests.
- Benchmark the new Java REST client vs the TransportClient.
- Indexing requests should not be lost to recovering replicas when the primary shard is relocated.
- Snapshots will use UUIDs instead of names for snapshots and for indices.
- S3 repository plugin will support path style access.
- A CLI tool will allow corrupted translogs to be removed in order to recover data already stored in the index.
- Refactoring variable chains in Painless to make the AST much more natural.
- Nodes will be assigned node names that persist across restarts.
- Netty4 module development is revealing tricky bugs in Netty.
- A new geo field type could combine points and shapes into a single field.
- Macro and micro benchmarks should be run as part of CI.
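
The wait_for_events option on the cluster-health API mentioned in the list above takes a task priority and returns once all pending cluster events at that priority have been processed. As a rough sketch, the request URL could be built like this. The helper name, the default host and the default timeout are assumptions made for illustration, and the code only constructs the URL without contacting a cluster:

```python
from urllib.parse import urlencode, urlunsplit

# Priority levels accepted by wait_for_events, from most to least urgent.
PRIORITIES = ("immediate", "urgent", "high", "normal", "low", "languid")


def cluster_health_url(host="localhost:9200", wait_for_events="languid", timeout="30s"):
    """Build a cluster-health URL that waits for queued cluster events.

    Hypothetical helper: host and timeout defaults are illustrative only.
    """
    if wait_for_events not in PRIORITIES:
        raise ValueError("unknown priority: %r" % (wait_for_events,))
    query = urlencode({"wait_for_events": wait_for_events, "timeout": timeout})
    return urlunsplit(("http", host, "/_cluster/health", query, ""))


if __name__ == "__main__":
    # Waiting for the lowest priority means every queued event has been handled.
    print(cluster_health_url())
```

Issuing a GET against that URL after a settings update is one way to block until the update has been applied before proceeding.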
Apache Lucene
- CustomAnalyzer was accidentally switched to use the wrong default attribute factory, but Uwe is fixing it
- Now it's simple to create a Polygon from a GeoJSON string
- Dimensional points now use run-length compression of the leading prefix byte of one of the dimensions, resulting in a nice drop in the index size in the nightly geo benchmarks (annotation R)
- Directory.renameFile was also doing an fsync on every call, which can be very costly if you're renaming many files, which Elasticsearch does when copying a shard
- A problematic assertion inside
- We were missing dimensional points in our backwards compatibility test indices!
- FSDirectory classes all immediately create the directory, but this is problematic for read-only use cases, so maybe we can make it optional
- Continuing on the modernization of Lucene's scoring, queryNorm has now been removed, and we are considering removing the outgunned
- Changing TermQuery to delay term lookups until they are truly needed is now simpler and more palatable thanks to recent changes
- Efforts to provide analyzers based on the OpenNLP project are progressing after years of dying on the vine
- MatchNoDocsQuery now includes an optional reason for its creation
- The new Analyzer.normalize method normalizes an input string, obsoleting
- FastVectorHighlighter can now handle block join queries, but the test case had one
- precommit's code style checks now require the Apache license header above the package declaration, with a blank line in between, and IntelliJ's configuration has also been fixed
- The new FilterWeight lets you override specific Weight methods of another delegated
- A new RangeField will index a multi-dimensional range, such as a day range in a calendar, using dimensional points
- HunspellStemFilter does not behave like hunspell on Hungarian text, but this is by design
- A user trying to search Chinese text is struggling with the many changes between Lucene versions 2.3 and 5.0
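
To give a feel for why run-length compression of a leading byte shrinks the index for dimensional points, here is a toy sketch of the general technique. It is not Lucene's actual on-disk format: the function names are hypothetical, and it assumes the values in a block are fixed-width byte strings that are already sorted, so that equal leading bytes sit next to each other.

```python
from itertools import groupby


def rle_leading_byte(values):
    """Split byte strings into (leading-byte runs, remaining suffixes).

    Illustrative only: assumes non-empty, fixed-width, sorted byte strings.
    """
    # Indexing a bytes object yields an int, so runs are (byte value, count) pairs.
    runs = [(byte, len(list(group))) for byte, group in groupby(v[0] for v in values)]
    suffixes = [v[1:] for v in values]
    return runs, suffixes


def decode(runs, suffixes):
    """Rebuild the original byte strings from runs and suffixes."""
    remaining = iter(suffixes)
    out = []
    for byte, count in runs:
        for _ in range(count):
            out.append(bytes([byte]) + next(remaining))
    return out


if __name__ == "__main__":
    block = [bytes([0, 1, 5]), bytes([0, 1, 9]), bytes([0, 2, 0]), bytes([1, 0, 0])]
    runs, suffixes = rle_leading_byte(block)
    print(runs)  # one (byte, count) pair per run instead of one byte per value
    assert decode(runs, suffixes) == block
```

When many neighbouring values share the same leading byte, which is common for sorted geo points, the run list is much shorter than storing that byte once per value.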
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!