February 1, 2016

This Week in Elasticsearch and Apache Lucene - Cluster Cloning in Hosted Elasticsearch

Clinton Gormley Shaunak Kashyap Michael McCandless

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Top News

Just how easy is it clone your cluster in our hosted #Elasticsearch service? (Hint: very) https://t.co/I3ySOOFz04 pic.twitter.com/M9HoK1KTgw
— elastic (@elastic) January 29, 2016

Elasticsearch Core

Changes in 2.2:

Closures are once again allowed in the Groovy scripting plugin, and a PR has been submitted to Groovy to remove the need for the suppressAccessChecks permission.
Translog recovery is fast again.
Command line options now work correctly on Windows.
The Tribe node wasn't passing on a custom --path.conf to its node clients, which resulted in security exceptions.
Query and top-level inner hits results shouldn't overwrite each other.
Geo-shapes didn't work in the percolator if map_unmapped_fields_as_string was enabled.

Changes in 2.x:

Disabled fielddata loading was silently ignored on empty indices.
Throw an exception if the Lucene version is not the expected one.
Include the exception name for not serializable exceptions.

Changes in master:

Shard failure requests for no longer existing shards should always be considered successful.
The term-level fuzzy query is deprecated in favour of the match query with the fuzziness parameter.
Boolean settings in mappings are now strict.
The "index" mapping param now accepts only true/false.
Setting index: false will no longer disable doc values as well.
Doc values are controlled by the doc_values setting only, not by fielddata_format.
Many more global settings have been migrated to the new settings infrastructure.
The new scripting language is called Painless.
Deep pagination is now possible with the search_after parameter.
The ingest node should make deep copies of data structures.
Disabled the ability to fsync on every operation (instead of every request) and only schedule fsync if really needed.
Shards are marked as active during recovery, to ensure the indexing buffer is big enough.
TermVector APIs no longer update mappings.
Tracking of parent tasks now includes master node, replication, and broadcast actions.
Load average info has been normalised across different OSes.
Improve exceptions from ingest pipelines.
Ensure all resources are closed when closing a node.
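
As a rough illustration of the search_after item above, the sketch below builds request bodies in Python. It assumes that each hit in a sorted search response carries a "sort" array, and that passing the last hit's array as search_after asks for the hits that follow it. The field names timestamp and id, and the helper names first_page and next_page, are hypothetical and not taken from the Elasticsearch code base.

```python
import copy


def first_page(query, sort, size=100):
    """Build the request body for the first page of a sorted search."""
    return {"size": size, "query": query, "sort": sort}


def next_page(previous_body, previous_hits):
    """Build the body for the following page, or None when there are no more hits.

    previous_hits is the list of hits returned for previous_body; each hit is
    assumed to carry a "sort" array with its sort values.
    """
    if not previous_hits:
        return None
    body = copy.deepcopy(previous_body)
    # Continue from the sort values of the last hit on the previous page.
    body["search_after"] = list(previous_hits[-1]["sort"])
    return body


# Hypothetical usage: sort by a timestamp field with an id field as tiebreaker.
body = first_page({"match_all": {}}, [{"timestamp": "asc"}, {"id": "asc"}], size=2)
```

Because each page is addressed by the sort values of the last hit rather than by an offset, a unique tiebreaker in the sort (the hypothetical id field here) should keep pages from skipping or repeating documents.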

Ongoing:

Work has started on using UUIDs to identify indices on the file system, rather than relying just on index names.
The reindex API is starting to use the task management framework.
Search refactoring continues with the suggesters and sorting. A design bug in aggs refactoring (creating one instance per node instead of per shard) will require quite a big change.
Updating mappings with update_all_types isn't working correctly.

Apache Lucene

We plan to do a 5.5.0 release soon to get all backported 5.x features out to the world and to debug the release process with Git, and then get the 6.0.0 release started
Java 9 changes the API for un-mapping previously memory mapped pages, but users still risk a SIGSEGV when they try to use an IndexReader after it's closed
The switch from Subversion to Git has a long, wiggling tail: lots of build fixes, including detecting a change of Git branch and forcing a clean build to prevent scary-looking false test failures; our developer resources page now reflects the switch; we share notes on which long series of Git commands seem to work; we fixed our shadow Maven build to also switch from Subversion to Git; and we discuss the joy of merge bubbles
800+ new top-level-domains have been created since we last fixed StandardTokenizer to detect them!
Improved test coverage for the new point values (coming soon in Lucene 6.0.0) has uncovered missing heroics in its exception handling
The new postings-based geo queries are ready to graduate out of the sandbox module, which provides no backwards compatibility
The new divergence from independence similarity continues to wreak havoc on tests
Add a more accurate "does polygon intersect rectangle" method to fix recent test failures uncovered by randomized geo tests
Improve geo tests to confirm that the quantized encoding is stable and its error falls within the claimed tolerance
MemoryIndex now has sugar methods to directly create a MemoryIndex from a document or fields
All TermsQuery constructors are now efficient, avoiding creating lots of temporary Term objects
IndexableField.tokenStream no longer throws IOException
Fix the build again to detect if running tests incorrectly results in source code changes!
Scary-looking test failures turned out to just be consumers abusing IndexInput by sharing an instance across threads without cloning first
LuceneTestCase now uses standardized language tags to represent the randomized Locale
Some nice performance gains are coming to geo point queries by customizing how terms are created from the geohashes
The points-based and postings-based geo implementations use different encodings with different quantization errors
The "exotic" rectangles selected by point values (BKD tree) still cause problems for the lat/lon 2D geo APIs
The complex WordDelimiterFilter sometimes produces incorrect tokens
Should we enable storing a Lucene index in MongoDB?
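
The geo test item above checks two properties of a quantized encoding: that it is stable, and that its error stays within a claimed tolerance. The Python sketch below shows one way such a check can look. It uses a hypothetical 32-bit fixed-point latitude scheme chosen for illustration; it is not Lucene's actual encoding, and the names encode_lat, decode_lat and check_round_trip are assumptions.

```python
import math

LAT_BITS = 32
LAT_STEP = 180.0 / (1 << LAT_BITS)        # degrees covered by one bucket
MAX_ENCODED = (1 << (LAT_BITS - 1)) - 1   # largest signed 32-bit value


def encode_lat(lat):
    """Quantize a latitude in [-90, 90] to a signed 32-bit bucket number."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError("latitude out of range: %r" % lat)
    # Clamp so that exactly 90.0 falls into the last bucket.
    return min(math.floor(lat / LAT_STEP), MAX_ENCODED)


def decode_lat(encoded):
    """Return the midpoint of the bucket identified by encoded."""
    return (encoded + 0.5) * LAT_STEP


def check_round_trip(lat, tolerance=LAT_STEP):
    """True if the encoding is stable for lat and the error is within tolerance."""
    encoded = encode_lat(lat)
    decoded = decode_lat(encoded)
    stable = encode_lat(decoded) == encoded        # re-encoding must not drift
    within = abs(lat - decoded) <= tolerance       # quantization error bound
    return stable and within
```

A randomized test in this style would call check_round_trip on many random latitudes plus the edge values -90, 0 and 90. Decoding to the bucket midpoint rather than its edge is a design choice in this sketch that keeps re-encoding away from bucket boundaries.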

Watch This Space

Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!