This Week in Elasticsearch and Apache Lucene - 2016-03-14
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
How does @Zymergen build, test, & analyze DNA mods to microbes at scale? https://t.co/7OUmnHavz8 #Elasticsearch pic.twitter.com/dfPZCPmPwc
— elastic (@elastic) March 8, 2016
Elasticsearch Core
Changes in 2.x:
- The Tribe node now passes an explicit whitelist of settings through to the client nodes which connect to each cluster. Later, plugins will have an extension point for adding plugin-specific settings to the whitelist.
- Any deprecated parameters parsed by ParseFieldMatcher now get deprecation logging for free.
- Trying to close or delete an index while it is being restored will now fail the close/delete request.
- The `lat_lon` and `precision_step` parameters to `geo_point` fields are deprecated as they are no longer configurable with the new geo-point format. The `validate` and `normalize` parameters now have deprecation logging.
- The geo distance and geo range distance queries no longer support the `.geohash` suffix as it is not needed and makes the query ambiguous.
- The `has_child` query now respects the configured similarity.
- Multi-index expressions starting with `*` were ignoring exclude expressions.
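To make the multi-index fix above concrete, here is a minimal sketch of how include and exclude expressions might be resolved against a list of index names. It is not Elasticsearch's resolver: the class `IndexExpressionSketch`, its methods, and the index names are all hypothetical, and it assumes expressions are applied left to right, with `*` as the only wildcard and a leading `-` marking an exclusion (a leading `+` is treated as an explicit include).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical, simplified resolver for illustration only.
// This is NOT the Elasticsearch implementation.
class IndexExpressionSketch {

    // Turn a simple wildcard pattern (only '*' is special) into a regex.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Apply each expression in order: includes add matching indices,
    // excludes (leading '-') remove matching indices added so far.
    static List<String> resolve(List<String> allIndices, String... expressions) {
        Set<String> resolved = new LinkedHashSet<>();
        for (String expression : expressions) {
            boolean exclude = expression.startsWith("-");
            boolean hasPrefix = exclude || expression.startsWith("+");
            Pattern pattern = toRegex(hasPrefix ? expression.substring(1) : expression);
            for (String index : allIndices) {
                if (pattern.matcher(index).matches()) {
                    if (exclude) {
                        resolved.remove(index);
                    } else {
                        resolved.add(index);
                    }
                }
            }
        }
        return new ArrayList<>(resolved);
    }

    public static void main(String[] args) {
        List<String> all = java.util.Arrays.asList("logs-1", "logs-2", "metrics-1");
        // A leading "*" followed by an exclusion: the exclusion must still apply.
        System.out.println(resolve(all, "*", "-logs-2"));
    }
}
```

In this sketch the exclusion is honoured even when the first expression is `*`, which is the behaviour the fix restores.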
Changes in master:
- Index lookups now use the index UUID instead of the index name, and index names are resolved to UUIDs as early as possible.
- `string` fields will be replaced by `text` and `keyword` fields in 5.0, with the following bwc layer:
  - String mappings in old indices will not be upgraded.
  - Text/Keyword mappings can be added to old and new indices.
  - String mappings on new indices will be upgraded automatically to text/keyword mappings, if possible, with deprecation logging.
  - If it is not possible to automatically upgrade, an exception will be thrown.
- Norms can no longer be lazy loaded. Lazy loading is no longer needed because norms are no longer loaded into memory. The `norms` setting now takes a boolean. Index time boosts are no longer stored as norms.
- Command line settings can no longer use the -- style. Instead, they should be specified with a `-E` prefix.
- Trying to close or delete an index while it is being snapshotted will now fail the close/delete request.
- Scripting engines no longer try to compile hidden files in the script directory.
- The `-XX:+AlwaysPreTouch` JVM flag means all memory pages are now committed to memory at startup.
- The deprecated `ignore_unmapped` parameter has been removed from sorting.
- Queries deprecated in 2.0 have now been removed.
- The `multi_field` field datatype, deprecated in 1.0, has been removed.
- The generic thread pool is now bound to 4x the number of processors.
- The `collect_payloads` parameter to `span_near` is deprecated. Payloads are now loaded when needed.
- The cat-recovery API now supports the raw values `bytes_recovered` and `files_recovered`, and the `translog` and `translog_ops` columns have been renamed to be more explicit.
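As a rough illustration of the generic thread pool change above, the sketch below derives a maximum pool size from the processor count. It assumes the bound is simply 4 times the number of available processors; the real implementation may apply additional floors or caps. The class `GenericPoolBoundSketch`, the 30 second keep-alive, and the queue choice are our own assumptions, not Elasticsearch code.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not Elasticsearch's thread pool code.
class GenericPoolBoundSketch {

    // Assumed rule: maximum threads = 4 x number of processors.
    static int genericPoolMax(int processors) {
        if (processors < 1) {
            throw new IllegalArgumentException("processors must be >= 1");
        }
        return Math.multiplyExact(4, processors);
    }

    // Build a pool that grows on demand but never beyond the bound.
    // Keep-alive and queue type are illustrative assumptions.
    static ThreadPoolExecutor newBoundedPool(int processors) {
        return new ThreadPoolExecutor(
                0,                              // no core threads kept alive
                genericPoolMax(processors),     // upper bound on threads
                30L, TimeUnit.SECONDS,          // idle threads are reclaimed
                new SynchronousQueue<>());      // hand tasks directly to threads
    }

    public static void main(String[] args) {
        int processors = Runtime.getRuntime().availableProcessors();
        ThreadPoolExecutor pool = newBoundedPool(processors);
        System.out.println("processors=" + processors
                + " maxThreads=" + pool.getMaximumPoolSize());
        pool.shutdown();
    }
}
```

With a bound in place, a burst of work can no longer create an unlimited number of threads; in this sketch, tasks submitted beyond the bound would be rejected rather than queued.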
Ongoing changes:
- Dynamic field addition now happens at the end of doc parsing, in preparation for supporting dots in field names.
- The search refactoring is nearing its end with only suggesters, sort, and inner hits outstanding.
- The percolator API will be deprecated in favour of a percolator query, which will deliver a number of requested features to the percolator.
- Once "primary terms" have been added to master, we will be able to enable the acked indexing test.
- The reindex API will support throttling.
- Index data folders will be named according to the index UUID, rather than the index name.
- Storing the cluster UUID in index metadata will allow Elasticsearch to no longer import dangling indices which were deleted while a node was disconnected from the cluster.
Apache Lucene
- The new dimensional points feature for 6.0.0 is getting intense pre-release scrutiny, which is uncovering a number of fun bugs and API usability issues, causing us to delay the first 6.0.0 release candidate:
  - `PointRangeQuery`'s `equals` method was broken, returning false when queries were in fact the same
  - Sparse points fields were not always handled correctly on merge; a dedicated sparse points test should help uncover any other sparse points issues
  - The copy constructor for `FieldType` completely ignored points!
  - A newly added `Test2BPointValues`, to ensure you can index more than 2.1 billion points in a single segment, uncovered an int overflow bug after running for 22 hours
  - The default codec's points implementation was missing some `checkIntegrity` calls
  - The legacy `SlowCompositeReaderWrapper`, an awful class that inefficiently tries to pretend you have only one segment in your index, does not support points, and is now moved out of Lucene's core
  - The `SimpleText` codec falsely failed its `CheckIndex` if a points field has zero points
  - The `MIGRATE.txt` and `CHANGES.txt` descriptions of dimensional points are better now
  - The `newSetQuery` API now also conveniently accepts a `Collection` of boxed values in addition to existing varargs of each primitive type
  - `CheckIndex` forgot to tell you it was in fact checking points
  - Dead code is being removed
- Cutover existing users from legacy numeric fields to the new dimensional points:
  - The legacy uninverting `FieldCache` can now un-invert single-valued points fields
  - Both the flexible and XML query parsers now support points
  - `UninvertingReader` still needs to support multi-valued points
  - The `join` module still needs to support points
  - The `spatial-extras` module still needs to switch to points
  - `MemoryIndex` does not yet support points
- Lucene's default codec will now also use prefix compression on fixed-width doc values data (e.g. derived from `InetAddress` or `BigInteger`)
- A new `LatLonPoint.nearest` method finds the nearest indexed point to a query point, something KD trees excel at, but the latest patch is still vulnerable to adversaries
- Spatial3d now exposes only the WGS84 planet model which is the most accurate one it supports
- `OfflineSorter` will be faster in 6.1.0, by reducing unnecessary byte copying
- Group search hits by hamming distance
- All queries should be immutable since they can be enrolled as a cache key
- `ant precommit` now fails on code comparing already identical values, and also on useless assignments
- Join `TopDocs` by docs while keeping the result ranks
- A rare test bug, which only happens when we randomly generate exactly the same bytes already in an index file, is fixed
- `NRTCachingDirectory` optionally logs to `stdout`
- Codec level encryption remains controversial
- Add doc values support to `MemoryIndex`
- 800+ new top-level-domains have been created since we last fixed `StandardTokenizer` to detect them!
- The nightly smoke tester was confused by newly old back compat indices
- This test was too evil, taking more than 2 hours to run with just the right seed
- `PointRangeQuery` now optimizes the likely common case when all documents will match
- A few missing `s's` touched a lot of source files
- Split out the geo3d math-only APIs under a `geom` sub-package
- A troublesome facets test has been removed since it tested floats (and struggled with 1 ulp differences) when in fact facets only supports doubles
- `MemoryIndex` now also accepts `Iterable` over `IndexDocument` instead of a document
- Most `FilterX` classes are now abstract
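The point above about queries needing to be immutable because they can be enrolled as a cache key is easy to demonstrate with a plain `HashMap`. The sketch below is only an illustration: `MutableRangeQuery` and `QueryCacheKeySketch` are hypothetical stand-ins that we made up for this example, not Lucene classes, and a `HashMap` stands in for a query cache.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a query; NOT a Lucene class.
class MutableRangeQuery {
    final String field;
    final int lower;
    int upper; // deliberately mutable, to show the problem

    MutableRangeQuery(String field, int lower, int upper) {
        this.field = field;
        this.lower = lower;
        this.upper = upper;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof MutableRangeQuery)) return false;
        MutableRangeQuery other = (MutableRangeQuery) o;
        return field.equals(other.field) && lower == other.lower && upper == other.upper;
    }

    @Override
    public int hashCode() {
        return Objects.hash(field, lower, upper);
    }
}

class QueryCacheKeySketch {

    // Cache a result, mutate the key, then try to find the entry again,
    // both with the mutated key and with a key equal to the original.
    static boolean foundAfterMutation() {
        Map<MutableRangeQuery, String> cache = new HashMap<>();
        MutableRangeQuery query = new MutableRangeQuery("price", 0, 10);
        cache.put(query, "cached result");
        query.upper = 20; // the stored hash no longer matches the key's state
        return cache.containsKey(query)
                || cache.containsKey(new MutableRangeQuery("price", 0, 10));
    }

    // Same lookup without mutation: an equal key finds the entry.
    static boolean foundWithoutMutation() {
        Map<MutableRangeQuery, String> cache = new HashMap<>();
        cache.put(new MutableRangeQuery("price", 0, 10), "cached result");
        return cache.containsKey(new MutableRangeQuery("price", 0, 10));
    }

    public static void main(String[] args) {
        System.out.println("found without mutation: " + foundWithoutMutation());
        System.out.println("found after mutation:   " + foundAfterMutation());
    }
}
```

Once the key is mutated, the entry is still held by the map but can no longer be looked up, either with the mutated key or with a key equal to the original. Making every field `final` rules this out.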
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!