14 March 2016

This Week in Elasticsearch and Apache Lucene - 2016-03-14

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Top News

How does @Zymergen build, test, & analyze DNA mods to microbes at scale? https://t.co/7OUmnHavz8 #Elasticsearch pic.twitter.com/dfPZCPmPwc
— elastic (@elastic) March 8, 2016

Elasticsearch Core

Changes in 2.x:

The Tribe node now passes an explicit whitelist of settings through to the client nodes which connect to each cluster. Later, plugins will have an extension point for adding plugin-specific settings to the whitelist.
Any deprecated parameters parsed by ParseFieldMatcher now get deprecation logging for free.
Trying to close or delete an index while it is being restored will now fail the close/delete request.
The `lat_lon` and `precision_step` parameters to `geo_point` fields are deprecated as they are no longer configurable with the new geo-point format. The `validate` and `normalize` parameters now have deprecation logging.
The geo distance and geo range distance queries no longer support the `.geohash` suffix as it is not needed and makes the query ambiguous.
The `has_child` query now respects the configured similarity.
Multi-index expressions starting with `*` were ignoring exclude expressions.
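
To make the last fix concrete, here is a minimal sketch of how a multi-index expression with a leading wildcard and an exclusion is expected to resolve. It is an illustration only, not Elasticsearch's resolver: the function name `resolve_indices` and the sample index names are made up, and it assumes the usual convention that a leading `-` excludes and an optional leading `+` includes.

```python
from fnmatch import fnmatchcase


def resolve_indices(expressions, existing):
    """Resolve a list of index expressions against concrete index names.

    Assumption for this sketch: a leading '-' marks an exclusion and an
    optional leading '+' marks an inclusion. Expressions are applied in order.
    """
    selected = []
    for expression in expressions:
        exclude = expression.startswith("-")
        # Drop a single leading '+' or '-' marker, if present.
        pattern = expression[1:] if expression.startswith(("+", "-")) else expression
        matches = [name for name in existing if fnmatchcase(name, pattern)]
        if exclude:
            # Remove anything the exclusion matches from what was selected so far.
            selected = [name for name in selected if name not in matches]
        else:
            # Add new matches while keeping order and avoiding duplicates.
            selected.extend(name for name in matches if name not in selected)
    return selected


indices = ["logs-2015-12", "logs-2016-01", "metrics"]
print(resolve_indices(["*", "-logs-2015*"], indices))
```

With the bug described above, the exclusion would have been ignored and all three indices returned; the expected result keeps only `logs-2016-01` and `metrics`.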

Changes in master:

Index lookups now use the index UUID instead of the index name, and index names are resolved to UUIDs as early as possible.
`string` fields will be replaced by `text` and `keyword` fields in 5.0, with the following bwc layer:
- String mappings in old indices will not be upgraded.
- Text/Keyword mappings can be added to old and new indices.
- String mappings on new indices will be upgraded automatically to text/keyword mappings, if possible, with deprecation logging.
- If it is not possible to automatically upgrade, an exception will be thrown.
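
The four rules of the bwc layer can be read as a small decision procedure. The sketch below is a rough model of it, not the actual Elasticsearch code. Several things in it are assumptions: that an analyzed `string` becomes `text` and a `not_analyzed` one becomes `keyword`, that "possible" means the mapping uses only parameters from the hypothetical `SUPPORTED_PARAMS` set, and all function and class names.

```python
import logging

deprecation_logger = logging.getLogger("deprecation")

# Hypothetical: parameters this sketch assumes can be carried over unchanged.
SUPPORTED_PARAMS = {"type", "index", "store", "boost", "fields", "copy_to"}


class MappingUpgradeError(ValueError):
    """Raised when a string mapping cannot be upgraded automatically."""


def upgrade_string_mapping(mapping, created_on_5x):
    """Return the mapping an index would end up with under the bwc rules."""
    if mapping.get("type") != "string":
        return mapping  # text/keyword mappings are accepted on old and new indices
    if not created_on_5x:
        return mapping  # string mappings in old indices are not upgraded

    unsupported = set(mapping) - SUPPORTED_PARAMS
    if unsupported:
        raise MappingUpgradeError(
            "cannot upgrade string mapping, unsupported parameters: %s"
            % sorted(unsupported)
        )

    upgraded = dict(mapping)
    index_option = upgraded.pop("index", "analyzed")
    if index_option == "analyzed":
        upgraded["type"] = "text"       # assumption: analyzed strings map to text
    elif index_option == "not_analyzed":
        upgraded["type"] = "keyword"    # assumption: exact-value strings map to keyword
    else:
        raise MappingUpgradeError("cannot upgrade index option %r" % index_option)

    deprecation_logger.warning(
        "string mapping upgraded automatically to %s", upgraded["type"]
    )
    return upgraded


print(upgrade_string_mapping({"type": "string", "index": "not_analyzed"}, True))
```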
Norms can no longer be lazy loaded. This is no longer needed as they are no longer loaded into memory. The `norms` setting now takes a boolean. Index time boosts are no longer stored as norms.
Command line settings can no longer use the -- style. Instead, they should be specified with a `-E` prefix.
Trying to close or delete an index while it is being snapshotted will now fail the close/delete request.
Scripting engines no longer try to compile hidden files in the script directory.
The `-XX:+AlwaysPreTouch` flag means all memory pages are now committed to memory at startup.
The deprecated `ignore_unmapped` parameter has been removed from sorting.
Queries deprecated in 2.0 have now been removed.
The `multi_field` field datatype, deprecated in 1.0, has been removed.
The generic thread pool is now bound to 4x the number of processors.
The `collect_payloads` parameter to `span_near` is deprecated. Payloads are now loaded when needed.
The cat-recovery API now supports the raw values `bytes_recovered` and `files_recovered`, and the `translog` and `translog_ops` columns have been renamed to be more explicit.

Ongoing changes:

Dynamic field addition now happens at the end of doc parsing, in preparation for supporting dots in field names.
The search refactoring is nearing its end with only suggesters, sort, and inner hits outstanding.
The percolator API will be deprecated in favour of a percolator query, which will deliver a number of requested features to the percolator.
Once "primary terms" have been added to master, we will be able to enable the acked indexing test.
The reindex API will support throttling.
Index data folders will be named according to the index UUID, rather than the index name.
Storing the cluster UUID in index metadata will allow Elasticsearch to no longer import dangling indices which were deleted while a node was disconnected from the cluster.

Apache Lucene

The new dimensional points feature for 6.0.0 is getting intense pre-release scrutiny, which is uncovering a number of fun bugs and API usability issues, causing us to delay the first 6.0.0 release candidate:
- PointRangeQuery's equals method was broken, returning false when queries were in fact the same
- Sparse points fields were not always handled correctly on merge; a dedicated sparse points test should help uncover any other sparse points issues
- The copy constructor for FieldType completely ignored points!
- A newly added Test2BPointValues, to ensure you can index more than 2.1 billion points in a single segment, uncovered an int overflow bug after running for 22 hours
- The default codec's points implementation was missing some checkIntegrity calls
- The legacy SlowCompositeReaderWrapper, an awful class that inefficiently tries to pretend you have only one segment in your index, does not support points, and is now moved out of Lucene's core
- The SimpleText codec falsely failed its CheckIndex if a points field has zero points
- The MIGRATE.txt and CHANGES.txt descriptions of dimensional points are better now
- The newSetQuery API now also conveniently accepts a Collection of boxed values in addition to existing varargs of each primitive type
- CheckIndex forgot to tell you it was in fact checking points
- Dead code is being removed
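
Why does it take a test with more than 2.1 billion points to hit the overflow mentioned in the list above? A Java `int` is a signed 32-bit value that tops out at 2,147,483,647, so a counter that passes that limit wraps around to a negative number. The sketch below simulates that wrap-around in Python, whose own integers never overflow. It is a generic illustration, not the actual Lucene bug, and the helper names are made up.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647, the largest value a Java int can hold


def as_int32(value):
    """Wrap an integer the way a signed 32-bit (Java int) variable would."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


def count_points_int32(batch_sizes):
    """Sum batch sizes in a simulated 32-bit counter."""
    total = 0
    for size in batch_sizes:
        total = as_int32(total + size)
    return total


# 2.2 billion points counted in a 32-bit counter come out negative.
print(count_points_int32([2_000_000_000, 200_000_000]))
```

Keeping such counters in a 64-bit `long` avoids the wrap-around.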
Cutting over existing users from legacy numeric fields to the new dimensional points:
- The legacy uninverting FieldCache can now un-invert single-valued points fields
- Both the flexible and XML query parsers now support points
- UninvertingReader still needs to support multi-valued points
- The join module still needs to support points
- The spatial-extras module still needs to switch to points
- MemoryIndex does not yet support points
Lucene's default codec will now also use prefix compression on fixed-width doc values data (e.g. derived from InetAddress or BigInteger)
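
For readers unfamiliar with prefix compression, here is a minimal sketch of the general idea (often called front coding): in a sorted list of byte strings, each entry stores only the number of leading bytes it shares with the previous entry plus the remaining suffix. This is not Lucene's on-disk format, and the function names are made up. The sample values are 4-byte IPv4 addresses, chosen here only as a convenient fixed-width example.

```python
import ipaddress


def prefix_compress(sorted_values):
    """Encode sorted byte strings as (shared_prefix_length, suffix) pairs."""
    encoded = []
    previous = b""
    for value in sorted_values:
        shared = 0
        limit = min(len(previous), len(value))
        while shared < limit and previous[shared] == value[shared]:
            shared += 1
        encoded.append((shared, value[shared:]))
        previous = value
    return encoded


def prefix_decompress(encoded):
    """Rebuild the original byte strings from (shared_prefix_length, suffix) pairs."""
    values = []
    previous = b""
    for shared, suffix in encoded:
        value = previous[:shared] + suffix
        values.append(value)
        previous = value
    return values


addresses = sorted(
    ipaddress.ip_address(text).packed for text in ["10.0.0.1", "10.0.0.2", "10.0.1.7"]
)
encoded = prefix_compress(addresses)
print(encoded)
print(prefix_decompress(encoded) == addresses)
```

Fixed-width values that sort close together tend to share long prefixes, which is where the savings come from.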
A new LatLonPoint.nearest method finds the nearest indexed point to a query point, something KD trees excel at, but the latest patch is still vulnerable to adversaries
Spatial3d now exposes only the WGS84 planet model which is the most accurate one it supports
OfflineSorter will be faster in 6.1.0, by reducing unnecessary byte copying
Group search hits by hamming distance
All queries should be immutable since they can be used as cache keys
ant precommit now fails on code comparing already identical values, and also on useless assignments
Join TopDocs by docs while keeping the result ranks
A rare test bug, which only happens when we randomly generate exactly the same bytes already in an index file, is fixed
NRTCachingDirectory optionally logs to stdout
Codec level encryption remains controversial
Add doc values support to MemoryIndex
800+ new top-level-domains have been created since we last fixed StandardTokenizer to detect them!
The nightly smoke tester was confused by newly old back compat indices
This test was too evil, taking more than 2 hours to run with just the right seed
PointRangeQuery now optimizes the likely common case when all documents will match
A few missing s's touched a lot of source files
Split out the geo3d math-only APIs under a geom sub-package
A troublesome facets test has been removed since it tested floats (and struggled with 1 ulp differences) when in fact facets only supports doubles
MemoryIndex now also accepts Iterable over IndexDocument instead of a document
Most FilterX classes are now abstract
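
Two of the items above, the broken PointRangeQuery equals method and the note that queries should be immutable because they can serve as cache keys, come down to the same requirement: a cache keyed by query objects only works if equal queries compare equal, hash alike, and cannot change after insertion. The sketch below shows that in Python with a frozen dataclass. It is an analogy rather than Lucene code, and `RangeQuery` and `QueryCache` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeQuery:
    """An immutable range query; equality and hashing derive from all fields."""
    field: str
    lower: tuple
    upper: tuple


class QueryCache:
    """A tiny result cache keyed by query objects."""

    def __init__(self):
        self._results = {}
        self.misses = 0

    def get(self, query, compute):
        if query not in self._results:
            self.misses += 1
            self._results[query] = compute(query)
        return self._results[query]


cache = QueryCache()
first = RangeQuery("location", (0, 0), (10, 10))
second = RangeQuery("location", (0, 0), (10, 10))  # same query, separate object

cache.get(first, lambda q: "expensive result")
cache.get(second, lambda q: "expensive result")
print(cache.misses)
```

If equality wrongly returned false for identical queries, every lookup would miss; if a query could be mutated after being cached, its stored hash would no longer match its contents.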

Watch This Space

Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!
