This Week in Elasticsearch and Apache Lucene - 2016-03-07
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
Wondering why queries don't always work? @gmoskovicz dives into the details of phrase-matching in #Elasticsearch: https://t.co/b5N96J246P
— elastic (@elastic) March 7, 2016
Elasticsearch Core
Changes in 2.x:
- Debian's init script was not waiting for the pidfile.
- GCE Discovery plugin was missing permissions and tests.
- Index deletions missed by disconnected nodes will no longer be re-imported when the node rejoins.
- Terms queries are now considered costly, which means they will be cached more eagerly.
- A has_parent query on non-parent types no longer causes an NPE.
- Update mapping should update the metadata for all affected types.
- Sped up the shard allocator when using include/exclude shard allocation rules.
- Fixed a bug with empty buckets in the stats aggregator.
- Azure Storage client upgraded to 4.0.0.
- Deprecation logs added for:
  - Use of old script/template syntax
  - Use of multicast-plugin
  - Use of _source transform mapping
  - Use of deprecated queries
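
To make the "costly queries are cached more eagerly" item concrete, here is a minimal Python sketch of a usage-tracking caching policy. It is not the Elasticsearch or Lucene implementation: the class name, the query keys and both thresholds are hypothetical and chosen only to illustrate the idea that a query type flagged as costly needs fewer repeat uses before it is worth caching.

```python
from collections import Counter

# Hypothetical thresholds for illustration only; the real values live in
# the query caching policy and may differ.
MIN_USES_COSTLY = 2
MIN_USES_DEFAULT = 5


class UsageTrackingPolicy:
    """Decide whether a query should be cached based on how often it was seen."""

    def __init__(self, costly_types):
        self.costly_types = set(costly_types)
        self.seen = Counter()

    def on_use(self, query_type, query_key):
        # Record one more use of this exact query.
        self.seen[(query_type, query_key)] += 1

    def should_cache(self, query_type, query_key):
        # Costly query types reach the caching threshold sooner.
        if query_type in self.costly_types:
            threshold = MIN_USES_COSTLY
        else:
            threshold = MIN_USES_DEFAULT
        return self.seen[(query_type, query_key)] >= threshold


policy = UsageTrackingPolicy(costly_types={"terms"})
for _ in range(2):
    policy.on_use("terms", "user_id in [1, 2, 3]")
    policy.on_use("term", "status:active")

terms_cached = policy.should_cache("terms", "user_id in [1, 2, 3]")
term_cached = policy.should_cache("term", "status:active")
```

After two uses each, the terms query qualifies for caching under these assumed thresholds while the plain term query does not.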
Changes in master:
- Upgrade to Lucene 6 snapshot.
- Reindex API has landed, and now supports ingest pipelines.
- The index stats API now supports the include_segment_file_sizes parameter to report on how much disk space is used by each Lucene file.
- Ingest nodes and available processors are reported by nodes info and by the _cat/nodes API, and the ingest_took time is available in bulk requests.
- Ingest metadata now uses cluster state diffs for lighter weight updates.
- Usage of guice has been reduced by removing DiscoveryService.
- Cygwin is not tested and not supported, so a cygwin block in bin/elasticsearch has been removed.
- Replacing string fields with text/keyword fields:
  - Doc values no longer controlled by fielddata parameter
  - String fields are deprecated in favour of text/keyword fields
- The mapper attachment plugin has been deprecated in favour of the ingest attachment plugin.
- Client nodes are no longer special, and will be connected to (and report stats) like all other nodes.
- Bootstrap checks are now in their own class and are enforced if networking is configured. The check for file handles is lower on OS X because of the difficulty of setting it and the low likelihood of using OS X in production. Checks have been added for max processes and for whether mlockall succeeded.
- The _optimize end point has been removed in favour of _forcemerge.
- Index-time field boosting is now applied as a query time boost, and payloads for per-field boosts in the _all field now use 1 byte instead of 4.
- The shard writeLockTimeout is no longer required.
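
As a rough illustration of three of the endpoints mentioned above (reindex with an ingest pipeline, index stats with include_segment_file_sizes, and _forcemerge), the Python sketch below only builds the method, path and body for each call and sends nothing over the network. The helper names, index names and pipeline name are hypothetical, and since master is still moving the request shapes may change before release.

```python
import json
from urllib.parse import urlencode


def reindex_request(source_index, dest_index, pipeline=None):
    """Build a _reindex call; `pipeline` names an ingest pipeline for the destination."""
    dest = {"index": dest_index}
    if pipeline is not None:
        dest["pipeline"] = pipeline
    body = {"source": {"index": source_index}, "dest": dest}
    return "POST", "/_reindex", json.dumps(body, sort_keys=True)


def segment_file_sizes_request(index):
    """Build an index stats call that asks for per-file disk usage."""
    query = urlencode({"include_segment_file_sizes": "true"})
    return "GET", f"/{index}/_stats?{query}", None


def force_merge_request(index):
    """Build a _forcemerge call, the replacement for the removed _optimize."""
    return "POST", f"/{index}/_forcemerge", None


# Hypothetical index and pipeline names.
for method, path, body in (
    reindex_request("logs-v1", "logs-v2", pipeline="cleanup"),
    segment_file_sizes_request("logs-v2"),
    force_merge_request("logs-v2"),
):
    print(method, path, body or "")
```

Each tuple can be handed to whatever HTTP client you already use.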
Ongoing:
- Rewrite range queries to match_all/match_none where the range covers all or none of the docs in a shard for better result caching.
- You shouldn't be able to delete or close an index while it is being restored.
- Keyword fields should support limited analysis.
- Removing node.client setting in favour of setting other node roles to false.
- Add ingest stats to node stats API.
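
The range query rewrite in the first item above can be sketched in a few lines of Python. This is a simplified model rather than the actual change: it assumes inclusive bounds, that the shard's minimum and maximum values for the field are known, and that every document in the shard has a value for that field (otherwise rewriting to match_all would be wrong). The function name and return strings are hypothetical.

```python
def rewrite_range(query_min, query_max, shard_min, shard_max):
    """Return 'match_none', 'match_all' or 'range' for an inclusive range query.

    shard_min/shard_max are the smallest and largest field values in the
    shard, or None when the shard holds no values for the field.
    """
    if shard_min is None or shard_max is None:
        # Nothing indexed for this field, so nothing can match.
        return "match_none"
    if query_max < shard_min or query_min > shard_max:
        # The query range and the shard's values are disjoint.
        return "match_none"
    if query_min <= shard_min and shard_max <= query_max:
        # The query range covers every value in the shard.
        return "match_all"
    # Partial overlap: the real range query has to run.
    return "range"


# A "last 30 days" style query against three shards of a time-based index,
# with timestamps simplified to day numbers.
print(rewrite_range(100, 130, 10, 40))    # old shard, disjoint
print(rewrite_range(100, 130, 105, 125))  # shard fully inside the range
print(rewrite_range(100, 130, 90, 110))   # shard straddles the lower bound
```

The caching benefit comes from the rewritten form: a match_all or match_none result no longer depends on the exact bounds, so requests with slightly different ranges can share a cache entry for that shard.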
Apache Lucene
- Both 6.x and 6.0.x branches are now cut, requiring fun changes to switch Lucene's master branch to a 7.x world, including the rare but exciting time when TestBackwardsCompatibility has no indices to test!
- Point values finally support earth surface distance queries, with a delightfully simple and accurate implementation, allowing for exact accuracy for testing (no fuzz!) besides quantization error. It also has performance on par with GeoPointsDistanceQuery, despite potent possible future optimizations if we can make the 2D geo math more accurate.
- Merging 2D point values across segments is suddenly 21% faster
- MultiPhraseQuery is now immutable
- Clean up the overlapping methods in NumericUtils vs LegacyNumericUtils, and bring back lost test cases
- Add missing getters for various queries
- Optimize point range queries that match all documents, likely a common case in time-based indices
- Point values now expose size and docCount statistics per field, for example letting us compute whether a point field is multi-valued, in addition to the existing per-dimension global min and max values
- The spatial4j dependency for the spatial-extras module is now upgraded to version 0.6
- The semantics of point values intersect API is now sharper: in the 1D case, all points are visited in order
- The new (in 6.0) point queries get a simpler API
- Duplicate code from NumericUtils is removed
- Uwe tweaks TSTLookup to dodge an old javac compiler bug
- The useful checkReader, used in many Lucene tests, was failing to check points
- LatLonPoint API becomes simpler
- Sometimes randomized tests are a bit too evil
- RandomCodec now also randomizes the points format
- The legacy spatial code, with optional external spatial4j dependency, has moved to a new spatial-extras module, but required some javadocs hacks since the same package name appears in two modules now
- Improve randomized testing for the new point distance query
- Don't try to estimate match count while collecting: it's inaccurate in multi-valued cases, and doesn't seem to help performance
- The sometimes costly TermsQuery and point queries are now cached more aggressively
- Make it easier to understand why your environment prevents Lucene's MMapDirectory unmap hack from working
- Lucene now always sorts in unicode order, allowing us to consolidate and remove some of the numerous BytesRef comparator APIs
- The debate rages on about how to refactor the spatial3d module
- Global ordinals query time join does not explain itself very well
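
For readers wondering what an earth surface distance check looks like, here is a minimal Python sketch using the haversine formula on a spherical earth. It is an illustration of the general technique, not Lucene's code: the function names are hypothetical, the mean radius constant is an assumed value, and the quantization of stored point values mentioned above is ignored.

```python
import math

# Assumed mean earth radius in meters; the constant used by any given
# library may differ slightly.
EARTH_MEAN_RADIUS_METERS = 6_371_008.7714


def haversine_meters(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    # Clamp to 1.0 so rounding error cannot push asin out of its domain.
    return 2 * EARTH_MEAN_RADIUS_METERS * math.asin(min(1.0, math.sqrt(a)))


def within_distance(center_lat, center_lon, radius_meters, lat, lon):
    """True when (lat, lon) lies within radius_meters of the center point."""
    return haversine_meters(center_lat, center_lon, lat, lon) <= radius_meters


# Points half a world apart on the equator are half a circumference away.
half_circumference = haversine_meters(0.0, 0.0, 0.0, 180.0)
```

A real distance query would first use the index structure to discard whole cells that cannot fall inside the circle, and only run a per-point check like this on the remaining candidates.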
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!