This Week in Elasticsearch and Apache Lucene - 2016-04-11
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
The alpha-1 of the #ElasticStack is here! If you’re exploring it, you’re an Elastic Pioneer — & we’ve got good news:
— elastic (@elastic) April 7, 2016
Changes in 2.x:
- Extended Stats could return the wrong result when some indices are missing a field.
- Adding an object field with the same name as an existing field should fail.
- Shadow replicas should be considered as having size zero.
- CORS was broken for preflight requests.
- Windows users can configure the Windows service name, description, and user.
- Network addresses are now consistently displayed as the ip:port, instead of the hostname.
Changes in master:
- Network partitions will no longer cause loss of in-flight documents, and we have the test to prove it.
- The `match`, `match_phrase`, and `match_phrase_prefix` queries are now separate queries, not just types of the `match` query.
- The task manager response now tells you which tasks can be cancelled, and supports a `_cat/tasks` API.
- Elasticsearch will no longer accept unquoted field names in JSON.
- Elasticsearch now uses mmapfs for Lucene directories instead of a hybrid of niofs/mmapfs.
- ParseField is now used to parse query names, which comes with deprecation logging for free.
- Geo-points support ignore_malformed correctly again.
- Moving averages threw an NPE when no window was specified.
- MappedFieldType should be responsible for knowing which formatter applies, rather than the agg framework.
- The allocation-explain API now includes the configured allocation_delay and remaining_delays times.
- Hot threads now fail hard if the JVM doesn't support them.
- Queries now have a registry, and queries have gradually been migrated to use it.
- Bulk request sizes will be subject to a circuit breaker.
- Deleted index tombstones are complicated.
- ObjectParser should allow constructor args.
- Should we enable http compression by default?
- Numeric and date fields in 5.0 should use the new Lucene points API.
- Now that we have removed the percolator API, we should also remove the percolator type and use percolator fields instead.
- Improvements to how we score the _all field based on per-field boosts.
- The Lucene 6.0.0 release vote has passed and the bits were set free a few hours ago! Thank you Nick Knize for taking on the challenging role of release manager!
- `geo3d` improvements this week:
- Polygon queries now accept `Polygon...` inputs, including random nested test polygons, matching our geo2d implementations and respecting the order of polygon vertices
- `Geo3d` seems to sometimes incorrectly think a polygon is concave when it's really convex
- Adjacent polygon points can now be coplanar
- The unique `GeoPath` support, which matches all points within X distance of a specified path (think road trip, looking for sushi nearby), now has a simple factory API as well
- Tests were not adequately testing the new simple factory methods for common shapes
- `Geo3d` now uses a similar encode/decode quantization approach as
- After lively discussions, `geo3d` APIs no longer publicly expose classes and methods that could safely be private. APIs should start life private until proven worthy of being public!
- Many geo2d improvements as well:
- `LatLonPoint` polygon queries are faster using a cool pixelating grid approach, and we can do the same for
- We must improve debuggability of our geo test failures with nice 3D earth models like this example
- Here's a lively discussion about the pros and cons of having our geo tests quantize data only once
- Quantization issues are tricky, and geo2d queries were quantizing the edges of box queries incorrectly, resulting in false positive hits
- We have improved the geo2d tests to never allow "tolerance" on the returned results
- We have moved common geo encoding APIs to core so they can be shared across implementations
- Better random latitude/longitude generation for tests has exposed a tie-break bug in distance sorting, edge case bugs in box query, test bugs and polygon bugs
- `Polygon` classes have graduated into Lucene's core, to enable sharing across our numerous geo implementations
- A new encoding for `GeoPointField` will be consistent with `LatLonPoint`, and use all 64 available bits to minimize quantization error
- `GeoPointField` gets an efficient distance sort
- Randomized tests tried to create a too-big
- We will move `BaseGeoPointTestCase` from the spatial module to `test-framework`, allowing us to remove the dependency of the sandbox module on spatial
- `SloppyMath.haversin` can now move to
- The classification module now computes the f1-measure
- A previously commented out test assertion comes half way back to life
- Our "getting started with Lucene" docs were a bit buggy, but now fixed thanks to a user asking about it
- We've upgraded our randomizedtesting dependency to 2.3.4, so we get better messages when there is a static leak in our tests
- Points were missing from the
- `DataSplitter` in Lucene's classification module should pay attention to classes when splitting
- 800+ new top-level domains have been created since we last fixed `StandardTokenizer` to detect them, but we may need to wait for a JFlex release
- Highlighting fails to find terms inside the child query of a
- Lucene doesn't have direct support for boolean subset matching, but a number of possible workarounds may work
- `Math.toRadians` is changing its results slightly between Java 1.8 and 1.9
- A scary random test failure is hopefully caused by bad hardware or a buggy JVM
- `TestCoreParser` gets some small improvements
- A possibly new JVM bug causes a JVM crash when decoding postings
- `JapaneseTokenizer` should do a better job validating custom user-provided dictionaries
- Another iteration for codec level encryption; this patch uses a new initialization vector for each data block, and seems not to impact search performance
- Our release scripts still struggle with the switch from Subversion to git
- `BooleanQuery`'s explain method can lie about its score
- Another user falls into the unfortunately common trap of thinking Lucene's stored fields store all information about a field
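Several of the geo items above turn on quantization: latitude and longitude are stored as fixed-width integers, so an indexed point is not exactly the point you supplied. The following is a minimal sketch of that idea, not Lucene's actual encoding. The 32-bit width, the floor-based rounding, and the names `encode_lat` and `decode_lat` are all assumptions chosen for illustration.

```python
import math

# Assumption for illustration: 32 bits per latitude value.
BITS = 32
LAT_STEP = 180.0 / (1 << BITS)  # degrees covered by one quantized step


def encode_lat(lat: float) -> int:
    """Quantize a latitude in [-90, 90] to a signed 32-bit integer (hypothetical scheme)."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    # Floor to the start of the cell; clamp so +90 fits in the signed range.
    return min(math.floor(lat / LAT_STEP), (1 << (BITS - 1)) - 1)


def decode_lat(encoded: int) -> float:
    """Return the latitude at the start of the quantized cell."""
    return encoded * LAT_STEP


if __name__ == "__main__":
    lat = 40.7128
    q = encode_lat(lat)
    print("original :", lat)
    print("decoded  :", decode_lat(q))

    # Once a value has been quantized, re-encoding it is a no-op.
    # This is why a test may prefer to quantize its data exactly once.
    print("stable   :", encode_lat(decode_lat(q)) == q)

    # A point just beyond a box edge can land in the same cell as the edge.
    # Comparing only encoded values would then count it as inside the box.
    box_max = 10.0
    outside = box_max + LAT_STEP / 4
    print("same cell:", encode_lat(outside) == encode_lat(box_max))
```

Under these assumptions the round-trip error stays below one step (about 4.2e-8 degrees of latitude), and the last check shows one way a box query that compares only encoded values could return a false positive near its edge.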
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!