This Week in Elasticsearch and Apache Lucene - 2016-05-02
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
We “unlocked” some indexing performance in #elasticsearch github.com/elastic/elasticsearch/pull/18060 Coming to 2.4.0 and 5.0.0!
— Jason Tedor (@jasontedor) April 30, 2016
Elasticsearch Core
Changes in 2.x:
- Switching from a sliced lock to a keyed lock when preventing concurrent updates to the same document results in a 15-20% increase in indexing throughput on small metric-based documents.
- The restore API now supports `_all`.
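To illustrate the difference behind the locking change above: a sliced lock hashes each document ID onto a fixed array of locks, so unrelated documents can collide on the same slice, while a keyed lock keeps one lock per ID currently in use. The following is only a minimal sketch of the keyed idea, not the code from the pull request; the class name `SimpleKeyedLock` and its methods are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical keyed lock: one lock per key in use, dropped when nobody needs it. */
final class SimpleKeyedLock<K> {

    /** A lock plus the number of threads holding or waiting for it. */
    private static final class CountedLock extends ReentrantLock {
        int users; // only touched inside ConcurrentHashMap.compute for this key
    }

    private final ConcurrentHashMap<K, CountedLock> locks = new ConcurrentHashMap<>();

    /** Blocks until the calling thread holds the lock for this key. */
    void acquire(K key) {
        CountedLock lock = locks.compute(key, (k, existing) -> {
            CountedLock l = (existing == null) ? new CountedLock() : existing;
            l.users++; // register interest before blocking on the lock
            return l;
        });
        lock.lock();
    }

    /** Releases the lock and removes the map entry once the last user is gone. */
    void release(K key) {
        CountedLock lock = locks.get(key);
        if (lock == null || !lock.isHeldByCurrentThread()) {
            throw new IllegalStateException("lock for key is not held by this thread");
        }
        lock.unlock();
        locks.compute(key, (k, existing) -> (--existing.users == 0) ? null : existing);
    }

    /** Number of keys that currently have a lock allocated. */
    int activeKeys() {
        return locks.size();
    }
}
```

With this shape, two threads contend only when they update the same key, and the map never grows beyond the number of keys being updated at that moment.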
Changes in master:
- Deleted indices leave tombstones in the cluster state to prevent them coming back to life when disconnected nodes rejoin.
- The field stats API now accepts wildcards for field matching, and returns whether a field is searchable and/or aggregatable.
- Completion suggester fields from 2.x will still be usable in 5.x.
- The generic thread pool is now bounded, with a max pool size of 128. The queue size remains unbounded.
- The cluster allocation explain API now includes info from the shard stores API about why an existing shard copy may or may not be used for recovery.
- The `set` ingest processor gained the `override: true|false` parameter allowing it to set a default value only if none already exists.
- Ingest gained a `date_index_name` processor which understands date-math patterns.
- The top-level inner-hits query syntax has been removed, and bugs in the inline inner-hits syntax have been fixed.
- A new MatchNoDocsQuery means that queries that do not match anything will now explain why.
- ConstructingObjectParser is like ObjectParser, but supports constructing objects whose constructor arguments are mixed in with other arguments.
- Queries on fields marked as index:false now fail.
- The analyze API now accepts `filter` and `char_filter`, just like the analyzer settings.
- The exists() check for settings now handles multi-value settings correctly.
- Azure discovery now has integration tests.
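The `override` behaviour of the `set` processor mentioned above can be pictured with a small sketch. This is not the ingest implementation; it treats a document as a plain map, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of "set a field, optionally only as a default". */
final class SetProcessorSketch {

    /**
     * Writes value into field. With override == false, an existing
     * non-null value is left untouched, so the call only supplies a default.
     */
    static void set(Map<String, Object> doc, String field, Object value, boolean override) {
        if (!override && doc.get(field) != null) {
            return; // keep the value that is already there
        }
        doc.put(field, value);
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("env", "prod");
        set(doc, "env", "dev", false);   // existing value wins
        set(doc, "region", "eu", false); // missing field gets the default
        System.out.println(doc);
    }
}
```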
Ongoing:
- After adding IPv6 support, some work is required to add all features back to ip fields.
- Indexing requests will soon be able to wait until their changes are visible.
- Rally next steps: adding a full text search benchmark and enabling benchmarking with plugins.
- HTTP compression will be enabled by default, as long as the client requests it.
- Nodes will soon have persistent IDs which survive node restarts.
- Snapshot restore is getting a Google Cloud repository.
- Upgrading the AWS SDK and adding the `cloud.aws.s3.throttle_retries` setting to avoid socket timeout exceptions when restoring large shards.
Apache Lucene
- Lucene's geo benchmark, which tests a 61M point subset from OpenStreetMap for shape filtering, distance sorting, and nearest-neighbor search, is now running in Lucene's nightly benchmarks, documenting the impressive gains (for Lucene 6.1.0) in the past few weeks
- There's a sudden interest in optimizing how filters collect hits, since this is a hotspot in point queries, leading to reducing per-hit conditionals, using fewer passes in `LSBRadixSorter`, exploring different growth factors for the array holding all hits so far, giving geo2D points queries their own optimized filter builder, `MatchingPoints`, and avoiding re-computing cardinality of a filter's bitset when possible
- Query cache improvements: we should automatically warm new segments based on recently cached queries, reduce lock contention, remove a now unused parameter, re-use whole bitsets that the filter, such as `PointRangeQuery`, already built, and `MemoryIndex` should never even consider caching queries
- Should doc values optimize for the sparse use case?
- Index-time sorting should be better supported in Lucene's core
- Stats for dimensional points fields failed if one or more segments did not index points for the specified field
- `InetAddressPoint` exposes `nextUp/nextDown` APIs to make it easier to work with exclusive bounds
- `InetAddressPoint.newPrefixQuery` was broken if `prefixLength` was not an octet
- `Geo3d` improvements:
  - Optimize large polygon searches
  - Polygon hole intersections were broken, and we now detect if a hole is (illegally) outside its supposedly containing polygon
  - Real-world "dirty" polygons were causing problems
  - `Geo3D` should also offer distance sorting
  - Some fun test failures somehow involve Lagrange multipliers
- `Geo2D` (`LatLonPoint` and `GeoPointField`) improvements, including major gains for polygon searches:
  - Separate doc values from points, so that users can separately choose which function (sorting and/or shape filtering) they need
  - Instead of incrementing a counter for every hit, use the `grow` API
  - Points queries get their own optimized filter builder, `MatchingPoints`
  - Use faster orientation methods for polygon relations
  - Use a balanced interval tree for faster polygon relations, enabling us to remove `LatLonGrid`
  - Separate latitude/longitude quantization from Morton encoding
  - Better random latitude/longitude generation continues to uncover issues
- `FunctionQuery.explain` was not reporting its boost correctly
- Our release scripts should use `cherry-pick` to merge downstream changes
- Exotic patterns can cause `PatternReplaceCharFilter` to work too hard
- A spooky span queries test failure remains unexplained
- `ToParentBlockJoinQuery`'s explain is lame today and should instead include the explanation from its children
- The classic highlighter hits an exception if you use `NGramAnalyzer` and try to highlight a `PhraseQuery`
- A bug in the latest JDK 9 early access 110 build that broke `analyzers-common` tests has been fixed
- Java 9 and our randomized testing library gets angry about leftover static fields in tests
- The XML query parser is conflicted about whether lower and upper bounds on ranges are optional
- A spooky exception in `MoreLikeThisQuery` tests was just a test bug
- A new LSH (locality sensitive hashing) `TokenFilter` and query is an alternative to the standard `MoreLikeThisQuery`
- The flexible query parser does not clone its nodes correctly
- Building massive boolean queries is slow
- `XMLQueryParser` makes it hard for subclasses to create span queries
- Paging is tricky if you use index-time sorting
- Improve `XMLQueryParser` tests
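One of the Geo2D items above separates latitude/longitude quantization from Morton encoding. As a rough picture of why these are two independent steps, here is a minimal sketch: quantization maps a coordinate to a 32-bit integer, and Morton encoding interleaves the bits of two such integers into one 64-bit value. The class `MortonSketch`, its method names, and the particular quantization formula are assumptions for illustration and are not Lucene's encoding.

```java
/** Hypothetical sketch: quantize coordinates, then interleave their bits. */
final class MortonSketch {

    /** Maps value in [min, max] onto the full unsigned 32-bit range (as an int bit pattern). */
    static int quantize(double value, double min, double max) {
        if (value < min || value > max) {
            throw new IllegalArgumentException("value out of range: " + value);
        }
        long q = (long) Math.floor((value - min) / (max - min) * (1L << 32));
        return (int) Math.min(q, 0xFFFFFFFFL); // clamp the upper edge into range
    }

    /** Spreads the 32 bits of v so they occupy the even bit positions of a long. */
    static long spread(int v) {
        long x = v & 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8)) & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2)) & 0x3333333333333333L;
        x = (x | (x << 1)) & 0x5555555555555555L;
        return x;
    }

    /** Interleaves two ints: 'even' lands on even bit positions, 'odd' on odd ones. */
    static long interleave(int even, int odd) {
        return spread(even) | (spread(odd) << 1);
    }

    /** Full pipeline: quantize each coordinate, then Morton-encode the pair. */
    static long encode(double lat, double lon) {
        return interleave(quantize(lon, -180.0, 180.0), quantize(lat, -90.0, 90.0));
    }
}
```

Keeping the two steps apart means the quantized integers can be used on their own, with interleaving applied only where a single sortable value is needed.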
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!