This Week in Elasticsearch and Apache Lucene - 2018-02-05
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Elasticsearch SQL Plugin
The SQL plugin should be merged into master this week. It provides a full-blown (alpha) SQL engine (covering parsing, analysis and optimization) that supports read-only queries against Elasticsearch indices. Features include:
- projections (SELECT),
- filtering (WHERE),
- sorting (ORDER BY),
- grouping (GROUP BY), including filtering of groups (HAVING),
- scalar (ABS, SIN, COS, ...) and aggregate (MAX, MIN, AVG, ...) functions, as well as arbitrary math (SELECT MAX(salary)-MIN(salary)/COUNT(*) + 4),
- and full-text search (QUERY, MATCH).
The plugin will ship with three drivers:
- REST/HTTP, which accepts an SQL query wrapped in JSON and returns results in JSON; it could be used by Canvas or Kibana
- CLI, which provides a text-based command-line interface
- JDBC, for interfacing with Java applications
An ODBC driver is in the works.
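As a rough illustration of "an SQL query wrapped in JSON", the sketch below builds (but does not send) such a request in Python. The host, the `/_xpack/sql` endpoint path, the `query` key of the JSON body, and the `emp` index are assumptions made for this example and are not confirmed by the text above; check the plugin documentation for the exact request format.

```python
import json
from urllib.request import Request


def build_sql_request(sql, base_url="http://localhost:9200", endpoint="/_xpack/sql"):
    """Wrap an SQL string in a JSON body and prepare a POST request.

    base_url, endpoint and the "query" key are assumptions for illustration.
    The request is only constructed here, never sent.
    """
    body = json.dumps({"query": sql}).encode("utf-8")
    return Request(
        base_url + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical index name "emp" with a numeric "salary" field.
req = build_sql_request("SELECT MAX(salary) - MIN(salary) FROM emp")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the prepared request to `urllib.request.urlopen` would send it to a running cluster.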
New option to control whether partial results are allowed in search requests #27435
When executing a search today, we return results from as many shards as we can, and we include a _shards section in the result body to indicate how many shards should have been searched and how many shards were actually searched. Reasons for a shard failing to return results include:
- The search times out on the shard
- An error occurs executing the search for the shard (including errors like a missing geo or nested field)
- The shard is red and there are no allocated shard copies that the search can be performed on
To explain the reasoning for this, imagine you are retrieving social-media-style updates from a user's friends: if one shard is down, it may be better to display some updates instead of showing none at all. However, this logic could be the wrong choice when showing, for example, analytics: a graph of total visits based on partial data is just wrong, and it relies on the user checking the _shards section in the response (which almost nobody does) to know whether they are seeing meaningful results or not.
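Checking the _shards section programmatically is cheap. The following is a minimal sketch, assuming the response has already been parsed into a Python dict and that the section exposes `total`, `successful` and `failed` counters (field names assumed here, not stated in the text above):

```python
def shard_summary(response):
    """Summarise the _shards section of a parsed search response.

    The "total", "successful" and "failed" field names are assumptions.
    A response counts as partial when fewer shards succeeded than were
    supposed to be searched.
    """
    shards = response.get("_shards", {})
    total = shards.get("total", 0)
    successful = shards.get("successful", 0)
    failed = shards.get("failed", 0)
    return {
        "total": total,
        "successful": successful,
        "failed": failed,
        "partial": successful < total,
    }


# Made-up response in which one of five shards did not answer.
example = {"_shards": {"total": 5, "successful": 4, "failed": 1}, "hits": {"hits": []}}
summary = shard_summary(example)
if summary["partial"]:
    print("warning: results cover only %d of %d shards"
          % (summary["successful"], summary["total"]))
```

An analytics dashboard could use a check like this to refuse to draw a graph from incomplete data.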
We have added a new query parameter to the search API called allow_partial_search_results, which controls whether results should still be returned if the search fails for any reason on one or more shards. When set to true, partial results are allowed and results will be returned even if not all shards completed successfully. If the parameter is set to false, an exception is thrown if the search fails on any shard.
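The semantics of the flag can be modelled in a few lines. This is a toy model only, not Elasticsearch code: the `SearchFailed` exception, the `merge_shard_results` function and the use of `None` to represent a failed shard are all hypothetical names and conventions chosen for illustration.

```python
class SearchFailed(Exception):
    """Hypothetical exception raised when partial results are not allowed."""


def merge_shard_results(shard_results, allow_partial=True):
    """Toy model of the partial-results flag.

    shard_results holds one entry per shard: a list of hits, or None if the
    search failed on that shard (timeout, error, no allocated copy).
    """
    failed = sum(1 for result in shard_results if result is None)
    if failed and not allow_partial:
        raise SearchFailed("%d of %d shards failed" % (failed, len(shard_results)))
    hits = [hit for result in shard_results if result is not None for hit in result]
    return {
        "_shards": {
            "total": len(shard_results),
            "successful": len(shard_results) - failed,
            "failed": failed,
        },
        "hits": hits,
    }


results = [["doc1", "doc2"], None, ["doc3"]]  # the second shard failed
print(merge_shard_results(results, allow_partial=True)["hits"])
try:
    merge_shard_results(results, allow_partial=False)
except SearchFailed as exc:
    print("search rejected:", exc)
```

With the flag set to true the caller gets whatever the healthy shards returned; with it set to false the same input produces an error instead of a silently incomplete answer.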
The default is currently true, and there is an ongoing discussion about whether we should consider changing the default to false in the future. This decision is contentious because it is a big breaking change and allowing partial results might not always be a bad decision. For instance, imagine a user has a field foo which is mapped as a text field on older indices and a keyword field in newer indices. Today, Kibana can run a terms aggregation on the newer indices and just ignore the exceptions from the older indices with the incorrect mapping. Perhaps there is another way of solving this particular issue while still benefiting from
Changes in 5.6:
- REST high-level client: Fix parsing of script fields #28395
- [Security] Clear Realm Caches on role mapping health change #3782
Changes in 6.2:
- Watcher: Ensure state is cleaned properly in watcher life cycle service #3770
Changes in 6.3:
- BREAKING: Add a shallow copy method to aggregation builders #28430
- Search - new flag: allow_partial_search_results #27906
- Add ability to index prefixes on text fields #28290
- Move persistent tasks to core #28455
- Allows failing shards without marking as stale #28054
- Scripts: Fix security for deprecation warning #28485
- Forbid trappy methods from java.time #28476
- Synced-flush should not seal index of out of sync replicas #28464
- Replicate writes only to fully initialized shards #28049
- Remove Painless Type From Locals, Variables, Params, and ScriptInfo #28471
- Remove RuntimeClass from Painless Definition in favor of Painless Struct #28486
- Remove Painless Type From Painless Method/Field #28466
- Remove Painless Type in favor of Java Class in FunctionRef #28429
- Remove Painless Type from e-nodes in favor of Java Class #28364
- Further Removal of Painless Type from Lambdas and References. #28433
- Add lower bound for translog flush threshold #28382
- REST high-level client: add support for split and shrink index API #28425
- Add support for indices exists to REST high level client #27384
- Add ranking evaluation API to High Level Rest Client #28357
- Java high-level REST : minor code clean up #28409
- Do not take duplicate query extractions into account for minimum_should_match attribute #28353
- Fix AIOOB on indexed geo_shape query #28458
- Replace Bits with new abstract class (#24088) #28334
- Suppress assertions about rounding of times near overlapping days #28151
- XContent: Factor deprecation handling into callback #28449
Changes in 7.0:
- BREAKING: remove deprecated percolator map_unmapped_fields_as_string setting #28060
- Add allow_partial_search_results flag to search requests with default setting true #28440
- BREAKING: Remove tribe node support #28443
- BREAKING: Remove all tribe related code, comments and documentation #3784
Multi-release JAR to take advantage of Java 9 optimizations
After several performance testing iterations to better understand potential impacts of this change, Lucene will now build a multi-release JAR in order to take advantage of some new APIs introduced in Java 9 like Arrays.mismatch, which can't be implemented as efficiently with Java 8.
The build still works with Java 8: this change is implemented through two new classes FutureArrays which are functionally compatible with Java 9's Arrays. Then the build creates the Java 9 classes with ASM by remapping calls to FutureArrays with calls to
We now need to double down on testing with both Java 8 and Java 9 since different code might run depending on the JVM version.
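For readers unfamiliar with Arrays.mismatch, the sketch below is a plain Python reference version of what the two-array form of that Java 9 method computes: the index of the first position where two arrays differ, the length of the shorter array if one is a proper prefix of the other, or -1 if they are equal. It illustrates the semantics only and says nothing about how Lucene or the JDK implement it.

```python
def mismatch(a, b):
    """Return the index of the first differing element of a and b.

    Returns -1 if both sequences are equal. If one is a proper prefix of
    the other, returns the length of the shorter one.
    """
    shared = min(len(a), len(b))
    for i in range(shared):
        if a[i] != b[i]:
            return i
    return -1 if len(a) == len(b) else shared


print(mismatch(b"lucene", b"lucent"))  # differs at the last byte
print(mismatch(b"doc", b"docs"))       # first is a prefix of the second
print(mismatch(b"same", b"same"))      # equal arrays
```

The return value doubles as the length of the common prefix whenever the arrays differ, which is handy when comparing sorted byte sequences.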
- Indexing impacts didn't hurt indexing throughput, term queries or disjunctions, but did slow down conjunctions a bit (~4%) and CheckIndex significantly (more than 2x).
- UnifiedHighlighter could expose the raw offsets of matches.
- GeoPolygonFactory sometimes fails to recognize convex polygons.
- The ALWAYS_CACHE policy is useful for testing, but we should remove it from the public API, as it would do more harm than good: it would likely cache filters that are not reused.
- NRTCachingDirectory has unnecessary leniency in case the file to create already exists.
- We don't always consume doc-value iterators as efficiently as we could.
- We should better contain logic that is only necessary to have a clear exception when opening pre-5.0 Lucene indices.
- CheckIndex needs to better validate that doc-value iterators behave consistently.
- Disallowing changes to index options on the fly will help fix a relevancy bug.
- DirectSpellChecker needs to better validate parameters.
- ShingleFilter should be improved to support synonyms.
- SpanBoostQuery serves no purpose and should be removed.