05 February 2018

This Week in Elasticsearch and Apache Lucene - 2018-02-05

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Elasticsearch SQL Plugin

The SQL plugin should be merged into master this week. It provides a full-blown (alpha) SQL engine, covering parsing, analysis and optimization, that supports read-only queries against Elasticsearch indices. Features include:

Projections (SELECT)
Filtering (WHERE)
Sorting (ORDER BY)
Grouping (GROUP BY), including filtering (HAVING)
Scalar (ABS, SIN, COS, ...) and aggregate (MAX, MIN, AVG, ...) functions, plus arbitrary math (SELECT MAX(salary)-MIN(salary)/COUNT(*) + 4)
Full-text search (QUERY, MATCH)

The plugin will ship with three drivers:

REST/HTTP, which accepts a SQL query wrapped in JSON and returns results in JSON; it could be used by Canvas or Kibana
CLI, which provides a command-line text interface
JDBC, for interfacing with Java applications

An ODBC driver is in the works.
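As a rough illustration of the REST/HTTP driver's "SQL query wrapped in JSON" idea, here is a minimal Python sketch that builds such a request body. The post does not show the wire format, so the `query` key, the `employees` index and the helper name `build_sql_request` are assumptions made for illustration. The endpoint path is deliberately left out.

```python
import json


def build_sql_request(sql: str) -> str:
    """Wrap a SQL statement in a JSON body (assumed shape: {"query": "<sql>"})."""
    return json.dumps({"query": sql})


# Hypothetical read-only query against a hypothetical "employees" index.
sql = "SELECT name, salary FROM employees WHERE salary > 50000 ORDER BY salary"
body = build_sql_request(sql)
print(body)
```

The body would then be sent with any HTTP client to whichever endpoint the plugin exposes once it is merged.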

New option to control whether partial results are allowed in search requests #27435

When executing a search today, we return results from as many shards as we can, and we include a _shards section in the result body to indicate how many shards should have been searched and how many shards were actually searched. Reasons for a shard failing to return results include:

The search times out on the shard
An error occurs executing the search for the shard (including errors like a missing geo or nested field)
The shard is red and there are no allocated shard copies that the search can be performed on

The reasoning behind returning partial results is this: imagine you are retrieving social-media-style updates from a user's friends. If one shard is down, it may be better to display some updates than none at all. However, this logic can be the wrong choice when showing, for example, analytics: a graph of total visits based on partial data is simply wrong, and it relies on the user checking the _shards section in the response (which almost nobody does) to know whether the results are meaningful.
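For readers who do want to check the _shards section themselves, a client-side check could look roughly like the Python sketch below. The helper name `is_partial` and the sample response values are hypothetical; the sketch assumes the response carries `total`, `successful` and `failed` counters under `_shards`, plus a top-level `timed_out` flag.

```python
def is_partial(response: dict) -> bool:
    """Return True if the search response does not cover every shard."""
    shards = response.get("_shards", {})
    total = shards.get("total", 0)
    successful = shards.get("successful", 0)
    failed = shards.get("failed", 0)
    # A timeout, a reported failure or a shortfall all mean partial data.
    return bool(response.get("timed_out", False)) or failed > 0 or successful < total


# Hypothetical response in which one of five shards failed.
sample = {
    "timed_out": False,
    "_shards": {"total": 5, "successful": 4, "skipped": 0, "failed": 1},
}
print(is_partial(sample))
```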

We have added a new query parameter to the search API, allow_partial_search_results, which controls whether results should still be returned if the search fails for any reason on one or more shards. When it is set to true, partial results are allowed and results are returned even if not all shards completed successfully. When it is set to false, an exception is thrown if the search fails on any shard.
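A minimal Python sketch of passing the flag as a query parameter follows. It uses the name from the change list further down, allow_partial_search_results. The base URL, the `visits` index and the helper `search_url` are assumptions for illustration only.

```python
from urllib.parse import urlencode


def search_url(base_url: str, index: str, allow_partial: bool) -> str:
    """Build a _search URL with the partial-results flag set explicitly."""
    params = {"allow_partial_search_results": "true" if allow_partial else "false"}
    return f"{base_url}/{index}/_search?{urlencode(params)}"


# An analytics-style request that should fail rather than return partial data.
url = search_url("http://localhost:9200", "visits", allow_partial=False)
print(url)
```

Setting the flag explicitly on each request means the caller's behaviour does not depend on whatever the default ends up being.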

The default is currently true, and there is an ongoing discussion about whether we should change the default to false in the future. This decision is contentious because it would be a big breaking change, and allowing partial results is not always a bad decision. For instance, imagine a user has a field foo which is mapped as a text field in older indices and as a keyword field in newer indices. Today, Kibana can run a terms aggregation on the newer indices and just ignore the exceptions from the older indices with the incorrect mapping. Perhaps there is another way of solving this particular issue while still benefiting from allow_partial_search_results=false.

Changes in 5.6:

REST high-level client: Fix parsing of script fields #28395
X-Pack:
- [Security] Clear Realm Caches on role mapping health change #3782

Changes in 6.2:

X-Pack:
- Watcher: Ensure state is cleaned properly in watcher life cycle service #3770

Changes in 6.3:

BREAKING: Add a shallow copy method to aggregation builders #28430
Search - new flag: allow_partial_search_results #27906
Add ability to index prefixes on text fields #28290
Move persistent tasks to core #28455
Allows failing shards without marking as stale #28054
Scripts: Fix security for deprecation warning #28485
Forbid trappy methods from java.time #28476
Synced-flush should not seal index of out of sync replicas #28464
Replicate writes only to fully initialized shards #28049
Remove Painless Type From Locals, Variables, Params, and ScriptInfo #28471
Remove RuntimeClass from Painless Definition in favor of Painless Struct #28486
Remove Painless Type From Painless Method/Field #28466
Remove Painless Type in favor of Java Class in FunctionRef #28429
Remove Painless Type from e-nodes in favor of Java Class #28364
Further Removal of Painless Type from Lambdas and References. #28433
Add lower bound for translog flush threshold #28382
REST high-level client: add support for split and shrink index API #28425
Add support for indices exists to REST high level client #27384
Add ranking evaluation API to High Level Rest Client #28357
Java high-level REST : minor code clean up #28409
Do not take duplicate query extractions into account for minimum_should_match attribute #28353
Fix AIOOB on indexed geo_shape query #28458
Replace Bits with new abstract class (#24088) #28334
Suppress assertions about rounding of times near overlapping days #28151
XContent: Factor deprecation handling into callback #28449
X-Pack:
- Watcher: Add support for scheme in proxy configuration #3614
- XContent: Adapt to new method on parser #3797
- [Security] Correct DN matches in role-mapping rules #3704

Changes in 7.0:

BREAKING: remove deprecated percolator map_unmapped_fields_as_string setting #28060
Add allow_partial_search_results flag to search requests with default setting true #28440
BREAKING: Remove tribe node support #28443
X-Pack:
- BREAKING: Remove all tribe related code, comments and documentation #3784

Apache Lucene

Multi-release JAR to take advantage of Java 9 optimizations

After several performance testing iterations to better understand potential impacts of this change, Lucene will now build a multi-release JAR in order to take advantage of some new APIs introduced in Java 9 like Objects.checkIndex and Arrays.mismatch, which can't be implemented as efficiently with Java 8.

The build still works with Java 8: this change is implemented through two new classes, FutureObjects and FutureArrays, which are functionally compatible with Java 9's Objects and Arrays. The build then creates the Java 9 classes with ASM by remapping calls to FutureObjects/FutureArrays onto Objects/Arrays.
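To make the two APIs concrete, here is a small Python sketch of what Java 9's Objects.checkIndex and Arrays.mismatch compute. It illustrates the semantics only and is not Lucene's FutureObjects/FutureArrays code; the function names `check_index` and `mismatch` are chosen for this example.

```python
def check_index(index: int, length: int) -> int:
    """Return index if 0 <= index < length, otherwise raise (as checkIndex does)."""
    if index < 0 or index >= length:
        raise IndexError(f"Index {index} out of bounds for length {length}")
    return index


def mismatch(a: bytes, b: bytes) -> int:
    """Return the first position where a and b differ, or -1 if they are equal.

    If one input is a proper prefix of the other, the length of the
    shorter input is returned.
    """
    shortest = min(len(a), len(b))
    for i in range(shortest):
        if a[i] != b[i]:
            return i
    return -1 if len(a) == len(b) else shortest


print(mismatch(b"lucene", b"lucent"))
```

A portable fallback has to do this work with an explicit loop like the one above, which is the kind of code the Java 9 methods are meant to replace.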

We now need to double down on testing with both Java 8 and Java 9 since different code might run depending on the JVM version.

Other

Indexing impacts didn't hurt indexing throughput, term queries or disjunctions, but it did slow down conjunctions a bit (~4%) and CheckIndex significantly (more than 2x).
UnifiedHighlighter could expose the raw offsets of matches.
GeoPolygonFactory sometimes fails to recognize convex polygons.
The ALWAYS_CACHE policy is useful for testing, but we should remove it from the public API, as it would do more harm than good: it would likely cache filters that are not reused.
NRTCachingDirectory has unnecessary leniency in case the file to create already exists.
We don't always consume doc-value iterators as efficiently as we could.
We should better contain logic that is only necessary to have a clear exception when opening pre-5.0 Lucene indices.
CheckIndex needs to better validate that doc-value iterators behave consistently.
Disallowing changes to index options on the fly will help fix a relevancy bug.
DirectSpellChecker needs to better validate parameters.
ShingleFilter should be improved to support synonyms.
SpanBoostQuery serves no purpose and should be removed.
