September 21, 2018

This Week in Elasticsearch and Apache Lucene - 2018-09-21

Paul Sanwald

Elasticsearch

Highlights

Structured Audit Logging

We merged structured audit logging to master at the end of last week and have been working on the backport to 6.5. For 6.x, we have decided to keep the old (non-structured) audit log alongside the new format, so the backport requires reproducing the old format on top of the new code.

Append only indexing and cross-cluster replication

Users of time-based indices usually index documents once and never change them again (metrics, logs, etc.), relying on ES to generate IDs for them. In ES jargon, this indexing pattern is called append-only indexing, and we have various optimizations in place to support it. For cross-cluster replication, those optimizations are not available to the “following” index because, from its perspective, the documents already have IDs when they arrive. Without these optimizations, the following index won't be able to keep up.

We have had numerous discussions on how to support those optimizations on the following index and came up with an exciting plan: automatically detect operations that don't change any existing documents on the primary, capture this information, and transfer it to the replicas and then to the following shards. This means that these optimizations will automatically be enabled on the following indices even if the indexing operation originally had an ID, so following indices may be able to catch up faster.
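The idea can be illustrated with a toy model. The sketch below is not Elasticsearch code: `ToyShard`, `index_on_primary`, and the `append_only` flag are hypothetical names, and the dictionary lookup merely stands in for the costly "does this ID already exist?" check. It only shows how a flag captured on the primary could let a follower skip that check.

```python
class ToyShard:
    """Toy stand-in for a shard; all names here are hypothetical."""

    def __init__(self):
        self.docs = {}       # doc id -> source
        self.id_lookups = 0  # counts simulated "does this id exist?" checks

    def index(self, doc_id, source, append_only=False):
        if not append_only:
            # General path: pay for a lookup before writing.
            self.id_lookups += 1
            _ = doc_id in self.docs
        # Append-only path skips the lookup and just adds the document.
        self.docs[doc_id] = source


def index_on_primary(primary, doc_id, source):
    """Index on the primary and record whether an existing doc was replaced."""
    replaced = doc_id in primary.docs
    primary.docs[doc_id] = source
    # The flag travels with the operation to replicas and followers.
    return {"id": doc_id, "source": source, "append_only": not replaced}


def replay(ops, follower):
    """Apply the primary's operations on a follower, honoring the flag."""
    for op in ops:
        follower.index(op["id"], op["source"], append_only=op["append_only"])


primary, follower = ToyShard(), ToyShard()
ops = [
    index_on_primary(primary, "a", {"v": 1}),
    index_on_primary(primary, "b", {"v": 2}),
    index_on_primary(primary, "a", {"v": 3}),  # overwrites "a"
]
replay(ops, follower)
```

In this toy run only the third operation overwrites a document, so the follower performs a single simulated lookup even though every operation arrived with an explicit ID.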

Java Time migration

We have reworked Joda time backwards compatibility in scripts so that both the Joda and the Java time APIs can be used at the same time in 6.x, and users get an explicit deprecation warning for each method that will disappear on upgrade to 7.0. We introduced a date formatter interface to be able to parse milliseconds since epoch properly; each date formatter now also has a name, which is needed for mapping changes.

Performance

We have released Rally 1.0.1 with lots of stability fixes but also a few features like a new shrink index API.

Types removal

We have resumed work on types removal, coming up with a plan for REST tests migration, and moving tests and docs to typeless APIs.

Annotated text field type

We have merged work on a new field type, annotated_text. The purpose of this field type is to allow identification of entities in documents without the use of synonyms. It is implemented as an Elasticsearch plugin and includes a highlighter.

Token Filters

We have added two scriptable token filters: condition, which will only apply a given filter subchain if the current token matches a predicate; and predicate_token_filter, which will strip out tokens that don’t match a predicate. We will add a further scriptable filter that allows you to edit the byte content of a token, which should reduce the need to write Java plugins for custom analysis chains.
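The semantics of the two filters can be sketched in a few lines. This is a simplified illustration, not the Elasticsearch implementation: tokens are plain strings, the predicates are ordinary Python functions rather than scripts, and the names `conditional_filter` and `predicate_filter` are made up for this example.

```python
def conditional_filter(tokens, predicate, subchain):
    """Apply the filter subchain only to tokens that match the predicate."""
    for tok in tokens:
        if predicate(tok):
            for f in subchain:
                tok = f(tok)
        yield tok  # non-matching tokens pass through unchanged


def predicate_filter(tokens, predicate):
    """Drop tokens that do not match the predicate."""
    return (tok for tok in tokens if predicate(tok))


tokens = ["Quick", "BROWN", "fox", "a"]

# Lowercase only tokens longer than four characters.
conditioned = conditional_filter(tokens, lambda t: len(t) > 4, [str.lower])

# Then strip out single-character tokens.
result = list(predicate_filter(conditioned, lambda t: len(t) > 1))
```

Real token filters operate on token streams with positions and offsets, and a subchain may emit or drop tokens; the sketch ignores all of that to show only the control flow.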

We have also reworked how token filters that refer to other filters, or to their predecessors in the token chain, are constructed. This means that synonym filters now work correctly with conditional or multiplexer filters.

Changes

Changes in 6.4:

Allow to clear the fielddata cache per field #33807
Watcher: Ensure triggered watch deletion is sync #33799
Do not override named S3 client credentials #33793
Skip rebalancing when cluster_concurrent_rebalance threshold reached #33329

Changes in 6.5:

Introduce a search_throttled threadpool #33732
Restore local history from translog on promotion #33616
HLRC: Add support for reindex rethrottling #33832
Update geolite2 database in ingest geoip plugin #33840
ingest: support simulate with verbose for pipeline processor #33839
SQL: Fix ANTLR4 Grammar ambiguities. #33854
HLRC: Reindex should support requests_per_seconds parameter #33808
SQL: TRUNCATE and ROUND functions #33779
Add contains method to LocalCheckpointTracker #33871
Create a WatchStatus class for the high-level REST client. #33527
Move CompletionStats into the Engine #33847
Fix potential NPE in _cat/shards/ with partial CommonStats #33858
Allow TokenFilterFactories to rewrite themselves against their preceding chain #33702
[Tests] Nudge wait time in RemoteClusterServiceTests #33853
Fixing assertions in integration test #33833
SQL: Fix issue with options for QUERY() and MATCH(). #33828
Move DocsStats into Engine #33835
Use the global doc id to generate random scores #33599
Profiler: Don’t profile NEXTDOC for ConstantScoreQuery. #33196
SQL: Better handling of number parsing exceptions #33776
Ensure fully deleted segments are accounted for correctly #33757
SQL: Grammar tweak for number declarations #33767
Watcher: Use Bulkprocessor in HistoryStore/TriggeredWatchStore #32490
Dependencies: Update javax.mail in watcher to 1.6.2 #33664
Add create rollup job api to high level rest client #33521
Implement xpack.monitoring.elasticsearch.collection.enabled setting #33474
[Monitoring] Removing unused version.* fields #33584
DiskThresholdDecider#canAllocate can report negative free bytes #33641
Only notify ready global checkpoint listeners #33690
Structured audit logging #31931
[Kerberos] Add realm name & UPN to user metadata #33338
Deprecate negative weight in Function Score Query #33624
SQL: Return functions in JDBC driver metadata #33672

Changes in 7.0:

BREAKING: Add minimal sanity checks to custom/scripted similarities. #33564
BREAKING: Core: Default node.name to the hostname #33677
add RemoveCorruptedShardDataCommand #32281
TESTS: Set SO_LINGER = 0 for MockNioTransport #32560
Ensure realtime _get and _termvectors don’t run on the network thread #33814
New Annotated_text field type #30364
Don’t count hits via the collector if the hit count can be computed from index stats. #33701

Apache Lucene

Indexed shapes

After benchmarking BKD-based geo shapes, we were a bit disappointed by their performance, but when we looked into it we found considerable room for improvement!

One issue is that BKD trees are designed to index points. The goal of the BKD tree is to split the space into cells that contain equal numbers of points in such a way that filtering by any dimension or combination of dimensions is efficient. When indexing shapes and ranges, a better goal is to minimize intersection between cells, which is what R-trees do. Unfortunately, naively indexing the coordinates of a triangle as a 6D point in a BKD tree doesn't split the space efficiently. Nick is proposing to improve BKD trees so that only a subset of their dimensions is indexed. The underlying idea is that instead of indexing triangles as 6D points, we would index the minimum bounding rectangle of the triangle as a 4D point, and store the remaining information needed to reconstruct the triangle in 3 additional dimensions that would not be indexed. This makes the BKD tree essentially behave like an R-tree and gave a 46% speedup to shape search in the benchmark.
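A rough sketch of the bounding-rectangle idea follows. It is not Lucene's encoding: here the original vertices are simply carried alongside the rectangle as an unindexed payload, the "index" is a plain list rather than a tree, and all function names are hypothetical. It shows only that a 4D rectangle is enough to prefilter candidate triangles.

```python
def triangle_mbr(tri):
    """Minimum bounding rectangle of a triangle as (min_x, min_y, max_x, max_y)."""
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_intersect(a, b):
    """True if two (min_x, min_y, max_x, max_y) rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidates(index, query_box):
    """Prefilter: keep triangles whose bounding rectangle meets the query box."""
    return [tri for mbr, tri in index if boxes_intersect(mbr, query_box)]


triangles = [
    ((0, 0), (2, 0), (1, 3)),
    ((10, 10), (12, 11), (11, 14)),
]
# Each entry pairs the indexed 4D value with the unindexed triangle payload.
index = [(triangle_mbr(t), t) for t in triangles]

hits = candidates(index, (1, 1, 3, 2))
```

The rectangle test can return false positives, so an exact triangle-versus-query check would still run on the surviving candidates using the stored payload.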

Another issue is that the bounding box of points for a sub-tree that the BKD tree records in its index is sometimes larger than the minimum bounding rectangle of those points, especially if there is correlation between dimensions, which is common when storing ranges or shapes. Ignacio addressed this performance issue by making sure to record the minimum bounding rectangle that contains all points on every leaf of the BKD tree. Again, this brought significant speedups, up to 10x in some cases.
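A small example makes the difference concrete. The sketch below is a simplification and not Lucene code: it performs a single median split on the first dimension over 2D points that represent ranges as (min, max), so the two dimensions are strongly correlated. The helper names are hypothetical.

```python
def mbr(points):
    """Minimum bounding rectangle of 2D points as (min_x, min_y, max_x, max_y)."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def split_on_x(points, cell):
    """Median split on x; each half inherits its cell bounds from the parent."""
    pts = sorted(points)
    mid = len(pts) // 2
    split = pts[mid][0]
    left_cell = (cell[0], cell[1], split, cell[3])
    right_cell = (split, cell[1], cell[2], cell[3])
    return (left_cell, pts[:mid]), (right_cell, pts[mid:])


def boxes_intersect(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Ranges stored as (min, max) points, so y is always close to x.
points = [(0, 1), (1, 2), (2, 3), (8, 9), (9, 10), (10, 11)]
(left_cell, left_points), _ = split_on_x(points, mbr(points))
left_tight = mbr(left_points)

query = (4, 4, 7, 7)
must_visit_with_split_cell = boxes_intersect(left_cell, query)
must_visit_with_tight_box = boxes_intersect(left_tight, query)
```

With the bounds inherited from the split, the left leaf looks like it might match the query and has to be visited; with the minimum bounding rectangle recorded on the leaf, it can be skipped outright.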

Other

We are contributing an Arabic stemmer.
We are exploring separating features from queries.
We propose to remove LowercaseTokenizer, which would in turn help improve the normalization API.
