11 December 2017

This Week in Elasticsearch and Apache Lucene - 2017-12-11

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Soft limit in Aggregations for max number of buckets

A new dynamic cluster setting in 7.0, search.max_buckets, tracks the number of buckets created by an aggregation request and fails the request once that number exceeds the limit.

The default in 7.0 is 10,000, which means that any request that tries to return more buckets than this will fail. The number of buckets is checked on the coordinating node and also on each shard when the response is built. This setting should help to reject bad requests that were not caught by the circuit breaker but could still push memory use over the limit.

This setting will also be present in 6.2, but it will be disabled by default. In that version, requests that go over the 7.0 default limit (10,000) will log a deprecation warning in order to prepare users for the migration to 7.0.
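The counting behaviour described above can be pictured with a small sketch. This is not the Elasticsearch implementation: the class and exception names (`BucketLimiter`, `TooManyBuckets`) are hypothetical, and the only thing taken from the text is the idea of a running bucket count that fails the request once it goes over a configured limit.

```python
class TooManyBuckets(Exception):
    """Hypothetical error raised when a request creates too many buckets."""


class BucketLimiter:
    """Toy model of a per-request bucket counter (names are assumptions)."""

    def __init__(self, limit=10_000):
        self.limit = limit
        self.count = 0

    def accept(self, new_buckets):
        # Called each time an aggregation creates more buckets.
        self.count += new_buckets
        if self.count > self.limit:
            raise TooManyBuckets(
                f"tried to create {self.count} buckets, limit is {self.limit}"
            )


def run_request(bucket_batches, limit):
    """Return True if the request stays within the limit, False otherwise."""
    limiter = BucketLimiter(limit)
    try:
        for batch in bucket_batches:
            limiter.accept(batch)
    except TooManyBuckets:
        return False
    return True
```

In this model a request that creates exactly `limit` buckets still succeeds, and the first bucket beyond the limit fails it. Since the setting is dynamic, it can be changed on a running cluster through the cluster update settings API.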

Mappings now filtered by X-Pack Security

X-Pack Security will now filter the mapping fields returned by the get index, get mappings, get field mappings and field capabilities APIs. This means that fields that a user cannot access due to field-level security will no longer be returned from these APIs.
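To make the effect concrete, here is a minimal sketch of filtering a mapping down to the fields a user has been granted. It is an illustration only: the function name `filter_mapping`, the use of shell-style wildcards via `fnmatch`, and the sample mapping are all assumptions, not the X-Pack code path.

```python
from fnmatch import fnmatchcase


def filter_mapping(properties, granted, prefix=""):
    """Keep only the fields whose full path matches a granted pattern.

    `properties` is the "properties" dict of a mapping; `granted` is a list
    of field-name patterns (hypothetical stand-in for field-level security).
    """
    visible = {}
    for name, definition in properties.items():
        path = prefix + name
        children = definition.get("properties")
        if children is not None:
            # Object field: recurse, and keep it only if a child survives.
            kept = filter_mapping(children, granted, path + ".")
            if kept:
                visible[name] = {**definition, "properties": kept}
        elif any(fnmatchcase(path, pattern) for pattern in granted):
            visible[name] = definition
    return visible


mapping = {
    "title": {"type": "text"},
    "salary": {"type": "long"},
    "user": {
        "properties": {
            "name": {"type": "keyword"},
            "ssn": {"type": "keyword"},
        }
    },
}
filtered = filter_mapping(mapping, ["title", "user.n*"])
```

With the grants above, `salary` and `user.ssn` disappear from the returned mapping while `title` and `user.name` remain.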

Removing wildcard and archived settings

Since 5.0, it has been possible to remove index or cluster settings by setting them to null. This should have worked for wildcard settings too (e.g. foo.bar.*), but a bug prevented that from working correctly. Separately, unknown settings from 2.x indices or clusters were moved to the archived.* namespace, but it was impossible to delete archived.* settings from indices because the key was rewritten to index.archived.*. Both of these issues have been fixed in 5.6.
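The intended semantics can be sketched as follows: a null value removes a setting, and a wildcard key removes every setting it matches. The helper name `apply_update` and the sample keys are hypothetical, and the matching here uses Python's `fnmatch` rather than whatever Elasticsearch does internally.

```python
from fnmatch import fnmatchcase


def apply_update(current, update):
    """Apply a settings update where None (JSON null) means "remove".

    Keys in `update` may contain wildcards such as "foo.bar.*".
    """
    result = dict(current)
    for key, value in update.items():
        if value is None:
            # Remove every existing setting matched by the (wildcard) key.
            for existing in [k for k in result if fnmatchcase(k, key)]:
                del result[existing]
        else:
            result[key] = value
    return result


current = {
    "foo.bar.a": "1",
    "foo.bar.b": "2",
    "foo.baz": "3",
    "archived.old.setting": "x",
}
# Equivalent in spirit to sending {"foo.bar.*": null, "archived.*": null}.
updated = apply_update(current, {"foo.bar.*": None, "archived.*": None})
```

After the update only `foo.baz` is left, which is the behaviour the fix in 5.6 is meant to deliver for both wildcard and archived settings.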

JVM Options syntax change

The jvm.options file syntax has been extended in 6.2 in order to cope with a breaking change in command-line arguments in Java 9. Each option can now be preceded by a JVM major version or version range, so that it is only applied on matching JVMs.
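As a rough illustration of version-prefixed options, the sketch below selects the options that apply to a given JVM major version. The line shapes it accepts (`8:-opt` for exactly 8, `8-:-opt` for 8 and above, `8-9:-opt` for a closed range, and a bare `-opt` for every version) are assumptions made for this example rather than a quote of the file format, and `select_options` is a hypothetical helper, not part of Elasticsearch.

```python
import re

# Optional "N:", "N-:" or "N-M:" prefix, followed by an option starting with "-".
_LINE = re.compile(r"^(?:(\d+)(-)?(\d+)?:)?(-.*)$")


def select_options(lines, jvm_major):
    """Return the options from `lines` that apply to `jvm_major`."""
    selected = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"unparseable line: {raw!r}")
        lower, dash, upper, option = match.groups()
        if lower is None:
            applies = True                      # no prefix: every version
        elif dash is None:
            applies = jvm_major == int(lower)   # "8:"   exact version
        elif upper is None:
            applies = jvm_major >= int(lower)   # "8-:"  open-ended range
        else:
            applies = int(lower) <= jvm_major <= int(upper)  # "8-9:"
        if applies:
            selected.append(option)
    return selected


example = [
    "# GC logging differs between Java 8 and Java 9",
    "-Xms1g",
    "8:-XX:+PrintGCDetails",
    "9-:-Xlog:gc",
    "8-9:-Dfoo=bar",
]
```

Calling `select_options(example, 8)` keeps the Java 8 flag and drops the Java 9 one, and the reverse happens for version 9, which is the kind of split the new syntax is meant to allow within a single file.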

Changes in 5.6:

Fix routing with leading or trailing whitespace #27712
Allow index settings to be reset by wildcards #27671
Prevent constructing an index template without index patterns #27662
X-Pack:
- Grant Netty necessary permissions #3247
- Watcher: Ensure watcher thread pool size is reasonably bound #3056
- Watcher: Fix pagerduty action to send context data #3185

Changes in 6.0:

Only fsync global checkpoint if needed #27652
Obey translog durability in global checkpoint sync #27641
X-Pack:
- Do not enforce TLS if discovery type is single-node #3245

Changes in 6.1:

Correct two equality checks on incomparable types #27688
Include internal refreshes in refresh stats #27615
Catch InvalidPathException in IcuCollationTokenFilterFactory #27202
X-Pack:
- [Security] Don’t deprecate certgen in 6.1 #3201

Changes in 6.2:

Do not open indices with broken settings #26995
Implement byte array reusage in NioTransport #27696
Cleanup split strings by comma method #27715
Add read timeouts to http module #27713
Add missing s to tmpdir name #27721
Add Open Index API to the high level REST client #27574
Added Create Index support to high-level REST client #27351
Fix issue where the incorrect buffers are written #27695
Extend JVM options to support multiple versions #27675
Introduce resizable inbound byte buffer #27551
Add a new cluster setting to limit the total number of buckets returned by a request #27581
Add validation of keystore setting names #27626
Add support for filtering mappings fields #27603
[Geo] Add Well Known Text (WKT) Parsing Support to ShapeBuilders #27417
Add node name to thread pool executor name #27663
Simplify rejected execution exception #27664
Add msearch api to high level client #27274
Detect mktemp from coreutils #27659
Add explicit coreutils dependency #27660
Fix term vectors generator with keyword and normalizer #27608
Fix highlighting on a keyword field that defines a normalizer #27604
Tighten the CountedBitSet class - Followup of #27547 #27632
X-Pack:
- [Security] Make generated passwords safe to be used in shell scripts #3253
- Add API for SSL certificate information #3088

Changes in 7.0:

X-Pack:
- Filter mappings fields when field level security is configured #3173

Apache Lucene

Lucene 7.2.0

We are fixing some last-minute bugs before building the first release candidate. We should hopefully have a release in the coming weeks.

Dynamic pruning of non-competitive hits

LUCENE-4100 was merged, allowing for great speedups when only the top matches are needed. This optimization requires scorers to expose a maximum score that they may contribute and currently only works well with BM25, but we are working on fixing other similarities to work well with this optimization and improving their explanations.

We are also looking into how we could record the per-term maximum term frequencies, or maybe even the per-norm, per-term maximum term frequencies, in order to get tighter upper bounds on the produced scores, which would in turn make the optimization more efficient.

It turns out the same API could be used to speed up phrase queries when the term frequency of the most frequent term is not high enough to produce a competitive score.
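The idea behind pruning non-competitive hits can be shown with a toy model. This is not Lucene's implementation: the function name, the data layout (each document carries its per-term score contributions) and the way per-term maximum scores are computed are assumptions made for the example. What it does share with the description above is the principle that a document is skipped when the upper bound on its score cannot beat the weakest hit currently in the top k.

```python
import heapq


def top_k_with_pruning(docs, k):
    """Return (top hits, number of docs that were actually scored).

    `docs` is a list of (doc_id, {term: score contribution}) pairs.
    """
    # Per-term upper bound: the largest contribution the term makes anywhere.
    max_score = {}
    for _, contributions in docs:
        for term, value in contributions.items():
            max_score[term] = max(max_score.get(term, 0.0), value)

    heap = []   # min-heap of (score, doc_id); heap[0] is the weakest top hit
    scored = 0
    for doc_id, contributions in docs:
        upper_bound = sum(max_score[term] for term in contributions)
        if len(heap) == k and upper_bound <= heap[0][0]:
            continue  # cannot beat the weakest of the current top k
        scored += 1
        score = sum(contributions.values())
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True), scored


docs = [
    (0, {"a": 1.0, "b": 2.0}),
    (1, {"a": 0.5}),
    (2, {"b": 1.5}),
    (3, {"a": 1.0, "b": 0.5}),
]
hits, scored = top_k_with_pruning(docs, k=1)
```

In this toy the "real" score is as cheap to compute as the bound, so nothing is saved; the point is only that tighter per-term bounds let more documents be skipped, which is why recording better maximum term frequencies would help.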

Finer-grained flushing

At index time, documents are first buffered in an in-memory index buffer. Seeing it as a single buffer is a bit of a simplification though: there is actually a set of index buffers. This helps with concurrency since different threads can write to different index buffers concurrently. When refreshing, each index buffer writes a segment.

In order to make multi-tenancy easy, Elasticsearch adds an abstraction layer on top of Lucene called IndexingMemoryController. It tries to let each shard use as much memory as possible for these buffers at index time, because that makes indexing faster, while also ensuring that the total amount of memory spent on index buffers across shards does not exceed indices.memory.index_buffer_size.

The issue is that when this shared limit is reached, Elasticsearch tells Lucene to do a refresh, which writes _all_ per-thread buffers to disk. This new feature is a way to tell Lucene to release some of the memory it spends on indexing by flushing only one of the largest per-thread buffers. This means we will write larger index buffers on average, which should make indexing more efficient.
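The difference between the two strategies can be sketched with plain numbers. Everything here is hypothetical (the function names, the buffer names and sizes, and the choice to keep flushing the largest buffer until the total is back under budget); it is meant only to show why flushing the largest buffer first leaves the smaller buffers free to keep growing.

```python
def flush_all(buffers):
    """Old behaviour in this toy model: every non-empty buffer is written."""
    flushed = [(name, size) for name, size in buffers.items() if size > 0]
    for name in buffers:
        buffers[name] = 0
    return flushed


def flush_largest(buffers, budget):
    """Flush the largest buffers first, stopping once under `budget`."""
    flushed = []
    while sum(buffers.values()) > budget:
        largest = max(buffers, key=buffers.get)
        flushed.append((largest, buffers[largest]))
        buffers[largest] = 0
    return flushed


# Sizes are in arbitrary units; the shared budget is 50.
old = {"thread-1": 40, "thread-2": 10, "thread-3": 25}
new = dict(old)
flushed_old = flush_all(old)
flushed_new = flush_largest(new, budget=50)
```

Here the old strategy writes three segments, two of them small, while the finer-grained one writes a single large segment and leaves the other buffers untouched.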

Other

GeoExactCircle may create invalid shapes for planet models with high flattening.
IndexWriter will now allow indexing and flushing to be performed on different threads.
The way that FSTs expand nodes into fixed-size arrays to speed up lookups can make them space-inefficient.
A new Arabic stemmer is being contributed and should provide better relevance.
Running an fsync of the directory metadata before renaming a file sounds useful, even though it's unclear which filesystems it will help.
