11 December 2017

This Week in Elasticsearch and Apache Lucene - 2017-12-11

By Clinton Gormley and Adrien Grand

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Soft limit in Aggregations for max number of buckets

A new search.max_buckets dynamic cluster setting in 7.0 tracks the number of buckets created in an aggregation and fails the request if this number reaches the limit.

The default for 7.0 is 10,000, which means that any request that tries to return more than this number of buckets will fail. The number of buckets is checked on the coordinating node and also on each shard when we build the response. This setting should help reject bad requests that were not caught by the circuit breaker but could still push memory use over the limit.

This setting will also be present in 6.2, but it will be disabled by default. Requests in this version that hit the 7.0 default limit (10,000) will log a deprecation warning in order to prepare users for the migration to 7.0.
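Since search.max_buckets is a dynamic cluster setting, it can be changed at runtime through the cluster update settings API (PUT /_cluster/settings). The following Python sketch only builds the request body; the helper name max_buckets_update is hypothetical, and sending the request is left to whichever HTTP client you use.

```python
import json


def max_buckets_update(limit, persistent=True):
    """Build a cluster-settings body that sets search.max_buckets.

    Hypothetical helper for illustration: `persistent` chooses between the
    "persistent" and "transient" sections of the cluster settings API.
    """
    scope = "persistent" if persistent else "transient"
    return {scope: {"search.max_buckets": limit}}


# Body for: PUT /_cluster/settings
print(json.dumps(max_buckets_update(10000), indent=2))
```

Passing persistent=False places the setting in the transient section instead, which does not survive a full cluster restart.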

Mappings now filtered by X-Pack Security

X-Pack security will now filter the mappings fields returned by get index, get mappings, get field mappings and field capabilities APIs. This means that fields that a user cannot access due to field-level security will no longer be returned from these APIs.
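As a rough illustration of the effect, the Python sketch below filters a flat set of mapping fields against the field patterns granted to a user. It is a simplification under stated assumptions, not X-Pack's implementation: the function name filter_mapping is hypothetical, fields are treated as flat dotted names, and the standard library's fnmatch stands in for the wildcard matching done by field-level security.

```python
from fnmatch import fnmatchcase


def filter_mapping(properties, granted_patterns):
    """Keep only the fields whose name matches at least one granted pattern.

    `properties` maps field names to their mapping definitions.
    `granted_patterns` holds names or wildcard patterns such as "user.*".
    """
    return {
        name: definition
        for name, definition in properties.items()
        if any(fnmatchcase(name, pattern) for pattern in granted_patterns)
    }


properties = {
    "title": {"type": "text"},
    "salary": {"type": "long"},
    "user.name": {"type": "keyword"},
}

# A user granted only "title" and "user.*" no longer sees "salary".
print(filter_mapping(properties, ["title", "user.*"]))
```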

Removing wildcard and archived settings

Since 5.0, it has been possible to remove index or cluster settings by setting them to null. This should have worked for wildcard settings too (e.g. foo.bar.*), but a bug prevented it from working correctly. In addition, unknown settings from 2.x indices or clusters were moved to the archived.* namespace, but it was impossible to delete archived.* settings from indices because the key was rewritten to index.archived.*. Both of these issues have been fixed in 5.6.
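Resetting a setting amounts to sending its key with a JSON null value to the update settings API. Here is a minimal Python sketch that builds such a body; reset_settings_body is a hypothetical helper, and foo.bar.* is the placeholder key used above rather than a real setting.

```python
import json


def reset_settings_body(*keys):
    """Map every given setting key (wildcards allowed) to None (JSON null)."""
    return {key: None for key in keys}


# Body for an update settings request, e.g. PUT /<index>/_settings
print(json.dumps(reset_settings_body("foo.bar.*", "archived.*")))
```

For cluster-level settings, the same key/null pairs would go inside the persistent or transient section of a PUT /_cluster/settings body.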

JVM Options syntax change

The jvm.options file syntax has changed in 6.2 in order to accommodate a breaking change in command-line arguments in Java 9. Each option now needs to be preceded by a JVM major version or version range.
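To show how version-prefixed options could be resolved, here is a small Python sketch that selects the options applying to a given JVM major version. It assumes three prefix forms: n: (exactly version n), n-: (version n and later) and n-m: (versions n through m). The function name options_for is hypothetical, and passing through lines that start with - unchanged is an assumption of this sketch, not something stated above.

```python
import re

# <lower>[-[<upper>]]:<option>, e.g. "8:-Xmx2g", "9-:-Xlog:gc*", "8-9:-Xmx2g"
_VERSIONED = re.compile(r"^(\d+)(?:(-)(\d+)?)?:(-.*)$")


def options_for(lines, jvm_major):
    """Return the JVM options from `lines` that apply to `jvm_major`."""
    selected = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("-"):
            selected.append(line)  # assumption: no prefix means "all versions"
            continue
        match = _VERSIONED.match(line)
        if match is None:
            raise ValueError("unrecognised jvm.options line: %r" % raw)
        lower = int(match.group(1))
        if match.group(2) is None:
            upper = lower  # "n:" applies to exactly version n
        elif match.group(3) is None:
            upper = None  # "n-:" applies to version n and later
        else:
            upper = int(match.group(3))  # "n-m:" is an inclusive range
        if jvm_major >= lower and (upper is None or jvm_major <= upper):
            selected.append(match.group(4))
    return selected


example = [
    "# GC logging",
    "8:-XX:+PrintGCDetails",
    "9-:-Xlog:gc*",
    "8-9:-Xmx2g",
]
print(options_for(example, 8))
print(options_for(example, 9))
```

With this input, the Java 8 run picks up the legacy GC flag while the Java 9 run picks up the unified logging flag, which is the kind of split the new syntax is meant to express.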

Changes in 5.6:

  • Fix routing with leading or trailing whitespace #27712
  • Allow index settings to be reset by wildcards #27671
  • Prevent constructing an index template without index patterns #27662
  • X-Pack:
    • Grant Netty necessary permissions #3247
    • Watcher: Ensure watcher thread pool size is reasonably bound #3056
    • Watcher: Fix pagerduty action to send context data #3185

Changes in 6.0:

  • Only fsync global checkpoint if needed #27652
  • Obey translog durability in global checkpoint sync #27641
  • X-Pack:
    • Do not enforce TLS if discovery type is single-node #3245

Changes in 6.1:

  • Correct two equality checks on incomparable types #27688
  • Include internal refreshes in refresh stats #27615
  • Catch InvalidPathException in IcuCollationTokenFilterFactory #27202
  • X-Pack:
    • [Security] Don’t deprecate certgen in 6.1 #3201

Changes in 6.2:

  • Do not open indices with broken settings #26995
  • Implement byte array reusage in NioTransport #27696
  • Cleanup split strings by comma method #27715
  • Add read timeouts to http module #27713
  • Add missing s to tmpdir name #27721
  • Add Open Index API to the high level REST client #27574
  • Added Create Index support to high-level REST client #27351
  • Fix issue where the incorrect buffers are written #27695
  • Extend JVM options to support multiple versions #27675
  • Introduce resizable inbound byte buffer #27551
  • Add a new cluster setting to limit the total number of buckets returned by a request #27581
  • Add validation of keystore setting names #27626
  • Add support for filtering mappings fields #27603
  • [Geo] Add Well Known Text (WKT) Parsing Support to ShapeBuilders #27417
  • Add node name to thread pool executor name #27663
  • Simplify rejected execution exception #27664
  • Add msearch api to high level client #27274
  • Detect mktemp from coreutils #27659
  • Add explicit coreutils dependency #27660
  • Fix term vectors generator with keyword and normalizer #27608
  • Fix highlighting on a keyword field that defines a normalizer #27604
  • Tighten the CountedBitSet class - Followup of #27547 #27632
  • X-Pack:
    • [Security] Make generated passwords safe to be used in shell scripts #3253
    • Add API for SSL certificate information #3088

Changes in 7.0:

  • X-Pack:
    • Filter mappings fields when field level security is configured #3173

Apache Lucene

Lucene 7.2.0

We are fixing some last-minute bugs before building the first release candidate. We should hopefully have a release in the coming weeks.

Dynamic pruning of non-competitive hits

LUCENE-4100 was merged, allowing for great speedups when only the top matches are needed. This optimization requires scorers to expose a maximum score that they may contribute and currently only works well with BM25, but we are working on fixing other similarities to work well with this optimization and improving their explanations.

We are also looking into how we could record the per-term maximum term frequencies, or maybe even the per-norm per-term maximum term frequencies, in order to get better upper bounds on the produced scores, which would in turn make the optimization more efficient.

It turns out the same API could be used to speed up phrase queries, when the term frequency of the most frequent term is not high enough to produce a competitive score.
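The core idea of pruning non-competitive hits can be sketched in a few lines of Python. This is a toy model and not Lucene's implementation: the postings layout, the top_k function and the precomputed per-document scores are all assumptions made for illustration, and Lucene skips inside the postings lists rather than visiting every document. The sketch assumes non-negative scores and k of at least 1.

```python
import heapq


def top_k(postings, k, prune=True):
    """Return (hits, scored) for a disjunction over all terms in `postings`.

    `postings` maps term -> {doc_id: score contribution of that term}.
    `hits` is the top-k list of (score, doc_id), best first.
    `scored` counts how many documents were fully scored.
    """
    # Upper bound on what each term can contribute to any document.
    max_score = {term: max(docs.values()) for term, docs in postings.items()}
    doc_ids = sorted({doc for docs in postings.values() for doc in docs})

    heap = []  # min-heap holding the current top-k (score, doc_id) pairs
    scored = 0
    for doc in doc_ids:
        terms = [term for term, docs in postings.items() if doc in docs]
        if prune and len(heap) == k:
            bound = sum(max_score[term] for term in terms)
            if bound <= heap[0][0]:
                continue  # cannot beat the weakest current hit: skip scoring
        score = sum(postings[term][doc] for term in terms)
        scored += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    hits = sorted(heap, key=lambda hit: (-hit[0], hit[1]))
    return hits, scored


postings = {
    "lucene": {1: 2.0, 2: 0.5, 3: 0.4, 4: 1.9},
    "search": {1: 1.0, 3: 0.2, 5: 0.3, 6: 0.1},
}
print(top_k(postings, k=1, prune=True))   # scores 1 document
print(top_k(postings, k=1, prune=False))  # scores all 6 documents
```

Both calls return the same top hit; the pruned run simply does less scoring work. Tighter per-term bounds, such as the ones derived from recorded maximum term frequencies mentioned above, would let the bound check reject more documents.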

Finer-grained flushing

At index time, documents are first buffered in an in-memory index buffer. Seeing it as a single buffer is a bit of a simplification though: there is actually a set of index buffers. This helps with concurrency since different threads can write to different index buffers concurrently. When refreshing, each index buffer writes a segment.

In order to make multi-tenancy easy, Elasticsearch adds an abstraction layer on top of Lucene called IndexingMemoryController. It tries to let each shard use as much memory as possible for these buffers at index time, because that makes indexing faster, while also ensuring that the total amount of memory spent on index buffers across shards does not exceed indices.memory.index_buffer_size.

The issue is that when this shared limit is reached, Elasticsearch tells Lucene to do a refresh, which writes _all_ per-thread buffers to disk. This new feature is a way to tell Lucene to release some of the memory it spends on indexing, only flushing one of the largest per-thread buffers. This means we will write larger index buffers on average, which should make indexing more efficient.
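The difference between the two strategies can be illustrated with a toy Python model in which each per-thread buffer is represented only by its size. The function names refresh_all and flush_largest are hypothetical, and the model ignores everything except buffer sizes; it is a sketch of the idea, not of the Lucene API.

```python
def refresh_all(buffers):
    """Old behaviour: write every non-empty buffer out as its own segment.

    Returns (segment sizes written, buffer sizes afterwards).
    """
    segments = [size for size in buffers if size > 0]
    return segments, [0] * len(buffers)


def flush_largest(buffers):
    """New behaviour: write only the largest buffer and keep the others.

    Assumes `buffers` is non-empty.
    """
    largest = max(range(len(buffers)), key=buffers.__getitem__)
    remaining = list(buffers)
    remaining[largest] = 0
    return [buffers[largest]], remaining


buffers = [40, 10, 30, 20]  # buffer sizes in MB, one per indexing thread

print(refresh_all(buffers))    # four segments, including small ones
print(flush_largest(buffers))  # one 40 MB segment, others keep filling
```

In this model the full refresh produces segments averaging 25 MB, while flushing only the largest buffer produces a single 40 MB segment and lets the smaller buffers keep growing before they are written.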

Other