We have started working on saving pre-aggregated data structures directly in a document. This would allow clients, exporters, APM agents, etc. to pre-aggregate data locally using agreed-upon algorithms (such as HyperLogLog++ for cardinality or HDRHistogram for percentiles) and then index the sketch rather than sending all the raw data. At query time the sketches are loaded and merged, instead of querying raw data.
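To make the "index the sketch, merge at query time" idea concrete, here is a minimal sketch in Python. It uses plain HyperLogLog-style registers, not HyperLogLog++ and not the format Elasticsearch will store; the register count, the hash function, and all function names are assumptions chosen for illustration. It only shows the property that matters here: merging two sketches gives the same registers as sketching the combined raw data.

```python
import hashlib

P = 10                  # index bits (illustrative choice, not an Elasticsearch default)
M = 1 << P              # number of registers
REST_BITS = 64 - P      # bits left after the register index is taken


def _hash64(value: str) -> int:
    """Deterministic 64-bit hash; SHA-1 is used here only for convenience."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def new_sketch() -> list:
    return [0] * M


def add(sketch: list, value: str) -> None:
    h = _hash64(value)
    index = h >> REST_BITS                      # top P bits select a register
    rest = h & ((1 << REST_BITS) - 1)           # remaining bits
    rank = REST_BITS - rest.bit_length() + 1    # 1-based position of the first 1-bit
    sketch[index] = max(sketch[index], rank)


def merge(a: list, b: list) -> list:
    """Merging is a register-wise maximum, so it needs no raw data."""
    return [max(x, y) for x, y in zip(a, b)]


def sketch_of(values) -> list:
    s = new_sketch()
    for v in values:
        add(s, v)
    return s
```

Two clients could each call `sketch_of` on their local values and ship only the register arrays; `merge` on the receiving side then yields exactly what a single sketch over all the values would have held.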

Geo

We have been investigating ways to enhance existing field types in mappings to include custom fields introduced by plugins. We're still chewing through the options, but it looks like we may use multi-fields as a way to enhance the mapping from a plugin.

Cluster privileges for API keys

We have started working on fine-grained cluster privileges for API keys. This enhancement is all about providing finer-grained control over API key creation and invalidation.

Admins can use a cluster privilege, manage_api_key, to control who may create API keys. The enhancement should also let them grant key creation without the ability to invalidate those keys, and let them restrict API key invalidation to keys owned by the user who holds the privilege.

We discussed two options: the first is to add a separate REST endpoint for invalidating or retrieving the API keys of the authenticated user; the second is more along the lines of object-level security. We have decided to go with adding a separate endpoint.

Enrich documents at ingest time (#32789)

An enrich policy defines how, what, and how often to synchronize fields from a source index (a normal index) to a .enrich index (a specialized index used for enrichments). The first iteration of the enrich policy has been merged (#41003).

The policy runner reads the enrich policy and uses re-indexing and aliases to implement the synchronization of data. (#41088)
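The reindex-and-alias pattern described above can be pictured with a small in-memory model. This is only a sketch: the `cluster` dictionary, the policy keys, the `.enrich-<policy>-<generation>` naming, and `run_policy` are all hypothetical and are not the actual enrich implementation or any Elasticsearch API.

```python
def run_policy(cluster: dict, policy: dict, generation: int) -> str:
    """Copy the policy's fields into a fresh index, then repoint the alias.

    `cluster` is a toy stand-in: {"indices": {name: [docs]}, "aliases": {alias: name}}.
    """
    source_docs = cluster["indices"][policy["source_index"]]

    # "Reindex": build a new index holding only the fields the policy asks for.
    new_index = f".enrich-{policy['name']}-{generation}"   # hypothetical naming
    cluster["indices"][new_index] = [
        {field: doc[field] for field in policy["fields"] if field in doc}
        for doc in source_docs
    ]

    # Alias swap: readers keep using one stable name and only ever see a
    # fully built index, never one that is half way through being populated.
    alias = f".enrich-{policy['name']}"                    # hypothetical naming
    cluster["aliases"][alias] = new_index
    return new_index
```

Running the policy again with a higher generation builds a second index and moves the alias to it, which is one way the "how often" part of a policy could be realized.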

Enrichment processors will be implemented as plugins. We are working on the changes necessary to expose the .enrich index to a plugin for optimal performance. The enrich processor(s) will leverage Lucene-level interactions against data-local, single-segment shards. (#41010)

Snapshot and Restore UI

We merged the snapshots table and detail panel, along with some nice UX improvements to the repositories UI. We also worked on adding client-side validation to the repository form and added a link from a repository's details to its filtered snapshots.

Networking

We finished up the unification of our network settings in #36652. The settings have been completely removed from 8.0 and deprecated in 7.x with appropriate breaking changes documentation.

We also opened a pull request, #41283, to remove the dedicated TLS write buffer in our transport-nio TLS implementation. The fundamental problem here is that we allocate a single network write buffer for each channel. When we have data to send to the network, we fill the write buffer with the encrypted data and then flush. This requires constantly context switching back and forth between flushing and encrypting. This is the approach that Kafka takes for its TLS implementation, but it is inferior to the approach that our existing Netty-provided implementation takes (encrypting all at once into multiple buffers and then flushing). In this PR, every encrypted network buffer is freshly allocated. The next steps after this PR is merged are to support buffer recycling and to remove the single read buffer.

We worked on #40866 to move the initial handling of bulk requests off the transport thread and onto a WRITE thread. The problem here was that bulk requests can be large enough that even parsing them on the transport thread is too expensive: it could take ~100ms to parse and dispatch the resulting tasks to each shard, which introduces significant and unwanted latency into network request handling. This was discovered during an investigation into other slow activities taking place on transport threads in #39658 which were leading to unstable tests.
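The hand-off itself is a familiar pattern, sketched below in Python rather than in Elasticsearch's Java code. Everything here is an assumption for illustration: the pool named `write`, the simplified one-document-per-line payload (the real bulk format also carries action lines), and routing by `id % NUM_SHARDS`. The point is only that the network-facing function submits the work and returns immediately.

```python
import json
import threading
from concurrent.futures import Future, ThreadPoolExecutor

NUM_SHARDS = 3  # illustrative value
write_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="write")


def parse_and_route(payload: str) -> dict:
    """Expensive part: parse every line and group the documents per shard."""
    by_shard = {}
    for line in payload.splitlines():
        if not line.strip():
            continue
        doc = json.loads(line)
        by_shard.setdefault(doc["id"] % NUM_SHARDS, []).append(doc)
    # Record which thread did the work so the hand-off is observable.
    return {"thread": threading.current_thread().name, "by_shard": by_shard}


def on_bulk_request(payload: str) -> Future:
    """Called on the (simulated) transport thread: hand off, do not parse here."""
    return write_pool.submit(parse_and_route, payload)
```

Because `on_bulk_request` does no parsing, the thread that reads from the network is free to serve other requests while a large payload is being processed on the pool.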

Apache Lucene

Lucene master/9.0 requires Java 11

Lucene 9.0 is not expected to be released for at least a year, which makes requiring Java 11 a conservative, non-controversial choice. Upcoming 8.x releases will keep Java 8 as the minimum version requirement. We can now start using features of Java 9, 10 or 11 for Lucene 9, which we started doing by simplifying the initialization of some static maps using Java 10's Map#copyOf. Likewise, we could now enable Javadoc search support.

Lucene 8.1

We started a thread about releasing Lucene 8.1. Even though Lucene 8.0 was released on March 14th, which is pretty recent, the feature freeze for 8.0 started on January 29th, so we have accumulated a lot of good changes:

much faster merging of KD trees,
Luke has become a Lucene module,
better defaults regarding when to load the terms dictionary off-heap,
flattening of disjunctions in disjunctions, which helps WAND do a better job at speeding up queries.

The proposed timeline is to cut the branch on April 30th and cut the first release candidate a couple of days later.
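As an illustration of the disjunction flattening listed among the 8.1 changes, here is a minimal sketch over a toy query representation. The tuple encoding (`("OR", clause, ...)`, with plain strings as leaf terms) and the function name are assumptions made for this example; Lucene's actual rewrite works on its own query classes and has more conditions to respect.

```python
def flatten_or(query):
    """Rewrite nested disjunctions, e.g. (A OR (B OR C)) -> (A OR B OR C)."""
    if isinstance(query, str):          # a leaf term
        return query
    op, *clauses = query
    flat = []
    for clause in clauses:
        clause = flatten_or(clause)     # flatten bottom-up
        if op == "OR" and not isinstance(clause, str) and clause[0] == "OR":
            flat.extend(clause[1:])     # inline the child's clauses into the parent
        else:
            flat.append(clause)
    return (op, *flat)
```

With this encoding, `flatten_or(("OR", "A", ("OR", "B", "C")))` yields a single three-clause disjunction, which gives a WAND-style scorer one flat list of clauses to reason about instead of a nested tree.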

Follow-ups to block-max WAND

Even though we already knew that our implementation of block-max WAND still had room for improvement, the results of the release benchmarks on some tracks made us want to take a closer look. For instance, if you look at the ones for NOAA, some queries, like the conjunction of a term query and a range query that both match lots of documents, became slower in spite of now being able to skip counting hits. There is a different case on the Lucene benchmarks when running disjunctions within conjunctions. We spent some time collecting information about things that need to be improved:

Flattening of boolean queries, i.e. rewriting (A OR (B OR C)) as (A OR B OR C). We just implemented it for disjunctions, which triggered speedups in the 10-80% range for the queries we tested. The same could be done for conjunctions.
Two-phase iteration support. The new block-max (W)AND scorers don't leverage two-phase iterators, so they are inefficient when applied to queries that produce two-phase iterators such as script queries or phrase queries.
More accurate estimation of minimum scores. Say your query is (A AND B), the minimum score for the query is 4 and the maximum score for A is 3. This means that B should produce scores that are greater than or equal to 1, right? Floating-point arithmetic makes it a bit more complicated than that: if you required that B produce scores that are greater than or equal to 1, you could miss hits. The actual minimum score that B must produce is 0.9999999 (single-precision float) because 0.9999999+3 is equal to 4 as well. We had made simplifications until now to avoid dealing with this issue, but it seems to be what is hurting the NOAA track, so we will have to dig.
Specialization of postings readers. Postings allow you to optionally retrieve documents, frequencies, positions, offsets and payloads. Being able to optimize for the case that only documents and frequencies are requested helps run queries faster. However, today we only do this when counts are needed; maybe we should start looking into specializing the codec for top-k retrieval too. This is also related to a change that we have been working on, lazy loading of term frequencies.
Reconsider the encoding of postings. Our current implementation uses FOR (Frame-of-reference) based on experiments that we made years ago. Given that WAND doesn't stress postings the same way and that a lot of time has passed since we last experimented, it might be worth exploring alternatives again.
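The floating-point point in the minimum-score item above can be checked numerically. The sketch below emulates single-precision arithmetic in Python by rounding through `struct`; the helper names and the search loop are assumptions for illustration and are not how Lucene computes this bound.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def prev_f32(x: float) -> float:
    """Largest single-precision value strictly below a positive x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits - 1))[0]


def add_f32(a: float, b: float) -> float:
    """Single-precision addition: add exactly in double, then round once."""
    return f32(f32(a) + f32(b))


def min_required_score(min_total: float, max_other: float) -> float:
    """Smallest float32 score b such that max_other + b still reaches min_total.

    Starts from the naive bound (min_total - max_other) and steps down one
    float32 value at a time while the rounded sum is still competitive.
    Assumes the naive bound is positive.
    """
    b = f32(min_total - max_other)
    while add_f32(max_other, prev_f32(b)) >= min_total:
        b = prev_f32(b)
    return b
```

For the example in the text, `min_required_score(4.0, 3.0)` lands on a value just below 1.0 (the float32 value that prints as roughly 0.9999999), because the sum with 3 rounds up to exactly 4 in single precision. Requiring B to score at least 1 would therefore wrongly discard such hits.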

Luwak donation

The Luwak authors are donating Luwak to Lucene. Luwak makes it possible to find the queries that match a given document, similarly to Elasticsearch's percolator.

Other

We suggested that we have a query that can run a conjunction or disjunction of multiple ranges on the same field.

There is discussion on how to deal with points that fall on the edge of a polygon and whether they should be considered inside or outside of the polygon.

Changes

Elasticsearch

Changes in 8.0:

BREAKING: Reindex from Remote encoding #41007
Upgrade to lucene 8.1.0-snapshot-e460356abe #40952

Changes in 7.1:

Fix unmapped field handling in the composite aggregation #41280
more_like_this query to throw an error if the like fields is not provided #40632
Handle Bulk Requests on Write Threadpool #40866
[Rollup] Validate timezones based on rules not string comparison #36237
Clean up Node#close. #39317
fix the packer cache script #41183
Fix range query edge cases #41160
Expand beats_system role privileges #40876
Better error messages when pipelines reference incompatible aggs #40068
[ML DataFrame] Data Frame stop all #41156
Clarify some ToXContent implementations behaviour #41000
Add deprecation check for ILM poll interval <1s #41096
Improve error message when polygons contains twice the same point in no-consecutive position #41051

Changes in 7.0:

Mark searcher as accessed in acquireSearcher #41335
Fix issue with subproject test task dependencies #41321
Fix error applying ignore_malformed to boolean values #41261
ProfileScorer should propagate setMinCompetitiveScore. #40958
Validate cluster UUID when joining Zen1 cluster #41063
BlendedTermQuery should ignore fields that don't exist in the index #41125
Use environment settings instead of state settings for Watcher config #41087

Changes in 6.7:

Unified highlighter should ignore terms that target the _id field #41275
SQL: Predicate diff takes into account all values #41346
Fix Broken Index Shard Snapshot File Preventing Snapshot Creation #41310
Always check for archiving broken index settings #41209
Do not create missing directories in readonly repo #41249
SQL: Allow current_date/time/timestamp to be also used as a function escape pattern #41254
put mapping authorization for alias with write-index and multiple read indices #40834
SQL: Fix LIMIT bug in agg sorting #41258
SQL: Translate MIN/MAX on keyword fields as FIRST/LAST #41247
SQL: Tweak pattern matching in SYS TABLES #41243
Unified highlighter should respect no_match_size with number_of_fragments set to 0 #41069
Use alias name from rollover request to query indices stats #40774
Full text queries should not always ignore unmapped fields #41062

Changes in Elasticsearch Management UI

Changes in 7.1:

adding spec to console utility as Kibana script #35232
adding data_frame autocomplete rules to dev console #35086

Changes in Elasticsearch SQL ODBC Driver

Changes in 7.1:

Introduce support for the TIME data type conversions #144
DST fixes unit test #145
Apply DST #142
Introduce auto-escaping of pattern arguments in catalog functions #143

Changes in Rally

Changes in 1.1.0:

Store mean for response-related metrics #683
Use single node discovery type if suitable #681

Changes in Rally Tracks

Allow to override index settings in all cases #72
Update target throughput #73
Allow to override number of shards for updates #71

This Week in Elasticsearch and Apache Lucene - 2019-04-21