This Week in Elasticsearch and Apache Lucene - 2018-07-20
We have merged a new field aliases feature, which lets users define fields of a new alias field type. An alias field has a property, “path”, which points to a different field, and the alias resolves to that field whenever it is used in searches and aggregations. This helps users migrate from old mappings to new ones when only field names are changing: they can add aliases to the old indices under the new field names, so the old indices remain searchable by applications using the new schema.
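As a rough illustration of the migration scenario above, the sketch below builds a mapping body that adds an alias to an old index. The field names (`user_name` in the old index, `username` in the new schema) and the helper `alias_mapping` are hypothetical; only the `alias` type and its `path` property come from the feature described here.

```python
import json


def alias_mapping(alias_name, target_path):
    """Build a mapping body that declares `alias_name` as an alias of `target_path`.

    Hypothetical helper for illustration only: it constructs the request body
    and does not talk to a cluster.
    """
    return {
        "properties": {
            alias_name: {
                "type": "alias",      # the new alias field type
                "path": target_path,  # the existing field the alias resolves to
            }
        }
    }


# Old index stores "user_name"; applications on the new schema query "username".
body = alias_mapping("username", "user_name")
print(json.dumps(body, indent=2))
```

Applying such a body to the old index would let a query on `username` resolve to `user_name` there, while new indices carry a real `username` field.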
The Kerberos work is ready to be merged. The initial implementation will rely on native role mapping from the Kerberos principal name. Looking up users from LDAP/AD will come as a follow-up, based on lookup realms.
This week in the internal Zen2 PoC we merged the node retirement API that allows for safely reducing a 2-node cluster to a 1-node one. This is an important API for Elastic Cloud, where many users use a 1-node cluster which can grow to 2 and back to 1 during upgrades. This is the last missing feature in our PoC.
We are making good progress on the zen2 branch of the public ES repo, which has seen PRs for the addition of terms to the cluster state, the heart of the coordination layer, the beginnings of a testing framework, and a gossip discovery protocol.
Several of our users recently encountered a JVM bug that manifests on machines supporting AVX-512 (e.g., Skylake X) when running Elasticsearch on JDK 10. As our 6.3.0 and 6.3.1 Docker images contain JDK 10, this issue potentially affects many users. A workaround is described on the Elasticsearch GitHub repo.
Low Memory Resiliency
We made an important step forward this week in our efforts to run on smaller heaps. To date, when evaluating whether or not to service a user’s request, Elasticsearch examined memory explicitly reserved by other queries as compared to a total limit (70% of heap). However, these explicit reservations did not accurately capture memory usage. With a rebuilt circuit breaker, Elasticsearch now measures real heap memory usage before servicing the request. The result? Elasticsearch is now more resilient to OutOfMemoryErrors, while at the same time being able to use more of the heap (95% now).
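The difference between the two approaches can be sketched as follows. This is a simplified model, not the Elasticsearch implementation: the class names, the `used_heap_fn` callback, and the byte figures are all assumptions made for illustration. Only the 70% and 95% limits come from the text above.

```python
class ReservationBreaker:
    """Older style: trips on the sum of explicitly reserved bytes only."""

    def __init__(self, heap_bytes, limit_percent=70):
        self.limit = heap_bytes * limit_percent // 100
        self.reserved = 0

    def admit(self, request_bytes):
        # Memory that nobody reserved explicitly is invisible to this check.
        if self.reserved + request_bytes > self.limit:
            return False
        self.reserved += request_bytes
        return True


class RealMemoryBreaker:
    """Newer style: checks measured heap usage before servicing a request."""

    def __init__(self, heap_bytes, used_heap_fn, limit_percent=95):
        self.limit = heap_bytes * limit_percent // 100
        self.used_heap_fn = used_heap_fn  # hypothetical hook returning current heap usage

    def admit(self, request_bytes):
        return self.used_heap_fn() + request_bytes <= self.limit


heap = 1000
# 900 bytes of the heap are in use, but none of it was explicitly reserved.
old_breaker = ReservationBreaker(heap)
new_breaker = RealMemoryBreaker(heap, used_heap_fn=lambda: 900)

old_admits = old_breaker.admit(200)    # admitted, although the heap cannot hold it
new_admits = new_breaker.admit(200)    # rejected: 900 + 200 exceeds the 950 limit
new_admits_small = new_breaker.admit(50)  # admitted: 900 + 50 fits within 950
```

In this model the reservation-based check admits a request that would overflow the heap, while the measurement-based check rejects it yet still allows use of the heap up to the higher limit.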
Fetch tasks are no longer rejected
The search thread pool needs to run two main kinds of tasks: query tasks and fetch tasks. We have pushed a change that force-adds fetch tasks to the queue even when the queue is already full. The reasoning is that fetch tasks can only be follow-ups to query tasks that were already accepted, so the number of additional fetch tasks that may enter the thread pool is expected to be reasonable.
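A minimal sketch of the idea, assuming a simple bounded queue: the class `SearchQueue` and its methods `offer` and `force_put` are hypothetical names chosen for this example, not the actual Elasticsearch thread pool code.

```python
from collections import deque


class SearchQueue:
    """Bounded task queue where fetch tasks may bypass the capacity check."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tasks = deque()

    def offer(self, task):
        """Path for query tasks: reject when the queue is full."""
        if len(self.tasks) >= self.capacity:
            return False
        self.tasks.append(task)
        return True

    def force_put(self, task):
        """Path for fetch tasks: always enqueue, even beyond capacity."""
        self.tasks.append(task)


queue = SearchQueue(capacity=2)
accepted = [queue.offer("query-1"), queue.offer("query-2"), queue.offer("query-3")]
queue.force_put("fetch-for-query-1")  # enqueued although the queue is full
```

Because each fetch task follows a query task that was already accepted, the overshoot beyond `capacity` stays bounded by the number of in-flight queries.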
Rollup capabilities now available on index-level
We have added the ability to request the rollup capabilities for a rollup index in addition to the existing API which lists the rollup capabilities for all jobs targeting a pattern of source indexes (i.e. indexes where the original non-rolled up data lives). This will help the rollup UI determine the fields and aggregations available to Kibana index patterns which include rollup indexes.
Changes in 6.0:
- Fix put mappings java API documentation #31955
Changes in 6.3:
- A replica can be promoted and started in one cluster state update #32042
- Adjust translog after versionType is removed in 7.0 #32020
- Disable C2 from using AVX-512 on JDK 10 #32138
Changes in 6.4:
- Rest HL client: Add put watch action #32026
- Add support for field aliases. #32172
- add support for write index resolution when creating/updating documents #31520
- ECS Task IAM profile credentials ignored in repository-s3 plugin #31864
- Fix rollup on date fields that don’t support epoch_millis #31890
- Revert "Introduce a Hashing Processor (#31087)" #32179
- Call setReferences() on custom referring tokenfilters in _analyze #32157
- Add more contexts to painless execute api #30511
- Fix range queries on _type field for single type indices #31756
- BREAKING: Configurable password hashing algorithm/cost(#31234) #32092
- Handle missing values in painless (#30975) #31903
- Painless: Fix Bug with Duplicate PainlessClasses #32110
- Ensure to release translog snapshot in primary-replica resync #32045
- Check that client methods match API defined in the REST spec #31825
- Update monitoring template version to 6040099 #32088
- Add exclusion option to keep_types token filter #32012
- Add Index UUID to /_stats Response #31871
- Bypass highlight query terms extraction on empty fields #32090
- Core: Backport java time date formatters #31997
- [Rollup] Add new capabilities endpoint based on concrete rollup indices #30401
- SQL: allow LEFT and RIGHT as function names #32066
- Watcher: Store username on watch execution #31873
- Scripting: Remove Dead Code from Painless Module #32064
Changes in 7.0:
- Handle missing values in painless #32207
- Revert "Introduce a Hashing Processor (#31087)" #32178
- Adjust SSLDriver behavior for JDK11 changes #32145
- Remove versionType from translog #31945
- Replace TokenizerFactory with Supplier #32063
- Relax TermVectors API to work with textual fields other than TextFieldType #31915
Low-level highlighting components
One reason (out of many!) why building a highlighter is challenging is that many things need to be configured: what is a good snippet, how much context should be kept around matches, should snippets that occur next to each other be merged, etc. We are exploring building highlighter components that can be used together rather than a full-fledged highlighter. As a start, we are using the matches API to break text up into passages that contain hits.
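To make the configuration questions above concrete, here is a small sketch of one such component: it turns match offsets into passages, with a configurable amount of context and merging of neighbouring snippets. It is an illustration under assumed inputs, not Lucene's matches API or its passage logic; the function name `passages` and the `(start, end)` offset format are hypothetical.

```python
def passages(text, matches, context=10):
    """Return (start, end) windows around matches.

    Each match is widened by `context` characters on both sides, and windows
    that overlap or touch are merged into a single passage.
    """
    windows = []
    for start, end in sorted(matches):
        lo = max(0, start - context)
        hi = min(len(text), end + context)
        if windows and lo <= windows[-1][1]:
            # Adjacent snippets: extend the previous window instead of adding one.
            windows[-1] = (windows[-1][0], max(windows[-1][1], hi))
        else:
            windows.append((lo, hi))
    return windows


text = "a" * 100
result = passages(text, [(10, 15), (20, 25), (80, 85)], context=5)
```

Here the first two matches are close enough to merge into one passage, while the third stays separate. Changing `context` changes both snippet size and how often merging happens, which is exactly the kind of knob a composable component would expose.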
- When sorting by a field, it would be more efficient to run a second search to compute the scores of the top hits than to compute scores at collection time.
- Since computing sort values is cheap, we don't need a boolean to enable it.
- Soft deletes support had an optimization that never kicked in because it relied on outdated assumptions.
- The matches API is getting support for extracting sub matches. This is especially useful for phrase queries with slops in order to know where terms of the query occurred.
- Should we allow term vectors to only store a subset of the terms that exist in the inverted index?
- A missing double cast made TieredMergePolicy's getter for the max segment size return a wrong value.