This Week in Elasticsearch and Apache Lucene - 2018-09-07
Google Summer of Code
We merged the last of the work from Sohaib, our Google Summer of Code student. This was an incredibly impressive showing, with Sohaib completing eleven High Level REST Client APIs over the summer! Sohaib's mentors spoke very highly of him throughout this process. Thank you for your contributions, Sohaib!
We’re continuing improvements to our Kerberos support. Firefox has a long-standing bug where the browser may not use the most secure authentication challenge it receives, instead using the first WWW-Authenticate challenge it receives when there is more than one such header present. To work around this issue, we have added auth scheme sorting. The initial Kerberos support relies on role mapping, which can utilize metadata to map roles, so we have added support for adding the realm and full user principal name as metadata fields.
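As a rough illustration of auth scheme sorting, the sketch below orders WWW-Authenticate challenges so that a browser which picks the first header it sees picks the preferred one. The preference order and the function name `sort_challenges` are assumptions made for this example, not the actual implementation.

```python
# Hypothetical preference order, most preferred scheme first.
# The real order used by Elasticsearch is not stated in this post.
SCHEME_PREFERENCE = ["negotiate", "bearer", "basic"]


def sort_challenges(challenges):
    """Return WWW-Authenticate challenge values sorted by scheme preference.

    Unknown schemes sort last; ties keep their original relative order
    because sorted() is stable.
    """
    def rank(challenge):
        parts = challenge.split(None, 1)
        scheme = parts[0].lower() if parts else ""
        try:
            return SCHEME_PREFERENCE.index(scheme)
        except ValueError:
            return len(SCHEME_PREFERENCE)

    return sorted(challenges, key=rank)


if __name__ == "__main__":
    headers = ['Basic realm="security"', "Negotiate"]
    for value in sort_challenges(headers):
        print("WWW-Authenticate:", value)
```

With this ordering, a client that only honors the first challenge would attempt Negotiate (Kerberos) before Basic.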
We have also completed the first integral part of auto-follow patterns, which allow users to define index patterns that automatically trigger the creation of a replication relationship. Whenever an index matching the user-defined wildcard pattern is created on a remote cluster, a corresponding index will be created on the following cluster and the replication relationship will be established. This feature will allow anyone with time-based use cases (e.g., logging) to ensure that rollover/creation of new indices plays well with CCR.
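The matching step behind auto-follow can be sketched as follows. This is only a model of the idea: the function name `indices_to_follow` and its arguments are hypothetical, and Python's `fnmatchcase` understands more wildcard syntax (`?`, `[...]`) than a simple `*` pattern.

```python
from fnmatch import fnmatchcase


def indices_to_follow(patterns, leader_indices, already_followed):
    """Return leader index names that match a pattern and are not yet followed.

    patterns: wildcard patterns defined by the user (e.g. "logs-*")
    leader_indices: index names currently present on the remote cluster
    already_followed: set of index names that already have a follower
    """
    return [
        name
        for name in leader_indices
        if name not in already_followed
        and any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


if __name__ == "__main__":
    new = indices_to_follow(
        ["logs-*"],
        ["logs-2018.09.06", "logs-2018.09.07", "metrics-1"],
        {"logs-2018.09.06"},
    )
    print(new)
```

Running such a check periodically against the remote cluster's index list is one way a newly rolled-over index could be picked up without user action.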
We upgraded the master branch of Elasticsearch to a snapshot of Lucene 8. This is important to Elasticsearch as it allows us to start tracking any Lucene-induced bugs or performance regressions. More work remains to be done in order to leverage new features of Lucene 8.
Early termination for aggregations
We have been working on adding early termination to aggregations. This lets an aggregation take a shortcut when it does not need every document in the current segment: collection for that segment can terminate early instead of wasting time on documents that will not change the aggregation results. The first aggregations to take advantage of this are the min and max aggregations. If the query matches all documents, they skip collecting individual documents for a segment and instead consult the BKD tree to extract the min/max value for that segment.
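The shortcut can be modeled with a small sketch. Here a precomputed `min_value` on each segment stands in for the value that the BKD tree exposes; the `Segment` class and `min_aggregation` function are hypothetical names for illustration only.

```python
class Segment:
    """Toy segment: one numeric value per document."""

    def __init__(self, values):
        self.values = list(values)
        # Stands in for per-segment metadata available from the BKD tree.
        self.min_value = min(self.values) if self.values else None


def min_aggregation(segments, predicate=None):
    """Return (minimum, documents_visited).

    predicate=None models a query that matches all documents, which is
    the case where per-document collection can be skipped entirely.
    """
    result = None
    visited = 0
    for segment in segments:
        if predicate is None:
            # Early termination: no documents are collected for this segment.
            candidate = segment.min_value
        else:
            visited += len(segment.values)
            matching = [v for v in segment.values if predicate(v)]
            candidate = min(matching) if matching else None
        if candidate is not None and (result is None or candidate < result):
            result = candidate
    return result, visited


if __name__ == "__main__":
    segments = [Segment([5, 3, 9]), Segment([7, 2])]
    print(min_aggregation(segments))                    # match-all query
    print(min_aggregation(segments, lambda v: v > 2))   # filtered query
```

The filtered case still has to visit every document, which is why the shortcut only applies when the query matches everything in the segment.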
We added the ability to drop documents in a processor. This feature, inspired by the drop filter in Logstash, has been requested for some time. It enables users to add a processor to a pipeline that can drop documents for any reason of their choosing. A typical use case is to drop documents based on a conditional.
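To show how a conditional drop fits into a pipeline, here is a minimal simulation. It does not use the Elasticsearch ingest API; `drop_if`, `run_pipeline`, and the sample documents are all hypothetical.

```python
# Sentinel returned by a processor to signal that the document is dropped.
DROP = object()


def drop_if(condition):
    """Build a processor that drops any document for which condition is true."""
    def processor(doc):
        return DROP if condition(doc) else doc
    return processor


def lowercase_field(field):
    """Build a processor that lowercases one string field, if present."""
    def processor(doc):
        if isinstance(doc.get(field), str):
            doc = dict(doc, **{field: doc[field].lower()})
        return doc
    return processor


def run_pipeline(processors, doc):
    """Run processors in order; return None as soon as one drops the document."""
    for processor in processors:
        doc = processor(doc)
        if doc is DROP:
            return None
    return doc


if __name__ == "__main__":
    pipeline = [
        drop_if(lambda d: d.get("network") == "Guest"),
        lowercase_field("user"),
    ]
    print(run_pipeline(pipeline, {"network": "Guest", "user": "ALICE"}))
    print(run_pipeline(pipeline, {"network": "Corp", "user": "BOB"}))
```

A dropped document skips every later processor and is never indexed, which is what distinguishes dropping from simply failing a processor.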
We made an approximately 40% performance improvement to the geoip processor. This was a typical memory/performance tradeoff: a full cache grew from approximately 215 KB to 6 MB so that cache hits on IP-to-geo lookups no longer need to be deserialized.
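The tradeoff can be sketched as an LRU cache that stores already-deserialized results rather than raw bytes. This is an assumption about the general technique, not the processor's actual code; `DeserializedGeoCache` and the JSON-encoded sample records are hypothetical.

```python
import json
from collections import OrderedDict


class DeserializedGeoCache:
    """LRU cache that keeps parsed lookup results keyed by IP address."""

    def __init__(self, raw_lookup, max_entries=1000):
        self._raw_lookup = raw_lookup      # callable: ip -> serialized record
        self._max_entries = max_entries
        self._cache = OrderedDict()
        self.deserializations = 0          # counts the expensive parse step

    def get(self, ip):
        if ip in self._cache:
            # Cache hit: return the parsed object without deserializing again.
            self._cache.move_to_end(ip)
            return self._cache[ip]
        self.deserializations += 1
        value = json.loads(self._raw_lookup(ip))
        self._cache[ip] = value
        if len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)  # evict least recently used entry
        return value


if __name__ == "__main__":
    raw = {"8.8.8.8": '{"country_iso_code": "US"}'}
    cache = DeserializedGeoCache(raw.__getitem__, max_entries=10)
    cache.get("8.8.8.8")
    cache.get("8.8.8.8")
    print(cache.deserializations)
```

Parsed objects take more memory than their serialized form, which matches the direction of the memory increase described above.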
SQL now supports multi-index patterns
We added support for querying multiple indices in SQL using the multi-index patterns supported in the ES APIs. This means SQL users will now be able to perform queries like SELECT * FROM "tes*,-test*".
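To make the include/exclude semantics of a pattern such as "tes*,-test*" concrete, here is a simplified resolver. It is a sketch that assumes parts are applied left to right and that a leading "-" removes previously selected names; the function `resolve_index_pattern` is hypothetical and does not cover every rule of the real index name resolution.

```python
from fnmatch import fnmatchcase


def resolve_index_pattern(expression, indices):
    """Resolve a comma-separated include/exclude expression to index names."""
    selected = []
    for part in expression.split(","):
        part = part.strip()
        if not part:
            continue
        if part.startswith("-"):
            # Exclusion: remove already selected names matching the pattern.
            pattern = part[1:]
            selected = [n for n in selected if not fnmatchcase(n, pattern)]
        else:
            # Inclusion: add matching names, preserving first-seen order.
            for name in indices:
                if fnmatchcase(name, part) and name not in selected:
                    selected.append(name)
    return selected


if __name__ == "__main__":
    names = ["tes1", "test1", "tesla", "other"]
    print(resolve_index_pattern("tes*,-test*", names))
```

In this example the query would read from the indices that start with "tes" but not from those that start with "test".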
Breaking change in networking settings
Remote cluster connections now serve more than cross-cluster search (for example, cross-cluster replication will use the same infrastructure), so we have merged a change to rename the search.remote.* settings to cluster.remote.*. This change will appear in 6.5.0, with the search.remote.* settings being deprecated. We will fall back from cluster.remote.* settings to search.remote.* settings at runtime for the life of 7.x so that users' existing configurations will continue to work. The fallback will be removed in 8.0.0. The long support for the old settings is needed because a user could perform a full cluster restart from < 6.5.0 to >= 7.0.0. Additionally, we will automatically upgrade to cluster.remote.* any search.remote.* settings that are in the cluster state or set on dynamic cluster settings updates.
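The fallback and the automatic upgrade described here can be modeled on a plain dictionary of settings. The helper names `get_remote_setting` and `upgrade_settings` are hypothetical; only the two setting prefixes come from the change itself.

```python
OLD_PREFIX = "search.remote."
NEW_PREFIX = "cluster.remote."


def get_remote_setting(settings, suffix):
    """Read cluster.remote.<suffix>, falling back to search.remote.<suffix>."""
    new_key = NEW_PREFIX + suffix
    if new_key in settings:
        return settings[new_key]
    return settings.get(OLD_PREFIX + suffix)


def upgrade_settings(settings):
    """Return a copy with search.remote.* keys renamed to cluster.remote.*.

    If both the old and the new key are present, the new key's value wins.
    """
    upgraded = {}
    for key, value in settings.items():
        if key.startswith(OLD_PREFIX):
            upgraded.setdefault(NEW_PREFIX + key[len(OLD_PREFIX):], value)
        else:
            upgraded[key] = value
    return upgraded


if __name__ == "__main__":
    old_style = {"search.remote.cluster_one.seeds": "10.0.0.1:9300"}
    print(get_remote_setting(old_style, "cluster_one.seeds"))
    print(upgrade_settings(old_style))
```

The read-time fallback covers static configuration files that still use the old names, while the upgrade step covers settings persisted in the cluster state.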
Changes in 6.4:
- Fix IndexMetaData loads after rollover #33394
- Core: Fix IndicesSegmentResponse.toXcontent() serialization #33414
- Allow query caching by default again #33328
- Null completion field should not throw IAE #33268
- [Rollup] Log deprecation if config and request intervals are mixed #33284
- [Rollup] Fix Caps Comparator to handle calendar/fixed time #33336
Changes in 6.5:
- [CCR] Added auto follow patterns feature #33118
- HLRest: add xpack put user API #32332
- [SECURITY] Set Auth-scheme preference #33156
- MockTcpTransport to connect asynchronously (#28203) #33476
- [ingest] geo-ip performance improvements #33029
- 6.x - HLRC: Add ILM Status to HLRC (#33283) #33448
- [ingest] 6.x backport - geo-ip performance improvements #33446
- Acquire searcher on closing engine should throw AlreadyClosedException #33331
- BREAKING: Fix generics in ScriptPlugin#getContexts() #33426
- Add interval response parameter to AutoDateInterval histogram #33254
- Add user-defined cluster metadata #33325
- Add sni name to SSLEngine in netty transport #33144
- Add an index setting to control TieredMergePolicy#deletesPctAllowed #32907
- Don’t count metadata fields towards index.mapping.total_fields.limit #33386
- INGEST: Implement Drop Processor #32278
- Fix deprecated setting specializations #33412
- Add conditional token filter to elasticsearch #31958
- REST high-level client: add update by query API #32760
- REST high-level client: add delete by query API #32782
- SQL: Align SYS TABLE for ODBC SQL_ALL_* args #33364
- Introduce private settings #33327
- SQL: Show/desc commands now support table ids #33363
- Add early termination support to BucketCollector #33279
Changes in 7.0:
- Generalize search.remote settings to cluster.remote #33413
- Upgrade to a Lucene 8 snapshot #33310
- Fix inner hits retrieval when stored fields are disabled (none) #33018
- BREAKING: LLREST: Drop deprecated methods #33223
Dynamic scoring features
After static scoring features like PageRank or URL length, we are adding new tools to boost by dynamic features such as recency or geo distance. In particular, using these new queries on their own rather than for boosting is an efficient way to find nearest neighbors to a point. It will be interesting to discuss how best to expose these new capabilities in Elasticsearch.
- The fact that Collector is passed a Scorer is problematic since only a small subset of the Scorer functionality may be used by Collectors. We fixed this issue by passing a new abstract class called Scorable to collectors with only the methods that may be used, and making Scorer extend Scorable. This in turn helped require that Scorers are constructed with a non-null Weight object.
- Should we use different abstractions for queries and scoring contribution from features?
- Can we add a way to remove leaves that only contain deleted documents?
- While the fact that intervals only return minimum intervals helps consistency, it also makes nested disjunctions sometimes perform in a way that is not intuitive.
- The fact that the regex completion query uses UTF32 but that the completion FST internally uses UTF8 means that any non-ASCII characters may never match.
- We implemented the matches API on interval queries.
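The Scorable change listed above is a Java API change in Lucene; the shape of the idea can still be sketched in Python. Everything below is a simplified model with hypothetical method names (`doc_id`, `score`, `next_doc`), not Lucene's actual signatures.

```python
from abc import ABC, abstractmethod


class Scorable(ABC):
    """Narrow view handed to collectors: only what they may call."""

    @abstractmethod
    def doc_id(self):
        """Return the current document id."""

    @abstractmethod
    def score(self):
        """Return the score of the current document."""


class Scorer(Scorable):
    """Full scorer: adds iteration and requires a non-null weight."""

    def __init__(self, weight):
        if weight is None:
            raise ValueError("weight must not be None")
        self.weight = weight

    @abstractmethod
    def next_doc(self):
        """Advance to the next document; return False when exhausted."""


class ListScorer(Scorer):
    """Toy scorer over a fixed list of (doc_id, score) pairs."""

    def __init__(self, weight, scored_docs):
        super().__init__(weight)
        self._scored_docs = list(scored_docs)
        self._position = -1

    def next_doc(self):
        self._position += 1
        return self._position < len(self._scored_docs)

    def doc_id(self):
        return self._scored_docs[self._position][0]

    def score(self):
        return self._scored_docs[self._position][1]


class MaxScoreCollector:
    """Collector that relies only on the Scorable interface."""

    def __init__(self):
        self.scorable = None
        self.max_score = None

    def set_scorer(self, scorable):
        self.scorable = scorable

    def collect(self, doc):
        score = self.scorable.score()
        if self.max_score is None or score > self.max_score:
            self.max_score = score


if __name__ == "__main__":
    scorer = ListScorer("toy-weight", [(0, 1.5), (3, 2.5)])
    collector = MaxScoreCollector()
    collector.set_scorer(scorer)
    while scorer.next_doc():
        collector.collect(scorer.doc_id())
    print(collector.max_score)
```

Because the collector is written against the narrow interface, it cannot accidentally depend on iteration methods that only the search loop should drive.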