This Week in Elasticsearch and Apache Lucene - 2017-06-05
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Typeless parent/child support
With the removal of types, we had to come up with a way to support parent/child joins that didn't depend on having multiple mapping types in an index. Good progress has been made in 5.5 with the introduction of the ParentJoinFieldMapper, a field mapper that creates a parent/child relation between documents of the same index. Only one join field is allowed per index. The join field supports eager_global_ordinals, which defaults to true. Parent/child queries and aggregations have been updated to work with the new join field. Existing 5.x parent/child indices will continue to work as before in 6.x.
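As a rough illustration of what a join field looks like in practice, the sketch below builds the request bodies as plain Python dictionaries. It follows the shape of the join datatype as documented for the released versions, which may differ from the state of the code at the time of this post. The type name doc, the field name my_join_field, the question/answer relation and the document contents are all made up for the example.

```python
import json

# Index body: a single join field declaring that "question" documents
# are parents of "answer" documents. All names here are hypothetical.
index_body = {
    "mappings": {
        "doc": {
            "properties": {
                "my_join_field": {
                    "type": "join",
                    "relations": {"question": "answer"},
                }
            }
        }
    }
}

# A parent document only names its side of the relation.
parent_doc = {
    "text": "How do joins work without types?",
    "my_join_field": {"name": "question"},
}

# A child document names its side of the relation and points at the parent id.
# It has to be indexed with routing set to the parent id so both documents
# land on the same shard.
child_doc = {
    "text": "Through a dedicated join field.",
    "my_join_field": {"name": "answer", "parent": "1"},
}

# Parent/child queries refer to the relation name rather than a mapping type.
has_child_query = {
    "query": {"has_child": {"type": "answer", "query": {"match_all": {}}}}
}

print(json.dumps(index_body, indent=2))
```

The relation names take over the role that separate mapping types used to play, which is why a single mapping type per index is enough.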
Changes in 5.4:
- The context suggester can now read contexts from
- When a terms agg has a sub-agg which can't be deferred, use global_ordinals to reduce memory requirements.
- The circuit breaker should be incremented before allocating bytes to a BigArray.
- PatternAnalyzer should lowercase wildcard queries when
- The number of processes is now set in the systemd unit file.
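The circuit breaker item in the list above comes down to ordering: account for the bytes first, and allocate only if the reservation succeeds. The sketch below illustrates that ordering in Python. The CircuitBreaker class, its method names and the allocate helper are hypothetical and are not the Elasticsearch implementation.

```python
class CircuitBreakingError(Exception):
    """Raised when a reservation would push usage over the limit."""


class CircuitBreaker:
    """Hypothetical byte-accounting breaker, for illustration only."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def reserve(self, num_bytes):
        # Check and account for the bytes before any memory is allocated.
        if self.used_bytes + num_bytes > self.limit_bytes:
            raise CircuitBreakingError(
                "reserving %d bytes would exceed the limit of %d"
                % (num_bytes, self.limit_bytes)
            )
        self.used_bytes += num_bytes

    def release(self, num_bytes):
        self.used_bytes -= num_bytes


def allocate(breaker, num_bytes):
    """Reserve first, then allocate; undo the reservation if allocation fails."""
    breaker.reserve(num_bytes)
    try:
        return bytearray(num_bytes)
    except MemoryError:
        breaker.release(num_bytes)
        raise
```

If the allocation happened first, an oversized request could exhaust memory before the breaker had a chance to reject it.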
Changes in 5.x:
- POST requests with missing content no longer throw an NPE.
- kv ingest processor now aligns with Logstash by supporting
- The Ingest date processor used floats instead of doubles, which led to approximate dates.
- moving_avg agg didn't always set the correct
- Field collapsing search requests did not include the time it took to retrieve inner hits.
- Added XContent serialization for search-scroll request, clear scroll request, and clear scroll response.
- Include the name of the duplicate JAR causing jarhell.
- Eliminate array access in tight loops when profiling is enabled.
- elasticsearch-plugin will remove config files when uninstalling a plugin.
- Always accumulate transport exceptions to avoid the situation where no response is returned due to failure, but no exceptions are reported.
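To see why floats give approximate dates, as in the Ingest date processor item above, it helps to round-trip a timestamp through a 32-bit float. The sketch below uses Python's struct module to do that. The timestamp is an arbitrary example, and treating it as epoch milliseconds is an assumption made for illustration, not a description of the processor's internals.

```python
import struct


def to_float32(value):
    """Round-trip a number through a 32-bit float (like a Java float)."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


def to_float64(value):
    """Round-trip a number through a 64-bit float (like a Java double)."""
    return struct.unpack("d", struct.pack("d", float(value)))[0]


# Hypothetical input: 2017-06-05T00:00:00Z as epoch milliseconds.
epoch_millis = 1496620800000

as_float = int(to_float32(epoch_millis))
as_double = int(to_float64(epoch_millis))

# A 32-bit float has a 24-bit significand, so near 1.5e12 adjacent
# representable values are 2**17 ms (about 131 seconds) apart.
float_error_ms = abs(as_float - epoch_millis)
double_error_ms = abs(as_double - epoch_millis)

print("float error (ms):", float_error_ms)
print("double error (ms):", double_error_ms)
```

A double has a 53-bit significand, which is enough to represent any epoch-millisecond value in the foreseeable future exactly.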
Changes in master:
- [BREAKING] The get-index API no longer allows comma-separated values for
- TokenizerFactory#name has been removed in favour of
- The new nodes-usage API provides information about how often different features are used. Currently this is limited to counting calls to REST endpoints, but these stats can be expanded as needed.
- The Java High Level REST client now supports search,
- The primary term is only incremented once all in-flight requests have completed, allowing for a clean transition of primary term.
- If the primary throws an exception while handling the response from a replica, the primary should be failed, not the replica.
- When a replica is promoted to primary, all gaps in the sequence number history should be filled with noops.
- The IndexDeletionPolicy should be internal to the engine, not passed in from IndexShard.
- The decision to delete a transaction log is no longer hard coded, but contained in the new
- The matrix_stats aggregation now reports the doc_count, and the significant terms agg reports the
- Since the Lucene 7 upgrade, the script-sort always returned Double.MAX_VALUE when the script returned a number.
- Stored scripts take an optional context parameter which, when present, will cause the script to be compiled and validated at PUT time. Painless scripts are able to use these script contexts.
- PainlessScript is now an interface, and the Painless compiler uses an instance per context.
- StatefulFactoryType as optional intermediate factory in script contexts to support search scripts which create a script per Lucene segment.
- Search response times are now tracked as an EWMA, which will be used to auto-resize the search threadpool queue.
- Plugins can now register pre-configured character filters.
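For readers unfamiliar with the EWMA mentioned in the list above, here is a minimal sketch of an exponentially weighted moving average over response times. The class name, the alpha value and the sample timings are assumptions for illustration. How Elasticsearch feeds the average into queue resizing is not shown here.

```python
class Ewma:
    """Exponentially weighted moving average of a stream of samples."""

    def __init__(self, alpha, initial):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha  # weight given to the newest sample
        self.average = float(initial)

    def add(self, value):
        # New average = alpha * sample + (1 - alpha) * previous average.
        self.average = self.alpha * value + (1.0 - self.alpha) * self.average
        return self.average


# Hypothetical response times in milliseconds.
ewma = Ewma(alpha=0.5, initial=100.0)
for took_ms in (200.0, 50.0):
    ewma.add(took_ms)
print(ewma.average)
```

A larger alpha makes the average react faster to recent samples, while a smaller alpha smooths out short spikes. Only one number has to be kept per tracked metric, which makes this cheap to maintain on a hot path.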
We expect to cut the branch during or soon after Berlin Buzzwords.
After several respins, it looks like the vote for 6.6.0 is about to pass.
Custom term frequencies
Lucene will soon allow you to index custom term frequencies. This is useful if you want to use term frequencies to hold scoring signals. Previously, the only way to do this was with payloads, which cost more in terms of CPU usage and disk requirements. However, this remains a very expert feature, which only works with the DOCS_AND_FREQS index options.
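To give a feel for what storing a signal in the term frequency means for scoring, the sketch below computes the term-frequency part of a BM25-style score for a few frequency values. This is a standalone illustration, not Lucene code. The function name and the idea of feeding it a custom signal are assumptions, and k1 and b are set to commonly used BM25 defaults.

```python
def bm25_tf_component(freq, k1=1.2, b=0.75, doc_len=1.0, avg_doc_len=1.0):
    """Term-frequency part of a BM25-style score for one term in one document.

    freq is whatever was indexed as the term frequency; with custom term
    frequencies it could carry an application-defined signal instead of a
    plain occurrence count.
    """
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1.0) / (freq + length_norm)


# Hypothetical signal values stored as term frequencies.
for signal in (1, 5, 50, 500):
    print(signal, round(bm25_tf_component(signal), 4))
```

Under this kind of formula the contribution grows with the stored value but saturates below k1 + 1, so very large signal values add little. That is worth keeping in mind when deciding how to scale a signal before indexing it as a frequency.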
The introduction of a dependency from the classification module on the sandbox module triggered a discussion about when it is fine for modules to depend on each other.
- Our copyright notices were outdated.
- There is disagreement about whether the terms index should be loaded lazily in memory.
- Could we detect duplicate postings lists? For instance, this happens by design if you index the same field both normally and with a ReverseStringFilter in order to support suffix queries.
- Should BKD trees store the actual bounds in each dimension on the leaves instead of relying on approximate values provided by the BKD index?
- IndexSearcher exposes a way to customize how searching a single index can be parallelized, but this feature can be used in such a way that it breaks searchAfter.
- Range queries on range fields could get faster by short-circuiting the query on inner nodes when all points either match or do not match.
- Support for legacy numerics (postings-based rather than points) has been removed from Lucene but will stay in Solr for one additional major version.
- The factory of the wikipedia tokenizer does not expose all its options.
- Gaps in graph phrase queries were ignored.
- Can we optimize norms for the case that values have fewer than 16 terms?
- Should we move to doc values to store block-join relations?
- There is ongoing work on having a geo bounding box field.
- The introspection API of TermInSetQuery is not good.
- Join queries had equals/hashcode bugs.
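The idea of short-circuiting on inner nodes, raised in the range query item above, can be sketched with a small tree in which every node records the bounds of the values beneath it. This is a deliberately simplified model: it uses one-dimensional points rather than range fields, and the Node class and the build and range_query functions are hypothetical and bear no relation to Lucene's actual BKD code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    min_value: int
    max_value: int
    values: List[int] = field(default_factory=list)  # filled on leaves only
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(sorted_values, leaf_size=2):
    """Build a balanced tree from a non-empty sorted list of values."""
    node = Node(sorted_values[0], sorted_values[-1])
    if len(sorted_values) <= leaf_size:
        node.values = list(sorted_values)
    else:
        mid = len(sorted_values) // 2
        node.left = build(sorted_values[:mid], leaf_size)
        node.right = build(sorted_values[mid:], leaf_size)
    return node


def collect_all(node, out):
    """Add every value under node without testing any of them."""
    if node.left is None:
        out.extend(node.values)
    else:
        collect_all(node.left, out)
        collect_all(node.right, out)


def range_query(node, lo, hi, out):
    if node.max_value < lo or node.min_value > hi:
        return  # nothing under this node matches: skip the whole subtree
    if lo <= node.min_value and node.max_value <= hi:
        collect_all(node, out)  # everything matches: no per-value checks
        return
    if node.left is None:
        out.extend(v for v in node.values if lo <= v <= hi)
        return
    range_query(node.left, lo, hi, out)
    range_query(node.right, lo, hi, out)


root = build(list(range(16)))
hits = []
range_query(root, 3, 12, hits)
print(hits)
```

Per-value comparisons are only needed on leaves that straddle a query boundary. Subtrees that are entirely inside or entirely outside the query are resolved from their bounds alone.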
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!