This Week in Elasticsearch and Apache Lucene - 2018-05-04
Elasticsearch now maintains a CHANGELOG file in the docs folder of our GitHub repository. This changelog will track noteworthy changes going forward.
Added support for the Field Capabilities API to the Java high-level REST client, and fixed several bugs in the Field Capabilities API itself. This work is another important step towards bringing the Java REST client to feature parity with the transport client!
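For readers who have not used the Field Capabilities API, here is a minimal sketch of the REST path that such a request targets. It uses only the JDK and deliberately does not use the high-level REST client; the class name `FieldCapsPathSketch` and the helper `path` are hypothetical names chosen for this illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustration only: builds the REST path of a field capabilities request. */
final class FieldCapsPathSketch {

    /**
     * @param indices index names or patterns; an empty list targets all indices
     * @param fields  field names or patterns to describe
     */
    static String path(List<String> indices, List<String> fields) {
        String prefix = indices.isEmpty() ? "" : "/" + String.join(",", indices);
        return prefix + "/_field_caps?fields=" + String.join(",", fields);
    }

    public static void main(String[] args) {
        // Example index and field names are made up for the demo.
        System.out.println(path(Arrays.asList("logs-2018"), Arrays.asList("rating", "user")));
        System.out.println(path(Collections.emptyList(), Arrays.asList("rating")));
    }
}
```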
Added a node stats telemetry device to Rally that records responses from the node-stats API during benchmarking. It reports a useful sampling of indices stats, thread pool stats, JVM stats, and circuit breaker stats.
Improvements and Bugfixes
The distributed team recently merged an improvement that makes Elasticsearch more robust against errors inside bulk operations. In certain situations, the primary node would advance its local checkpoint, but because the bulk operation failed, these operations would not make it to the replicas, causing the local checkpoint on the replicas to lag. This improvement is very promising for addressing some long-standing replica-divergence issues in Elasticsearch.
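To make the lagging checkpoint concrete, here is a toy model of a local checkpoint: the highest sequence number for which every operation at or below it has been processed. This is a simplified sketch and not Elasticsearch's actual tracker; the class `LocalCheckpointSketch` and its methods are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of a per-shard-copy local checkpoint (illustration only). */
final class LocalCheckpointSketch {
    // Sequence numbers processed out of order, waiting for a gap to be filled.
    private final Set<Long> pending = new HashSet<>();
    // -1 means no operations have been processed yet.
    private long checkpoint = -1;

    /** Records that the operation with this sequence number was processed. */
    void markProcessed(long seqNo) {
        if (seqNo <= checkpoint) {
            return; // already covered by the checkpoint
        }
        pending.add(seqNo);
        // Advance over every contiguous sequence number that is now complete.
        while (pending.remove(checkpoint + 1)) {
            checkpoint++;
        }
    }

    long checkpoint() {
        return checkpoint;
    }

    public static void main(String[] args) {
        LocalCheckpointSketch primary = new LocalCheckpointSketch();
        LocalCheckpointSketch replica = new LocalCheckpointSketch();
        for (long seqNo = 0; seqNo <= 4; seqNo++) {
            primary.markProcessed(seqNo);
            // Simulate operation 2 never reaching the replica.
            if (seqNo != 2) {
                replica.markProcessed(seqNo);
            }
        }
        // prints "primary=4 replica=1"
        System.out.println("primary=" + primary.checkpoint() + " replica=" + replica.checkpoint());
    }
}
```

A single operation that never arrives is enough to hold the replica's checkpoint back, no matter how many later operations it processes.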
Fixed a bug in snapshot/restore that caused a NullPointerException to be thrown in certain cases where a repository has been misused or corrupted. Elasticsearch will now report a sensible error in this case.
Fixed a bug where the validate query API would break if the explain option was used with certain types of queries that required fetches, such as a terms query.
Fixed a long-standing, user-reported performance bug in the terms aggregation when aggregating on a high-cardinality field.
Fixed a performance bug in index resolution in the security plugin where an array was being repeatedly copied. This bug yielded a significant performance degradation for a customer upgrading from 2.4 to 5.6 with security enabled and a large number of index aliases.
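The general shape of this kind of bug can be sketched as follows. This is not the security plugin's code; `ArrayCopySketch` and both methods are hypothetical, and they only contrast growing an array by copying it on every append with collecting into a list and converting once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: repeated array copying versus a single conversion. */
final class ArrayCopySketch {

    // Slow pattern: each append copies the whole array, so n appends
    // do O(n^2) work and leave n short-lived arrays for the garbage collector.
    static String[] collectByCopying(List<String> names) {
        String[] result = new String[0];
        for (String name : names) {
            result = Arrays.copyOf(result, result.length + 1);
            result[result.length - 1] = name;
        }
        return result;
    }

    // Cheaper pattern: accumulate in a presized list and convert once.
    static String[] collectOnce(List<String> names) {
        List<String> result = new ArrayList<>(names.size());
        for (String name : names) {
            result.add(name);
        }
        return result.toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<String> aliases = Arrays.asList("alias-a", "alias-b", "alias-c");
        System.out.println(Arrays.equals(collectByCopying(aliases), collectOnce(aliases)));
    }
}
```

Both methods return the same result; the difference only shows up in time and garbage as the number of names grows, which matches the scenario of a large number of index aliases.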
Changes in 5.6 (release date TBD):
- Bulk operation fail to replicate operations when a mapping update times out #30244
- Security: reduce garbage during index resolution #30180
Changes in 6.3:
- SQL: Fix bug caused by empty composites #30343
- Fix merging logic of Suggester Options #29514
- Cancelling a peer recovery on the source can leak a primary permit #30318
- ReplicationTracker.markAllocationIdAsInSync may hang if allocation is cancelled #30316
- Create default ES_TMPDIR on Windows #30325
- Core: Pick inner most parse exception as root cause #30270
Changes in 6.4:
- Add a new _ignored meta field. #29658
- 6.x Backport: Terms query validate bug #30319
- Make licensing FIPS-140 compliant #30251
- Watcher: Make start/stop cycle more predictable and synchronous #30118
- Packaging: Set elasticsearch user to have non-existent homedir #29007
- Fix NPE when CumulativeSum agg encounters null value/empty bucket #29641
- SQL: Reduce number of ranges generated for comparisons #30267
- Fix failure for validate API on a terms query #29483
- REST Client: Add Request object flavored methods #29623
- SQL: Teach the CLI to ignore empty commands #30265
- Allow copying source settings on resize operation #30255
- index name added to snapshot restore exception #29604
- _cluster/state should always return cluster_uuid #30143
- Do not ignore request analysis/similarity on resize #30216
Changes in 7.0:
- Update versions for start_trial after backport #30218
- BREAKING: Network: Remove http.enabled setting #29601
First release candidate built and vote started.
Elasticsearch 6.4 will be on Lucene 7.4
The master and 6.x branches are being upgraded to use a snapshot of Lucene 7.4, which brings many features we are interested in, such as a new Korean analyzer, support for soft deletes, a shingle filter that works on synonyms, and more.
How to leverage the matches API for highlighting?
We would like to leverage the new matches API for highlighting, but it is not as straightforward as it sounds. For instance, we would need to know the matching term at a given position, which is currently not exposed. Also, the API makes it hard to use term vectors as a source for offsets.
Hardening soft deletes
We are working hard on improving doc-value updates and soft deletes:
- Soft deletes are now in sync with doc values to avoid confusing retention policies.
- Binary and numeric doc-value updates now better share code.
- Doc-value updates are now more efficient and have a safer API.
Codecs now expose raw impacts
Previously you needed to pass a SimScorer to postings so that they would give you upper bounds for the produced scores. They now return the raw (freq, norm) pairs that may trigger competitive scores, which improves testing and might allow us to work on queries that merge frequencies rather than scores in the future, such as SynonymQuery.
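As a rough sketch of what consuming raw impacts looks like, the code below derives a score upper bound from a list of (freq, norm) pairs and a scoring function supplied by the caller. The types `Impact` and `Scorer` are simplified stand-ins written for this example, not Lucene's classes, and the scoring formula in `main` is a made-up toy.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration only: turning raw (freq, norm) pairs into a score upper bound. */
final class ImpactsSketch {

    /** Stand-in for a raw impact: a term frequency and an encoded norm. */
    static final class Impact {
        final int freq;
        final long norm;

        Impact(int freq, long norm) {
            this.freq = freq;
            this.norm = norm;
        }
    }

    /** Stand-in for a similarity scorer; hypothetical interface. */
    interface Scorer {
        float score(int freq, long norm);
    }

    /** The upper bound is the best score over all candidate pairs. */
    static float maxScore(List<Impact> impacts, Scorer scorer) {
        float max = 0f;
        for (Impact impact : impacts) {
            max = Math.max(max, scorer.score(impact.freq, impact.norm));
        }
        return max;
    }

    public static void main(String[] args) {
        List<Impact> impacts = Arrays.asList(
            new Impact(1, 10), new Impact(3, 50), new Impact(5, 400));
        // Toy formula: rewards frequency, penalizes larger norms.
        Scorer toy = (freq, norm) -> freq / (freq + 0.01f * norm);
        System.out.println(maxScore(impacts, toy));
    }
}
```

Because the scorer is applied by the caller rather than inside the postings, the same raw pairs can be checked in tests or combined before scoring.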
- Should FilterTermsEnum delegate seekExact?
- Should MatchesIterator expose the current term?
- Can we add an option on ngrams to keep terms that are shorter than the minimum gram size?
- Should we use a sparse encoding for live docs when deletes are rare?
- It is possible to add corrupt data to an index via IndexWriter.addIndexes.