04 May 2018 Engineering

This Week in Elasticsearch and Apache Lucene - 2018-05-04

By Tom Callahan and Adrien Grand

Announcement

Elasticsearch now maintains a CHANGELOG file in the docs folder of our GitHub repository. This changelog will track noteworthy changes going forward.

Sighting

Oracle highlighted Elasticsearch in a recent blog post for our rapid adoption of the new JDK release cadence. In particular, they referenced our blog post from February.

Features

Added support for the Field Capabilities API to the Java high-level REST client. In addition, several bugs in the Field Capabilities API itself were fixed. This work is another important step towards bringing the Java REST client to feature parity with the transport client!

Added a node stats telemetry device to Rally for recording responses from the node-stats API during benchmarking; it reports a useful sampling of indices stats, thread pool stats, JVM stats, and circuit breaker stats.

Improved our licensing code to be FIPS-140 compliant. This is one of the items on our roadmap to FIPS-140 compliance in Elasticsearch.

Improvements and Bugfixes

The distributed team recently merged an improvement that makes Elasticsearch more robust when errors occur inside bulk operations. In certain situations, the primary would advance its local checkpoint, but because the bulk operation failed, those operations would never reach the replicas, causing the local checkpoint on the replicas to lag. This improvement is very promising in terms of addressing some long-term replica-divergence issues in Elasticsearch.
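To make the failure mode concrete, here is a minimal Python sketch of how a local checkpoint falls behind when some operations never arrive. It assumes only that the local checkpoint is the highest sequence number at or below which every operation has been processed. The class `CheckpointSketch` and the "operations 2 and 3 are lost" scenario are made up for illustration and are not Elasticsearch's actual implementation.

```python
class CheckpointSketch:
    """Hypothetical tracker: processed sequence numbers -> local checkpoint."""

    def __init__(self):
        self.checkpoint = -1   # nothing processed yet
        self.pending = set()   # processed seq nos that sit above a gap

    def mark_processed(self, seq_no):
        if seq_no <= self.checkpoint:
            return  # already covered by the checkpoint
        self.pending.add(seq_no)
        # Advance over any contiguous run starting right after the checkpoint.
        while self.checkpoint + 1 in self.pending:
            self.checkpoint += 1
            self.pending.remove(self.checkpoint)


primary = CheckpointSketch()
replica = CheckpointSketch()

# The primary processes operations 0..4 locally.
for seq_no in range(5):
    primary.mark_processed(seq_no)

# Illustrative failure: operations 2 and 3 never reach the replica.
for seq_no in (0, 1, 4):
    replica.mark_processed(seq_no)

print(primary.checkpoint)  # 4
print(replica.checkpoint)  # 1
```

The replica's checkpoint stays at 1 until the missing operations are delivered, which is the kind of lag described above.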

Fixed a bug in snapshot/restore that caused a NullPointerException to be thrown in certain cases where a repository had been misused or corrupted. Elasticsearch will now report a sensible error in this case.

Fixed a bug where the validate query API would break if the explain option was used with certain types of queries that required fetches, such as a terms query.

Fixed a long-standing user-reported performance bug in the terms aggregation when aggregating a high cardinality field.

Fixed a performance bug in index resolution in the security plugin where an array was being repeatedly copied. This bug yielded a significant performance degradation for a customer upgrading from 2.4 to 5.6 with security enabled and a large number of index aliases.

Changes in 5.6 (release date TBD):

  • Bulk operation fail to replicate operations when a mapping update times out #30244
  • Security: reduce garbage during index resolution #30180

Changes in 6.3:

  • SQL: Fix bug caused by empty composites #30343
  • Fix merging logic of Suggester Options #29514
  • Cancelling a peer recovery on the source can leak a primary permit #30318
  • ReplicationTracker.markAllocationIdAsInSync may hang if allocation is cancelled #30316
  • Create default ES_TMPDIR on Windows #30325
  • Core: Pick inner most parse exception as root cause #30270

Changes in 6.4:

  • Add a new _ignored meta field. #29658
  • 6.x Backport: Terms query validate bug #30319
  • Make licensing FIPS-140 compliant #30251
  • Watcher: Make start/stop cycle more predictable and synchronous #30118
  • Packaging: Set elasticsearch user to have non-existent homedir #29007
  • Fix NPE when CumulativeSum agg encounters null value/empty bucket #29641
  • SQL: Reduce number of ranges generated for comparisons #30267
  • Fix failure for validate API on a terms query #29483
  • REST Client: Add Request object flavored methods #29623
  • SQL: Teach the CLI to ignore empty commands #30265
  • Allow copying source settings on resize operation #30255
  • index name added to snapshot restore exception #29604
  • _cluster/state should always return cluster_uuid #30143
  • Do not ignore request analysis/similarity on resize #30216

Changes in 7.0:

  • Update versions for start_trial after backport #30218
  • BREAKING: Network: Remove http.enabled setting #29601

Lucene 

7.3.1 Release

First release candidate built and vote started.

Elasticsearch 6.4 will be on Lucene 7.4

The master and 6.x branches are being upgraded to use a snapshot of Lucene 7.4, which brings many features we are interested in, such as a new Korean analyzer, support for soft deletes, a shingle filter that works on synonyms, and more.

How to leverage the matches API for highlighting?

We would like to leverage the new matches API for highlighting, but it is not as straightforward as it sounds. For instance, we would need to know the matching term at a given position, which is currently not exposed. The API also makes it hard to use term vectors as a source for offsets.

Hardening soft deletes

We are working hard on improving doc-value updates and soft deletes:

- Soft deletes are now in sync with doc values to avoid confusing retention policies.

- Only hard deletes should be carried over during a merge.

- Binary and numeric doc-value updates now better share code.

- Doc-value updates are now more efficient and have a safer API.

Codecs now expose raw impacts

Previously you needed to pass a SimScorer to postings so that they would give you upper bounds for the produced scores. They now return the raw (freq, norm) pairs that may trigger competitive scores, which improves testing and might allow us to work on queries that merge frequencies rather than scores in the future, such as SynonymQuery.
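As a rough illustration of why raw pairs are enough, the Python sketch below derives a score upper bound from a list of (freq, norm) pairs. It is a simplification under stated assumptions: the norm is treated as a plain field length (Lucene encodes norms differently), the scoring function is a BM25-style term-frequency formula with the IDF factor left out, and `bm25_tf`, `competitive_impacts` and `max_score` are hypothetical names, not Lucene APIs.

```python
def bm25_tf(freq, length, avg_length=10.0, k1=1.2, b=0.75):
    """Term-frequency part of a BM25-style score (IDF omitted for brevity)."""
    return freq * (k1 + 1) / (freq + k1 * (1 - b + b * length / avg_length))


def competitive_impacts(pairs):
    """Keep only the (freq, length) pairs that no other pair dominates.

    One pair dominates another when its freq is >= and its length is <=,
    because the score grows with freq and shrinks with length.
    """
    kept = []
    # Visit the shortest lengths first; ties are broken by highest freq.
    for freq, length in sorted(set(pairs), key=lambda p: (p[1], -p[0])):
        if not kept or freq > kept[-1][0]:
            kept.append((freq, length))
    return kept


def max_score(impacts, scorer):
    """Upper bound on the score of any document summarized by the impacts."""
    return max(scorer(freq, length) for freq, length in impacts)


# Illustrative (freq, length) pairs for the documents in one block of postings.
pairs = [(1, 4), (3, 12), (2, 12), (2, 8), (1, 20)]
impacts = competitive_impacts(pairs)
print(impacts)  # [(1, 4), (2, 8), (3, 12)]
print(max_score(impacts, bm25_tf) == max_score(pairs, bm25_tf))  # True
```

Because the scorer is only applied at query time, the same stored pairs can serve any scoring function that increases with frequency and decreases with length.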

Other

- Should FilterTermsEnum delegate seekExact?

- Should MatchesIterator expose the current term?

- Can we add an option on ngrams to keep terms that are shorter than the minimum gram size?

- Bugs were found in the context suggester, which does not handle empty regexps or regexps that start with a dot.

- CheckIndex now cross-checks norms and terms.

- Should we use a sparse encoding for live docs when deletes are rare?

- It is possible to add corrupt data to an index via IndexWriter.addIndexes.
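The n-gram question in the list above can be illustrated with a small Python sketch. The function `ngrams` and its `keep_short` flag are hypothetical names made up for this example and are not a Lucene API; the sketch only shows the behavior being discussed, where a term shorter than the minimum gram size is either dropped or kept as-is.

```python
def ngrams(term, min_gram, max_gram, keep_short=False):
    """Return all character n-grams of `term` with min_gram <= n <= max_gram.

    With keep_short=True, a non-empty term shorter than min_gram is emitted
    unchanged instead of producing no tokens at all.
    """
    if len(term) < min_gram:
        return [term] if keep_short and term else []
    grams = []
    for start in range(len(term)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(term):
                grams.append(term[start:start + size])
    return grams


print(ngrams("lucene", 3, 4))
# ['luc', 'luce', 'uce', 'ucen', 'cen', 'cene', 'ene']
print(ngrams("es", 3, 4))                   # []
print(ngrams("es", 3, 4, keep_short=True))  # ['es']
```

Without such an option, a short term like "es" yields no tokens and can never match at search time.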