06 February 2017

This Week in Elasticsearch and Apache Lucene - 2017-02-06

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

ICYMI: "Protecting your data in #Elasticsearch" webinar and demo is now available on-demand. Watch now: https://t.co/GolUmTw71C
— elastic (@elastic) February 3, 2017

Unified highlighter

The new unified highlighter solves many of the problems that existed with previous highlighters, including accounting for gaps created by stopword filters. The highlighter can either analyze text directly (plain mode), use postings offsets (postings mode), or use term vectors (fvh mode). This should be your new go-to highlighter, although it is still missing a few features supported by other highlighters, such as an upper limit on fragment size, the ability to highlight a single field based on matches in multiple fields (matched_fields), and the ability to collapse contiguous highlights.
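As a rough illustration, the request below asks for the unified highlighter on a single field. It is a minimal sketch: the field name `content` and the helper `build_highlight_request` are made up for this example, and it assumes the highlighter is selected through the highlight `type` option with the value `unified`, so check the reference documentation for your version before relying on it.

```python
import json


def build_highlight_request(field, text, highlighter_type="unified"):
    """Build a search request body that highlights matches in `field`.

    `field` and `text` are placeholders; `highlighter_type` is assumed to be
    passed through the highlight "type" option.
    """
    return {
        "query": {"match": {field: text}},
        "highlight": {
            "fields": {
                field: {"type": highlighter_type},
            }
        },
    }


# Hypothetical field name and query text, used only for illustration.
body = build_highlight_request("content", "quick brown fox")
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of a regular search request.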

Groundwork laid for operation-based recovery

The team has been hard at work on the sequence numbers project, and the first high-level feature landed last week. Sequence numbers are now used for operation-based recovery (as opposed to file-based sync). With operation-based recovery, a primary can bring a replica up to speed by streaming only the operations that happened while the replica was offline. This is a great advantage compared to file-based syncing, which potentially requires copying gigabytes of data. Operation-based recovery is only done if the relevant operations can be found in the primary's translog. At the moment the chances of this happening are small, and in practice it will only happen when a replica was temporarily offline. Future work will make this more likely, to the point where operation-based recovery becomes the standard.
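The decision described above can be sketched as a toy model. This is not Elasticsearch's implementation: the `Operation` type, the `plan_recovery` function, and the idea of passing the replica's last processed sequence number as a plain integer are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Operation:
    seq_no: int
    action: str  # e.g. "index" or "delete"
    doc_id: str


def plan_recovery(
    translog: List[Operation],
    replica_checkpoint: int,
    primary_max_seq_no: int,
) -> Optional[List[Operation]]:
    """Return the operations to stream to the replica, in sequence order.

    Returns None when some required operation is no longer in the
    (hypothetical) translog, meaning a file-based sync would be needed.
    """
    needed = range(replica_checkpoint + 1, primary_max_seq_no + 1)
    available = {op.seq_no: op for op in translog}
    if any(seq_no not in available for seq_no in needed):
        return None  # fall back to copying files
    return [available[seq_no] for seq_no in needed]


# The toy translog only retains operations 3..6.
translog = [Operation(n, "index", f"doc-{n}") for n in range(3, 7)]

# Replica was briefly offline: it has everything up to 4, so ops 5 and 6 suffice.
short_outage = plan_recovery(translog, replica_checkpoint=4, primary_max_seq_no=6)

# Replica was away too long: op 2 is gone from the translog.
long_outage = plan_recovery(translog, replica_checkpoint=1, primary_max_seq_no=6)
```

In this model the short outage yields a two-operation replay, while the long outage returns `None`, which mirrors why operation-based recovery currently only helps for replicas that were offline briefly.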

Lucene discussing optimisations for leading wildcards

Fast infix searching (*abc*) or even just leading wildcard searching (*abc) has come up on Lucene's users list in the past, but has never been implemented, in part because it's seen as an abuse of a search engine: really, you should do a good job tokenizing during indexing up front so that you don't need such costly sub-token operations at search time. But it's also partly because nobody had enough of an itch to actually do the work, that is until now! The initial patch on the issue is too invasive and very heap heavy (using suffix arrays), and it isn't using the most efficient (yet complex) known approach for building suffix arrays. Still, the subsequent discussion is a nice demonstration of how healthy open source iterations unfold: one response is to fold the approach into a custom PostingsFormat, while another is to use FSTs to reduce heavy heap usage. It's not clear how the issue will finish, but it's possible Lucene will soon offer a better solution for infix searches.
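To see why sorted suffixes make infix search fast but memory hungry, here is a deliberately naive sketch. It is not the patch under discussion: it stores every suffix as a full string in a sorted list rather than building a true suffix array of offsets, and the names `build_suffix_index` and `infix_search` are invented for this example.

```python
from bisect import bisect_left
from typing import List, Set, Tuple


def build_suffix_index(terms: List[str]) -> List[Tuple[str, int]]:
    """Collect every suffix of every term, paired with its term id, sorted.

    Naive construction: memory grows with the total length of all suffixes,
    which illustrates the heap cost mentioned above.
    """
    entries = []
    for term_id, term in enumerate(terms):
        for start in range(len(term)):
            entries.append((term[start:], term_id))
    entries.sort()
    return entries


def infix_search(index: List[Tuple[str, int]], terms: List[str], infix: str) -> Set[str]:
    """Return all terms containing `infix` anywhere inside them."""
    if not infix:
        return set(terms)
    # A term contains `infix` exactly when one of its suffixes starts with it.
    # Those suffixes are contiguous in sorted order, so binary search finds the first.
    pos = bisect_left(index, (infix, -1))
    matches = set()
    while pos < len(index) and index[pos][0].startswith(infix):
        matches.add(terms[index[pos][1]])
        pos += 1
    return matches


terms = ["elasticsearch", "lucene", "research", "search"]
index = build_suffix_index(terms)
found = infix_search(index, terms, "sear")
```

The lookup is a binary search plus a short scan, regardless of where the infix sits inside a term, but the index holds one entry per character of every term. That trade-off is what the proposals to use a custom postings format or FSTs are trying to soften.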

Changes in 5.2:

Lucene upgraded to 6.4.1 which fixes a memory leak when using best_compression and another leak when using span queries with the query cache.
Reindex-from-remote should not complain when it fails to clear-scroll against an old cluster, which may have cleared the scroll automatically.
Reindex-from-remote should explicitly request the _source when reindexing from pre-2.0 clusters.
Painless bug fixes: Fix def invoked qualified method refs, and don't allow casting from void to def.
Field names with .. should be rejected.
The cluster-allocation-explain API should include information about stale shard copies when explaining an unassigned primary.
The HDFS repository path setting was not settable.
The mime4j library was missing in the ingest attachment and mapper attachment plugins.

Changes in 5.x:

Added the new unified highlighter.
Progress on the Java High Level REST client continues with: parsing the root cause from Elasticsearch exceptions, and support for get() and exists() methods.
The enable_position_increments parameter was misspelled.
Add Painless support for multi-value date fields.
Add an option to make a valid content-type header required on REST requests.
Stored scripts will no longer use the language as a namespace but instead should be uniquely identifiable by id.
Geo-distance calculation sloppy_arc is deprecated in favour of arc.

Changes in master:

When a primary migrates from an old node to a new node which supports sequence numbers, any operations without sequence numbers in the translog should still be replayed.
Single-doc updates should use the bulk action instead, like index and delete.
Removed deprecated geo-search features: geo_distance_range query, and coerce and ignore_malformed mapping parameters.
Certain replica failures (e.g. refresh or global checkpoint syncs) should not result in shard failure.
The Azure repository plugin will no longer autocreate the Azure container.
The S3 repository region setting has been removed in favour of the existing endpoint setting.
Connect socket permissions have been removed from core.

Upcoming changes:

The translog needs to be made sequence-number aware so that enough translog generations can be kept around to ensure that all operations after a certain sequence number can be replayed.
Could Painless be made to work outside Elasticsearch?
Doc values will be formatted according to their field type instead of returning their internal representation.

Apache Lucene

A 6.4.1 release candidate is available, fixing two memory leak issues, and a 5.5.4 release will likely come soon to fix the same issues
Query parsers will soon analyze the token graph for articulation points, or cut vertices, in order to create more efficient queries for multi-token synonyms
Suffix arrays allow for fast infix and suffix searching at the expense of added RAM, but the current patch should really be reworked as a custom postings format
LatLonPoint's polygon query gets faster by pre-computing the relations for the polygon with a grid, giving a big jump in our nightly geo benchmarks
ToParentBlockJoinQuery now implements two-phased iteration so that it can be executed more efficiently when added to a BooleanQuery along with other required clauses
Lucene's default analyzer, StandardAnalyzer, should not remove English stop words by default
Multi polygons around the 180th meridian, such as areas around the Fiji islands, result in poor polygon filtering performance using LatLonPoint today, but the potential fix proves controversial
BKDReader, used to visit matching dimensional points for a query, could more efficiently pre-allocate up front for the expected number of hits
Our geo distance queries could get faster with a better first pass approximation than a bounding box
We will deprecate the postings-based GeoPointField since the dimensional points implementation is faster and smaller
It is far too easy to accidentally create an index that messes up block joins
Should UpgradeIndexMergePolicy, used today only by the IndexUpgrader tool, be generalized for wider use?
Our GPG KEYS file was suddenly empty, blocking the 6.4.1 release for a bit and prompting us to revisit how the KEYS file is generated, with a view to simplifying it
AnalyzingInfixSuggester now defers opening an IndexWriter until new suggestions need to be indexed
Lucene will not get an implementation of a logistic regression classifier yet because there is no simple way to train the logistic regression model in Lucene
SortedNumericDocValues can soon be wrapped as a ValueSource using a selector, making their values accessible to things like function queries and expressions
The CachingNaiveBayesClassifier is failing to compute the prior probability on the categories
ComplexPhraseQueryParser confusingly requires double-escaping a trailing colon inside a phrase
The CannedTokenStream class, used for testing tokenizers, failed to include the type attribute in each token
LatLonPoint's distance query does way too much work when the area within the specified distance crosses the 180th meridian
ToParentBlockJoinCollector is now gone, replaced with a new more efficient query wrapper to locate matching child documents for a given parent document, allowing us to remove the odd dependency between the join module and the grouping module
The XML query parser now implements the SpanQueryBuilder interface
Javadocs for the join module had some small typos
The recent optimization to estimating how many points a given shape will visit caused another fun test failure, but it was fortunately just a test bug

Watch This Space

Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!
