This Week in Elasticsearch and Apache Lucene - 2017-02-06
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
The new unified highlighter solves many of the problems that existed with previous highlighters, including accounting for gaps created by stopword filters. The highlighter can either analyze text directly (`plain` mode), use postings offsets (`postings` mode), or use term vectors (`fvh` mode). This should be your new go-to highlighter, although it is still missing a few features supported by other highlighters, such as an upper limit on fragment size, the ability to highlight a single field based on matches in multiple fields (`matched_fields`), and the ability to collapse contiguous highlights.
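For context, the highlighter implementation is chosen per field in a search request. A sketch of what selecting the unified highlighter looks like (the index and field names here are illustrative, and availability of the `unified` type depends on your Elasticsearch version):

```json
POST /articles/_search
{
  "query": { "match": { "body": "stop word gaps" } },
  "highlight": {
    "fields": {
      "body": { "type": "unified" }
    }
  }
}
```

The same `type` option accepts `plain` and `fvh` to select the other highlighters mentioned above.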
Groundwork laid for operation-based recovery
The team has been hard at work on the sequence numbers project, and the first high-level feature landed last week: sequence numbers are now used for operation-based recovery (as opposed to file-based sync). With operation-based recovery, a primary can bring a replica up to speed by streaming only the operations that happened while the replica was offline. This is a great advantage compared to file-based syncing, which potentially requires copying gigabytes of data. Operation-based recovery is only possible if the relevant operations can still be found in the primary's translog. At the moment the chances of this happening are small, and in practice it will only happen when a replica was temporarily offline. Future work will make this more likely, to the point where operation-based recovery becomes the standard.
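The decision described above can be sketched in a few lines. This is a toy model with hypothetical names, not Elasticsearch's actual recovery code: the primary replays translog operations above the replica's last-known sequence number when its translog still retains them, and falls back to copying segment files otherwise.

```python
# Sketch of sequence-number-based (operation-based) peer recovery.
# All names and structures are illustrative, not Elasticsearch's classes.

def recover(replica_checkpoint, translog):
    """translog: dict mapping seq_no -> operation, for the ops the
    primary still retains. replica_checkpoint: highest seq_no the
    replica has already processed."""
    needed = range(replica_checkpoint + 1, max(translog) + 1) if translog else []
    if all(seq in translog for seq in needed):
        # Operation-based: stream only the operations the replica missed.
        return [translog[seq] for seq in needed]
    # Some needed ops were already trimmed from the translog:
    # fall back to file-based recovery, copying segment files wholesale.
    return "file-based recovery"

ops = {1: "index doc1", 2: "index doc2", 3: "delete doc1", 4: "index doc3"}
# A replica that saw ops up to seq_no 2 is caught up from the translog:
print(recover(2, ops))               # ['delete doc1', 'index doc3']
# If ops 1-2 were already trimmed, a stale replica needs a full copy:
print(recover(0, {3: "x", 4: "y"}))  # file-based recovery
```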
Lucene discussing optimisations for leading wildcards
Fast infix searching (*abc*), or even just leading-wildcard searching (*abc), has come up on Lucene's users list in the past but has never been implemented, partly because it is seen as an abuse of a search engine: really, you should do a good job of tokenizing up front during indexing so that you don't need such costly sub-token operations at search time. But it's also partly because nobody had enough of an itch to actually do the work, that is, until now! The initial patch on the issue is too invasive, very heap-heavy (it uses suffix arrays), and doesn't use the most efficient (yet complex) known approach for building suffix arrays, but the subsequent discussion is a nice demonstration of how healthy open-source iteration unfolds: one response is to fold the approach into a custom PostingsFormat, while another is to use FSTs to reduce the heavy heap usage. It's not clear how the issue will end, but it's possible Lucene will soon offer a better solution for infix searches.
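To see why suffix arrays make infix search fast, and why they are heap-hungry, here is a toy sketch: sorting every suffix of every indexed term lets an infix query be answered with two binary searches. The construction below is deliberately naive, for illustration only; the linear-time suffix-array algorithms mentioned on the issue are far more involved.

```python
import bisect

def build_suffix_array(terms):
    """Naively collect and sort every suffix of every term.

    Each entry is (suffix, term) so a match maps back to its term.
    This is the memory cost critiqued on the issue: a term of
    length k contributes k entries.
    """
    return sorted((term[i:], term) for term in terms for i in range(len(term)))

def infix_search(suffix_array, infix):
    """Find all terms containing `infix` via two binary searches."""
    keys = [s for s, _ in suffix_array]
    lo = bisect.bisect_left(keys, infix)
    # Everything prefixed by `infix` sorts below infix + a max codepoint.
    hi = bisect.bisect_right(keys, infix + "\uffff")
    return sorted({term for _, term in suffix_array[lo:hi]})

sa = build_suffix_array(["lucene", "elastic", "elasticsearch", "clean"])
print(infix_search(sa, "lea"))  # ['clean']
print(infix_search(sa, "las"))  # ['elastic', 'elasticsearch']
```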
Changes in 5.2:
- Lucene upgraded to 6.4.1, which fixes a memory leak when using `best_compression` and another leak when using `span` queries with the query cache.
- Reindex-from-remote should not complain when it fails to clear-scroll against an old cluster, which may have cleared the scroll automatically.
- Reindex-from-remote should explicitly request the `_source` when reindexing from pre-2.0 clusters.
- Painless bug fixes: Fix def invoked qualified method refs, and don't allow casting from void to def.
- Field names with `..` should be rejected.
- The cluster-allocation-explain API should include information about stale shard copies when explaining an unassigned primary.
- The HDFS repository path setting was not settable.
- The mime4j library was missing in the ingest attachment and mapper attachment plugins.
- Added the new
- Progress on the Java High Level REST client continues with: parsing the root cause from Elasticsearch exceptions, and support for get() and exists() methods.
- The `enable_position_increments` parameter was misspelled.
- Add Painless support for multi-value date fields.
- Add an option to make a valid content-type header required on REST requests.
- Stored scripts will no longer use the language as a namespace but instead should be uniquely identifiable by
- Geo-distance calculation `sloppy_arc` is deprecated in favour of
- When a primary migrates from an old node to a new node which supports sequence numbers, any operations without sequence numbers in the translog should still be replayed.
- Single-doc updates should use the bulk action instead, like index and delete.
- Removed deprecated geo-search features:
- Certain replica failures (eg refresh or global checkpoint syncs) should not result in shard failure.
- The Azure repository plugin will no longer autocreate the Azure container.
- The S3 repository `region` setting has been removed in favour of the existing
- Connect socket permissions have been removed from core.
- The translog needs to be made sequence-number aware so that enough translog generations can be kept around to ensure that all operations after a certain sequence number can be replayed.
- Could Painless be made to work outside Elasticsearch?
- Doc values will be formatted according to their field type instead of returning their internal representation.
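As an illustration of the last point above (the values are hypothetical, and this is not Elasticsearch code): a date field's doc values are stored internally as epoch milliseconds, and formatting according to the field type means returning a date string rather than that raw long.

```python
from datetime import datetime, timezone

# Internal doc-values representation of a date field: epoch milliseconds.
raw = 1486339200000

# What "format according to the field type" means for a date field:
# render the internal long as a date string instead of returning it as-is.
formatted = datetime.fromtimestamp(raw / 1000, tz=timezone.utc).strftime(
    "%Y-%m-%dT%H:%M:%S.000Z"
)
print(formatted)  # 2017-02-06T00:00:00.000Z
```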
- A 6.4.1 release candidate is available, fixing two memory leak issues, and a 5.5.4 release will likely come soon to fix the same issues
- Query parsers will soon analyze the token graph for articulation points (or cut vertices) in order to create more efficient queries for multi-token synonyms
- Suffix arrays allow for fast infix and suffix searching at the expense of added RAM, but the current patch should really be reworked as a custom postings format
- `LatLonPoint`'s polygon query gets faster by pre-computing the relations for the polygon with a grid, giving a big jump in our nightly geo benchmarks
- `ToParentBlockJoinQuery` now implements two-phased iteration so that it can be executed more efficiently when added to a `BooleanQuery` along with other required clauses
- Lucene's default analyzer, `StandardAnalyzer`, should not remove English stop words by default
- Multi-polygons around the 180th meridian, such as areas around the Fiji islands, result in poor polygon filtering performance using `LatLonPoint` today, but the potential fix proves controversial
- `BKDReader`, used to visit matching dimensional points for a query, could more efficiently pre-allocate up front for the expected number of hits
- Our geo distance queries could get faster with a better first pass approximation than a bounding box
- We will deprecate the postings-based `GeoPointField` since the dimensional points implementation is faster and smaller
- It is far too easy to accidentally create an index that messes up block joins
- Could `UpgradeIndexMergePolicy`, used today only by the `IndexUpgrader` tool, be generalized for wider use?
- Our GPG `KEYS` file was suddenly empty, blocking the 6.4.1 release for a bit and causing us to revisit simplifying how the `KEYS` file is generated
- `AnalyzingInfixSuggester` now defers opening an `IndexWriter` until new suggestions need to be indexed
- Lucene will not get an implementation of a logistic regression classifier yet because there is no simple way to train the logistic regression model in Lucene
- `SortedNumericDocValues` can soon be wrapped as a `ValueSource` using a selector, making their values accessible to things like function queries and expressions
- `CachingNaiveBayesClassifier` is failing to compute the prior probability on the categories
- `ComplexPhraseQueryParser` confusingly requires double-escaping a trailing colon inside a phrase
- The `CannedTokenStream` class, used for testing tokenizers, failed to include the type attribute in each token
- `LatLonPoint`'s distance query does way too much work when the area within the specified distance crosses the 180th meridian
- `ToParentBlockJoinCollector` is now gone, replaced with a new, more efficient query wrapper to locate matching child documents for a given parent document, allowing us to remove the odd dependency between the `join` module and the
- The XML query parser now implements the
- Javadocs for the `join` module had some small typos
- The recent optimization to estimating how many points a given shape will visit caused another fun test failure, but it was fortunately just a test bug
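One of the Lucene items above mentions analyzing the token graph for articulation points (cut vertices), which mark positions where a multi-token synonym query can safely be split. A toy sketch of the classic DFS low-link algorithm for finding them; the graph below is illustrative, not Lucene's actual token-stream graph:

```python
def articulation_points(graph):
    """Return the articulation points of an undirected graph
    (adjacency-list dict) using the standard DFS low-link algorithm."""
    disc, low, points = {}, {}, set()
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])  # back edge
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # A non-root u is a cut vertex if no back edge from v's
                # subtree climbs above u.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        # The root is a cut vertex iff it has more than one DFS child.
        if parent is None and children > 1:
            points.add(u)

    for node in graph:
        if node not in disc:
            dfs(node, None)
    return points

# Toy position graph for "new york city" with a synonym "ny" spanning
# positions 0->2: edges 0-1 (new), 1-2 (york), 0-2 (ny), 2-3 (city).
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(sorted(articulation_points(g)))  # [2]
# Position 2 is a cut vertex, so a parser may split the query there:
# ((new york) OR ny) followed by (city).
```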
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!