Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
ICYMI: "Protecting your data in #Elasticsearch" webinar and demo is now available on-demand. Watch now: https://t.co/GolUmTw71C
— elastic (@elastic) February 3, 2017
Unified highlighter
The new unified highlighter solves many of the problems that existed with previous highlighters, including accounting for gaps created by stopword filters. The highlighter can either analyze text directly (plain mode), use postings offsets (postings mode), or use term vectors (fvh mode). This should be your new go-to highlighter, although it is still missing a few features supported by other highlighters, such as an upper limit on fragment size, the ability to highlight a single field based on matches in multiple fields (matched_fields), and the ability to collapse contiguous highlights.
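As a rough illustration, a search request can opt into the new highlighter by setting the highlight type on a field. The sketch below only builds the request body; the field name "body" and the query text are made up for the example, and how you send the request is left to your client of choice.

```python
import json


def unified_highlight_request(field, text):
    """Build a search body that asks for the unified highlighter on one field.

    The field name and query text are supplied by the caller; nothing here
    talks to a cluster.
    """
    return {
        "query": {"match": {field: text}},
        "highlight": {
            "fields": {
                # "type" selects the highlighter implementation for this field.
                field: {"type": "unified"}
            }
        },
    }


# Hypothetical field name and query, for illustration only.
body = unified_highlight_request("body", "quick brown fox")
print(json.dumps(body, indent=2))
```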
Groundwork laid for operation-based recovery
The team has been hard at work on the sequence numbers project, and the first high-level feature landed last week. Sequence numbers are now used for operation-based recovery (as opposed to file-based sync). With operation-based recovery, a primary can bring a replica up to speed by streaming only the operations that happened while the replica was offline. This is a great advantage over file-based syncing, which potentially requires copying gigabytes of data. Operation-based recovery is only done if the relevant operations can be found in the primary's translog. At the moment the chances of this are small, and in practice it will only happen when a replica was temporarily offline. Future work will make this more likely, to the point where operation-based recovery becomes the standard.
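The decision described above can be pictured with a toy model. This is not Elasticsearch's implementation: the class names, the `max_seq_no` field and the `recover` function are all hypothetical, and the real recovery logic handles many more cases. The sketch only shows the core idea: replay from the translog when every operation after the replica's checkpoint is still retained, otherwise fall back to copying files.

```python
from dataclasses import dataclass, field


@dataclass
class Operation:
    seq_no: int
    action: str  # e.g. "index" or "delete"
    doc_id: str


@dataclass
class Primary:
    max_seq_no: int  # highest sequence number the primary has assigned
    translog: list = field(default_factory=list)  # operations still retained

    def ops_after(self, checkpoint):
        """Return retained ops with seq_no > checkpoint, or None if any are missing."""
        wanted = sorted(
            (op for op in self.translog if op.seq_no > checkpoint),
            key=lambda op: op.seq_no,
        )
        needed = set(range(checkpoint + 1, self.max_seq_no + 1))
        have = {op.seq_no for op in wanted}
        return wanted if needed == have else None


def recover(primary, replica_checkpoint):
    """Pick a recovery strategy for a replica that processed ops up to its checkpoint."""
    ops = primary.ops_after(replica_checkpoint)
    if ops is None:
        # Some required operations were already trimmed from the translog.
        return ("file_based", [])
    return ("operation_based", ops)


# The primary has assigned seq_nos 1..5 but only retains 3..5 in its translog.
primary = Primary(
    max_seq_no=5,
    translog=[Operation(n, "index", f"doc-{n}") for n in (3, 4, 5)],
)
print(recover(primary, 2)[0])  # replica briefly offline
print(recover(primary, 0)[0])  # replica too far behind
```

In this model, a replica that was only briefly offline (checkpoint 2) gets the three missing operations, while one that is further behind (checkpoint 0) needs a file-based sync because operations 1 and 2 are no longer in the translog.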
Lucene discussing optimisations for leading wildcards
Fast infix searching (*abc*) or even just leading wildcard searching (*abc) has come up on Lucene's users list in the past, but has never been implemented, in part because it's seen as an abuse of a search engine: really, you should do a good job tokenizing during indexing up front so that you don't need such costly sub-token operations at search time. But it's also partly because nobody had enough of an itch to actually do the work, that is, until now! The initial patch on the issue is too invasive and very heap heavy (using suffix arrays), and it doesn't use the most efficient (yet complex) known approach for building suffix arrays. Still, the subsequent discussion is a nice demonstration of how healthy open source iterations unfold: one response is to fold the approach into a custom PostingsFormat, while another is to use FSTs to reduce the heavy heap usage. It's not clear how the issue will finish, but it's possible Lucene will soon offer a better solution for infix searches.
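To see why suffix arrays make infix search fast but memory hungry, here is a deliberately naive sketch. It is not the approach taken in the Lucene patch: the function names are invented for this example, and it stores every suffix as a full string, whereas a real suffix array stores offsets and is built with a far more efficient algorithm. The idea is simply that any infix of a term is a prefix of one of its suffixes, so a binary search over sorted suffixes finds all matches.

```python
from bisect import bisect_left


def build_suffix_array(terms):
    """Return sorted (suffix, term) pairs for every suffix of every term."""
    return sorted((term[i:], term) for term in terms for i in range(len(term)))


def infix_search(suffix_array, infix):
    """Find all terms containing `infix` by prefix-matching the sorted suffixes."""
    # (infix,) sorts just before any (suffix, term) whose suffix starts with infix.
    start = bisect_left(suffix_array, (infix,))
    matches = set()
    for suffix, term in suffix_array[start:]:
        if not suffix.startswith(infix):
            break  # matching suffixes are contiguous, so we can stop here
        matches.add(term)
    return sorted(matches)


# A tiny, made-up term dictionary.
terms = ["elastic", "lucene", "plastic", "search"]
suffixes = build_suffix_array(terms)
print(infix_search(suffixes, "last"))
```

Even in this toy version the index holds one entry per character of every term, which hints at where the heap pressure comes from.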
Changes in 5.2:
- Lucene upgraded to 6.4.1, which fixes a memory leak when using best_compression and another leak when using span queries with the query cache.
- Reindex-from-remote should not complain when it fails to clear-scroll against an old cluster, which may have cleared the scroll automatically.
- Reindex-from-remote should explicitly request the _source when reindexing from pre-2.0 clusters.
- Painless bug fixes: Fix def invoked qualified method refs, and don't allow casting from void to def.
- Field names with .. should be rejected.
- The cluster-allocation-explain API should include information about stale shard copies when explaining an unassigned primary.
- The HDFS repository path setting was not settable.
- The mime4j library was missing in the ingest attachment and mapper attachment plugins.
- Added the new unified highlighter.
- Progress on the Java High Level REST client continues with: parsing the root cause from Elasticsearch exceptions, and support for get() and exists() methods.
- The enable_position_increments parameter was misspelled.
- Add Painless support for multi-value date fields.
- Add an option to make a valid content-type header required on REST requests.
- Stored scripts will no longer use the language as a namespace but instead should be uniquely identifiable by id.
- Geo-distance calculation sloppy_arc is deprecated in favour of arc.
- When a primary migrates from an old node to a new node which supports sequence numbers, any operations without sequence numbers in the translog should still be replayed.
- Single-doc updates should use the bulk action instead, like index and delete.
- Removed deprecated geo-search features: geo_distance_range query, and coerce and ignore_malformed mapping parameters.
- Certain replica failures (e.g. refresh or global checkpoint syncs) should not result in shard failure.
- The Azure repository plugin will no longer autocreate the Azure container.
- The S3 repository region setting has been removed in favour of the existing endpoint setting.
- Connect socket permissions have been removed from core.
- The translog needs to be made sequence-number aware so that enough translog generations can be kept around to ensure that all operations after a certain sequence number can be replayed.
- Could Painless be made to work outside Elasticsearch?
- Doc values will be formatted according to their field type instead of returning their internal representation.
Apache Lucene
- A 6.4.1 release candidate is available, fixing two memory leak issues, and a 5.5.4 release will likely come soon to fix the same issues
- Query parsers will soon analyze the token graph for articulation points, or cut vertices, in order to create more efficient queries for multi-token synonyms
- Suffix arrays allow for fast infix and suffix searching at the expense of added RAM, but the current patch should really be reworked as a custom postings format
- LatLonPoint's polygon query gets faster by pre-computing the relations for the polygon with a grid, giving a big jump in our nightly geo benchmarks
- ToParentBlockJoinQuery now implements two-phased iteration so that it can be executed more efficiently when added to a BooleanQuery along with other required clauses
- Lucene's default analyzer, StandardAnalyzer, should not remove English stop words by default
- Multi polygons around the 180th meridian, such as areas around the Fiji islands, result in poor polygon filtering performance using LatLonPoint today, but the potential fix proves controversial
- BKDReader, used to visit matching dimensional points for a query, could more efficiently pre-allocate up front for the expected number of hits
- Our geo distance queries could get faster with a better first pass approximation than a bounding box
- We will deprecate the postings-based GeoPointField since the dimensional points implementation is faster and smaller
- It is far too easy to accidentally create an index that messes up block joins
- Should UpgradeIndexMergePolicy, used today only by the IndexUpgrader tool, be generalized for wider use?
- Our GPG KEYS file was suddenly empty, blocking the 6.4.1 release for a bit and causing us to revisit simplifying how the KEYS file is generated
- AnalyzingInfixSuggester now defers opening an IndexWriter until new suggestions need to be indexed
- Lucene will not get an implementation of a logistic regression classifier yet because there is no simple way to train the logistic regression model in Lucene
- SortedNumericDocValues can soon be wrapped as a ValueSource using a selector, making their values accessible to things like function queries and expressions
- The CachingNaiveBayesClassifier is failing to compute the prior probability on the categories
- ComplexPhraseQueryParser confusingly requires double-escaping a trailing colon inside a phrase
- The CannedTokenStream class, used for testing tokenizers, failed to include the type attribute in each token
- LatLonPoint's distance query does way too much work when the area within the specified distance crosses the 180th meridian
- ToParentBlockJoinCollector is now gone, replaced with a new more efficient query wrapper to locate matching child documents for a given parent document, allowing us to remove the odd dependency between the join module and the grouping module
- The XML query parser now implements the SpanQueryBuilder interface
- Javadocs for the join module had some small typos
- The recent optimization to estimating how many points a given shape will visit caused another fun test failure, but it was fortunately just a test bug
Watch This Space
Stay tuned to this blog, where we'll share more on the whole Elastic ecosystem, including news, learning resources and cool use cases!