This Week in Elasticsearch and Apache Lucene - 2016-09-19
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Building an AR navigation system for visually impaired users with ElasticSearch https://t.co/YGMd4tPu6H @elastic @openstreetmap @mapzen
— Erik Schlegel (@erikschlegel1) August 29, 2016
Elasticsearch Core
Changes in 2.x:
- DiskThresholdDecider now counts the size of incoming shards but no longer discounts the size of outgoing shards.
- Another geo query (GeoPointMultiTermQuery) should be skipped when extracting terms for highlighting, but MultiTermQuery inside function scores should not be skipped.
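The DiskThresholdDecider change above can be pictured with a small sketch. This is not the Elasticsearch implementation; the function name and arguments are hypothetical, and it only illustrates the accounting rule: space for shards relocating onto a node is reserved, while space held by shards relocating away is not treated as free until they are actually gone.

```python
def projected_free_bytes(free_bytes, incoming_shard_sizes, outgoing_shard_sizes):
    """Estimate usable free space on a node (hypothetical helper, not ES code).

    Incoming shards will consume disk, so their sizes are subtracted.
    Outgoing shards are deliberately ignored: their space is only counted
    as free once the relocation has finished and the files are deleted.
    """
    _ = outgoing_shard_sizes  # intentionally not added back
    return free_bytes - sum(incoming_shard_sizes)


# 100 bytes free, two shards arriving, one shard leaving
remaining = projected_free_bytes(100, [30, 20], [40])
```

The conservative choice here is that a node never looks emptier than it currently is, at the cost of occasionally delaying an allocation.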
Changes in 5.0:
- Append-only indexing with autogenerated IDs is faster because we don't need to check for an existing doc before appending the new one.
- Fsyncing the transaction log is now asynchronous, which allows indexing and document replication to continue making progress. Big performance improvement for spinning disks.
- Upgraded logging to use Log4j2. Logs now include full class names. Log config uses the properties syntax, and a useful error message is thrown if no config file is found.
- Deprecation logging is enabled by default, but deprecation log files have a size limit and are auto-rotated.
- Geo-point fields now use Lucene's LatLonPoint implementation, which doubles the speed of geo-queries.
- The update-aliases command now supports deleting an index at the same time as adding an alias, making atomic index swapping possible.
- An Elasticsearch node should refuse to start in the presence of unsupported indices.
- Master election should choose the node with the highest cluster state version to avoid losing allocation info (and thus acknowledged writes).
- Windows' service.bat renamed to elasticsearch-service.bat.
- When upgrading a package, the bin/lib/modules directories should be removed, but the scripts directory shouldn't.
- The Search and SearchService modules have been deguiced.
- The discovery-file plugin reads unicast hosts from a text file before each ping round, which makes the list of unicast hosts updatable.
- Painless is the new default scripting language. Inline scripts without a specified lang which already exist in Watcher or percolator will use script.legacy.default_lang (Groovy) for backwards compatibility.
- Regular expressions are disabled in Painless by default as pathological regexes can be dangerous.
- Ingest now has a JSON processor and a dotexpander processor, and most processors support an ignore_missing flag.
- Bootstrap checks now have easier-to-read exceptions.
- Ubuntu 16.04 is now officially supported.
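As an illustration of the atomic index swap mentioned in the update-aliases item above, here is a minimal sketch that only builds the request body. The index and alias names are made up, and the remove_index action name is an assumption taken from the aliases API as I understand it, so check it against the documentation for your version before relying on it.

```python
import json


def atomic_swap_body(alias, new_index, old_index):
    """Build an update-aliases body that adds an alias and deletes an index.

    Both actions travel in one request, so clients never observe a state
    where neither the old index nor the new alias exists.
    """
    return {
        "actions": [
            {"add": {"index": new_index, "alias": alias}},
            # Assumed action name for deleting an index in the same call.
            {"remove_index": {"index": old_index}},
        ]
    }


# Hypothetical names: replace the index "logs" with an alias "logs" -> "logs_v2"
body = atomic_swap_body(alias="logs", new_index="logs_v2", old_index="logs")
payload = json.dumps(body)
```

The payload would be sent as the body of a POST to the _aliases endpoint; sending it is left out so the sketch stays free of client-library assumptions.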
Changes in master:
- The cluster name is no longer allowed in the data path.
- The FORCE version type is no longer allowed.
- Removed the option to allow unquoted JSON keys.
- Numeric fields are no longer included-in-all.
Ongoing changes:
- The rank evaluation framework now supports templated searches.
- Transport is being deguiced.
- List templates with cat-templates API.
- Should we remove the option to disable system-level bootstrap checks?
- Reindex API should only return the first 50 failures.
- Make searches cancelable with the task management API.
- Similarities should be dynamically updatable.
- Query string query params lowercase_expanded_terms and locale are no longer needed.
Apache Lucene
- It looks like the next patch release of IBM's J9 JVM can pass all Lucene tests
- Lucene 5.5.3 was released and the 6.2.1 release will be out soon
- Lucene's internal doc values API will soon switch from random access to an iterator, enabling previously difficult codec-level optimizations
- A long-standing rare overflow bug can cause ArrayIndexOutOfBoundsExceptions when skipping across many documents in a large index
- Lucene should probably not apply English stop-words by default, but should still make it simple to provide stop-words to the default StandardAnalyzer
- TestBoolean2 keeps falsely failing with OutOfMemoryError
- CustomScoreQuery should not score hits that an embedded BooleanQuery will filter out
- The new MinHashFilter was not validating incoming arguments
- RangeField gets a newCrossesQuery to find all documents intersecting a strict subset of the query range
- The subset of Wikipedia documents that some Lucene tests use contains some too-massive terms, sometimes tripping up tests
- The graduation of StandardAnalyzer to Lucene core's default analyzer caused some havoc because package names of popular analysis components were changed, and some factories are now in a different package than their filters
- Tests should not rely on wall-clock time
- FuzzyQuery now matches all terms, even short ones, within the requested edit distance, and its sources in Lucene's master are substantially simpler
- RangeField tests were frequently testing empty ranges
- Our release tools struggle to check backwards compatibility against our numerous point releases, were missing some features from their original Perl version, don't handle a mis-typed GnuPG password correctly and do not work with git 2.9.3
- SpanNotQuery was already generalized to accept a near parameter, and nested span queries somehow broke between 4.10.x and today
- Confusingly, FuzzyQuery won't match short terms correctly, but this is a long-standing legacy issue that may be tricky to fix
- Another user hits the common pitfall of query parsers not analyzing the text in a wildcard query
- A new UnifiedHighlighter generalizes PostingsHighlighter to be able to pull offsets from either postings, term vectors or via re-analysis
- The classic highlighter no longer throws an IllegalArgumentException on a MultiPhraseQuery with only one clause
- SpanNearQuery should accept a minimumNumberShouldMatch option, but perhaps TermAutomatonQuery could also be used
- Some small cleanups of unused code in LogMergePolicy
- MinHashFilter accidentally had a package private constructor, making it challenging to use directly
- Unordered SpanNearQuery can miss hits and can include the wrong hits when groups are repeated, and nested span queries somehow broke between 4.10.x and today
- How to handle holes created during analysis (by stages like StopFilter) is tricky
- A tricky Geo3D failure happens when a quantized point sits just above the north pole
- Should SpanNotQuery let you specify the allowed range of overlap between sub-queries?
- DelegatingAnalyzerWrapper fails to delegate pre-tokenization character normalization
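The wildcard pitfall mentioned in the list above is easy to reproduce in miniature. The sketch below does not use Lucene at all: analyze is a toy stand-in for an analyzer (lowercase plus whitespace split), and wildcard_matches is a hypothetical helper that matches a pattern against the indexed terms. It only shows why a pattern that skips analysis can miss terms that were lowercased at index time.

```python
from fnmatch import fnmatchcase


def analyze(text):
    """Toy analyzer: lowercase the text and split on whitespace."""
    return text.lower().split()


def wildcard_matches(pattern, terms, analyze_pattern=False):
    """Return the indexed terms that match a wildcard pattern.

    With analyze_pattern=False the pattern is used verbatim, which mirrors
    a query parser that does not analyze wildcard text.
    """
    if analyze_pattern:
        pattern = pattern.lower()
    # fnmatchcase is case-sensitive on every platform
    return [term for term in terms if fnmatchcase(term, pattern)]


indexed_terms = analyze("Quick Brown Fox")  # ['quick', 'brown', 'fox']

raw_hits = wildcard_matches("Qui*", indexed_terms)
normalized_hits = wildcard_matches("Qui*", indexed_terms, analyze_pattern=True)
```

Here raw_hits is empty because "Qui*" is compared against the lowercased term "quick", while normalized_hits finds it once the pattern gets the same lowercasing as the indexed text.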
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!