This Week in Elasticsearch and Apache Lucene - 2016-01-25
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
Building an image search engine with deep learning and #Elasticsearch: https://t.co/noehFj4y45 #machinelearning pic.twitter.com/D0dJQyye7X
— elastic (@elastic) January 21, 2016
Elasticsearch Core
Changes in 2.2:
- The new completion suggester has been backed out of 2.2 because it was a breaking change. It will now only appear in 3.0.
- The TransportClient now uses settings after they have been updated by any plugins.
- Azure plugins now support a timeout setting, which defaults to 5m.
- A corruption bug in Lucene 5.4 has been fixed by upgrading to Lucene 5.4.1.
- Command line options to elasticsearch.bat are now correctly handled on Windows.
- Elasticsearch can now be installed on Windows in paths which contain parentheses.
- A circular reference in translog exceptions, which could cause a stack overflow, has been fixed.
- Aliases pointing to closed indices now respect the ignore_unavailable option during search.
- An unknown type in the URL of a search request no longer results in a match_all query.
- It is (temporarily) possible to disable the jar hell check during testing.
- Systemd now sends STDOUT to the journal instead of /dev/null to make startup failures easier to debug.
Changes in 2.x:
- It is no longer necessary to re-specify changes to metadata field mappings when updating mappings.
- Minimum-should-match is no longer incorrectly applied to query-time synonyms.
- Adding LeafReaders to a closed index no longer causes reference leaks.
- Mapping changes now know if they have originated from a mapping update or, for instance, after a node restart, allowing different validation logic to be applied.
Changes in master:
- All index-level settings are now validated and resettable, including on metadata upgrade, and all known global settings are validated at startup.
- A user-configurable safeguard has been added to throw an exception if more than 50 nested fields are added to an index.
- The new scripting language now supports exception handling with throw & try/catch, and can detect infinite loops.
- After a restart, Elasticsearch now prefers to allocate primary shards to the nodes which previously held the primary shard.
- The _parent field is no longer stored or indexed - querying by parent ID can be done with the new parent_id query.
- The load-avg metric will return values for 1m, 5m, and 15m, as an object with named keys rather than as an array.
- Guice is no longer used in query parsers. This change is controversial because it impacts the search refactoring.
- In the task management framework, the task object is now available to the action which initiated the task, so that the request can return the task ID.
- Channel failures while a shard is starting are now retried when the master is missing.
- Shard state action channel exceptions are now handled correctly whether they are local or remote.
- Translogs no longer use reference counting.
- The cluster reroute API now supports forcing the allocation of stale primaries (i.e. not the latest copy) and empty primaries.
- Match and query string queries can now search IPv4 fields for a single IPv4 address.
- Site plugins have been removed - a blog post will demonstrate how to migrate site plugins to Kibana.
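To make two of the master changes above a little more concrete, here is a rough Python sketch that only builds the JSON request bodies and never contacts a cluster. The index, type and node names and both helper function names are made up for illustration. The `allocate_stale_primary` command name and its `accept_data_loss` flag are assumptions based on how the reroute feature was exposed in later releases, so check the docs for your version before relying on them.

```python
import json


def parent_id_query(child_type, parent_id):
    """Build a search body that finds the children of one parent document."""
    return {"query": {"parent_id": {"type": child_type, "id": parent_id}}}


def allocate_stale_primary(index, shard, node):
    """Build a cluster reroute body that forces allocation of a stale primary."""
    return {
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    # Assumed flag: acknowledges that newer writes may be lost.
                    "accept_data_loss": True,
                }
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical names: a "comment" child type, a "blog" index, "node-1".
    print(json.dumps(parent_id_query("comment", "42"), indent=2))
    print(json.dumps(allocate_stale_primary("blog", 0, "node-1"), indent=2))
```

The first body would be sent to a search endpoint and the second to the cluster reroute endpoint. Forcing a stale primary is a last resort, which is why the sketch makes the data-loss acknowledgement explicit.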
Ongoing:
- Translog recovery in 2.1 and 2.2 is very slow because shards in recovery are not marked as active and have a small indexing buffer.
- Work continues on splitting the string field type into text and keyword types, including making the "index" param accept true/false, being stricter about boolean settings in mappings, and enabling doc values even when a field is not indexed.
- Query refactoring is moving forward with rescore, suggester and sort elements, but a design bug in the aggregation refactoring (one aggregator instance per node instead of per shard) will require a significant amount of work to fix.
- A new "search_after" parameter will allow for efficient deep pagination of search results.
- The node ingest API is close to being wrapped up.
- Nested inner hits sections don't work properly today.
- The percolator API should not introduce new fields into type mappings.
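For readers wondering how the "search_after" parameter mentioned above might be used, here is a minimal Python sketch that builds request bodies only. It assumes each hit in a sorted response carries a `sort` array, and that the next page is requested by passing those values back as `search_after`. The field names `timestamp` and `doc_id` and both helper functions are hypothetical, and the final API may differ since this work is still in progress.

```python
def first_page(size):
    """Build the initial request with a deterministic sort order."""
    return {
        "size": size,
        "query": {"match_all": {}},
        # A unique tiebreaker field keeps page boundaries stable.
        "sort": [{"timestamp": "asc"}, {"doc_id": "asc"}],
    }


def next_page(previous_body, last_hit):
    """Copy the previous body and resume after the last hit's sort values."""
    body = dict(previous_body)
    body["search_after"] = list(last_hit["sort"])
    return body


if __name__ == "__main__":
    page1 = first_page(2)
    # Pretend this was the last hit returned for page1.
    last_hit = {"_id": "b", "sort": [1453680000000, "b"]}
    print(next_page(page1, last_hit))
```

Unlike from/size paging, each request only needs the sort values of the previous page's last hit, so the cost should not grow with how deep into the results you are.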
Apache Lucene
- The move from Subversion to Git finally happened on Saturday January 23, and we will try to make a simple guide showing how to do the common operations. The controversy over whether to squash during merge or not has already begun.
- Both Lucene 5.3.2 and 5.4.1 are released!
- Lucene now supports the divergence from independence models, causing fun test failures
- Don't treat the smallest possible norm value as an infinitely long document in SimilarityBase or BM25Similarity
- LuceneTestCase will now use standardized language tags to represent the randomized Locale
- We now default the expert applyAllDeletes setting to true when opening near-real-time readers
- RAMDirectory, and all other directory implementations, now throw EOFException if you seek beyond the end of the file
- Dimensional values, new in Lucene 6.0.0, is now renamed to point values
- Geopoint query tests were failing because of a bug in the horizontal co-linear case, and fixing the geo tests to not pre-quantize their randomly generated lats and lons seems to have uncovered additional failure cases
- SpanMultiTermQueryWrapper no longer modifies its wrapped query
- SpanQuery helpers to extract TermContext objects are now public
- BM25Similarity is now more careful about checking the parameter values you pass to it
- StoredDocument is now removed, clearing the way for 6.0.0 release soon
- Our default BytesTermAttribute implementation hits NullPointerException if the term is null
- Some nice performance gains are coming to geo point queries by customizing how terms are created from the geohashes
- IndexableField.tokenStream should not throw IOException?
- TokenStream is complex to work with, and sometimes confusing when you violate one of its requirements
- Minimum should match and synonyms struggle to co-exist in query parsers in Lucene 5.x
- PrefillTokenStream lets you specify exactly which tokens to iterate
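As a rough illustration of what stricter parameter checking for a BM25-style similarity could look like, here is a small Python sketch. It is not Lucene's code: the function names are hypothetical, the exact rules (a finite, non-negative k1 and a b between 0 and 1) are an assumption about what such checks would reasonably enforce, and the scoring function is the textbook BM25 term formula with the commonly used defaults of k1 = 1.2 and b = 0.75.

```python
import math


def check_bm25_params(k1, b):
    """Reject parameter values that would make BM25 scores meaningless."""
    if not math.isfinite(k1) or k1 < 0:
        raise ValueError("k1 must be finite and non-negative, got %r" % (k1,))
    if math.isnan(b) or b < 0 or b > 1:
        raise ValueError("b must be between 0 and 1, got %r" % (b,))


def bm25_term_score(idf, tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 score for one term in one document."""
    check_bm25_params(k1, b)
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)


if __name__ == "__main__":
    print(bm25_term_score(idf=2.0, tf=3, doc_len=120, avg_doc_len=100))
    try:
        bm25_term_score(idf=2.0, tf=3, doc_len=120, avg_doc_len=100, b=1.5)
    except ValueError as err:
        print("rejected:", err)
```

Failing fast on bad parameters is usually friendlier than silently producing NaN or negative scores at query time.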
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!