This Week in Elasticsearch and Apache Lucene - 2016-06-27
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
“How Airbnb manages to monitor customer issues at scale” by @AirbnbEng https://t.co/5BtU7Yc9Y6 #nodejs
— Joe McCann (@joemccann) June 15, 2016
Elasticsearch Core
Changes in 2.x:
- The .scripts index now obeys the number_of_shards setting.
- Deprecation logging for `_timestamp` and `_ttl`.
- Failed synced flushes were reporting an incorrect number of failures.
- The index-exists request shouldn't fail if the index is being recovered.
- A valid translog file can be deleted incorrectly after a disk full exception and multiple attempts to recover.
Changes in master:
- The low-level Java REST client has landed. It is functionally equivalent to the REST clients available in other languages.
- The `index.store.preload` setting can preload the specified Lucene files (e.g. doc values, norms) into MMAP before a segment comes online. This completes the replacement of warmers.
- The cluster health no longer turns red when creating an index, unless there is a problem assigning shards.
- The default similarity is now BM25.
- The `_timestamp` and `_ttl` fields will not be supported on indices created in 5.x.
- The `fields` parameter has been removed in favour of `stored_fields`, `docvalue_fields` and (for `text` fields only) `fielddata_fields`.
- Some percolator queries don't need in-memory validation to ensure that they match.
- Painless now has capturing lambdas, supports adding static methods like `each` to whitelisted classes, and has syntax for initialising arrays, lists and maps.
- Nested inner hits no longer return `_index`, `_type`, and `_id`, and parent/child inner hits don't return `_index`.
- `string` fields weren't upgraded to `text`/`keyword` if `include_in_all` was specified.
- Getting a task with wait_for_completion will return the task result.
- Nodes info returns the calculated size of the total indexing buffer.
- Analysis factories are now MultiTermAware, which will help to remove the lowercase_expanded_terms from the query string query, and to support keyword analyzers on the `keyword` field.
- JNA is now a required dependency.
- Guice has been removed from the script service.
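
For readers wondering what the switch to BM25 means for scores, here is a minimal sketch of the per-term BM25 formula. It assumes the commonly cited defaults `k1 = 1.2` and `b = 0.75` and the IDF form used by Lucene's `BM25Similarity`; the function names are hypothetical, and the sketch ignores Lucene's lossy encoding of document length, so it illustrates the formula rather than reproducing exact Elasticsearch scores.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """Inverse document frequency: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    doc_count: int, doc_freq: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Score of one query term against one document (hypothetical helper).

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document's field, in terms
    avg_doc_len -- average field length across the index
    doc_count   -- number of documents that have the field
    doc_freq    -- number of documents that contain the term
    """
    # Length normalisation: longer-than-average documents are penalised.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # Term-frequency saturation: extra occurrences help less and less.
    return bm25_idf(doc_count, doc_freq) * tf * (k1 + 1.0) / (tf + norm)
```

Unlike classic TF/IDF, the term-frequency part saturates: for a document of average length with a single occurrence of the term, the factor is exactly 1 and the score equals the IDF, and further occurrences can raise it by at most a factor of `k1 + 1`.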
Ongoing changes:
- Sequence number checkpoints are persisted to disk when a segment is flushed.
- Reindex-from-remote now uses the Java REST client.
- Ensure that primary handover while indexing does not cause a deadlock.
- The index file which lists the snapshots in a repository should be written atomically.
- The `discovery-azure` plugin doesn't work with the security manager.
- It shouldn't be necessary to wait for status yellow before working with a newly created index.
- Add helpers to make JSON easier to render in Mustache.
- The SynonymQuery should be used for alternative terms, instead of the Bool query.
- More time zone edge case bug fixes.
- Changes to shard store fetching are required in order to allow for inline rerouting during node join.
- Analysis components should implement AnalysisPlugin instead of calling registerTokenizer, allowing Guice to be removed from Hunspell.
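
As background for the item on writing the snapshot index file atomically, the usual local-filesystem technique is to write to a temporary file and then rename it over the target. The sketch below assumes a plain local directory, and the helper name is hypothetical; it is not a description of how Elasticsearch repositories (which may live on remote blob stores) implement this.

```python
import os
import tempfile


def write_atomically(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk before the rename
        os.replace(tmp_path, path)  # atomic rename over the target
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A reader that opens the file at any point sees either the previous complete contents or the new complete contents.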
Apache Lucene
- 5.5.2 RC2 release vote is underway
- A tricky randomized `explain` test failure turns out to be a test bug in a recently added test case
- `Math.toRadians` and `Math.toDegrees` are now banned, since their implementation changes slightly across Java versions, impacting our geo tests
- `RandomAccessFilterStrategy` comes back to life for faster filter intersection in some cases
- Multi-term queries that match no terms rewrite to `MatchNoDocsQuery` instead of an empty `BooleanQuery`, making it much simpler to add a helpful reason to `MatchNoDocsQuery`
- The new Ukrainian lemmatizer uses `MorfologikFilter` with a custom dictionary for efficient dictionary-based Ukrainian analysis
- Lucene's confusing and bushy `IndexReader` hierarchy strikes again
- `RAMDirectory` now also enforces write-once files, and `MockDirectoryWrapper` now tries harder to corrupt unsync'd index files on close
- `GeoPoint` gets some code cleanups
- Eclipse now also fails on unused imports
- Auto-prefix terms have been removed since dimensional points are better
- `CompressionTools` has been removed
- `ForbiddenAPIs` is upgraded to version 2.2
- It's important to fsync files after copying them via Lucene's `Directory`!
- A tricky test failure was holding up the 5.5.2 release process
- Some minor code improvements to `SearchGroup`
- Can we improve the default behavior of query parsers and multi-term queries?
- A test bug in `MoreLikeThisTest` still remains tricky to fix
- `MoreLikeThis` should not invoke `toString` on a `Field` object
- `ScandinavianFoldingFilterFactory` and `ScandinavianNormalizationFilterFactory` are safe for multi-term queries
- In the possibly not-rare case where many documents share the same point value, we can better compress the `docIDs`
- The ancient query norm and coord blocks progress and should be removed
- Should we add a lightweight Ukrainian stemmer?
- Updating doc values and then using delete-by-query with a doc values query doesn't always work, but fixing it is likely not feasible
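
The reminder above about fsyncing files after copying them applies well beyond Lucene: a copy that has only reached the operating system's cache can be lost in a crash. The following is a generic sketch of the copy-then-fsync pattern using only the Python standard library; the helper name is hypothetical and this is not Lucene's `Directory` API.

```python
import os
import shutil


def copy_and_fsync(src: str, dest: str) -> None:
    """Copy `src` to `dest` and force the copied bytes to stable storage."""
    with open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()             # push Python's buffer to the OS
        os.fsync(fout.fileno())  # ask the OS to write the file to disk

    # On POSIX systems, also fsync the parent directory so the new
    # directory entry itself survives a crash.
    if os.name == "posix":
        dir_fd = os.open(os.path.dirname(os.path.abspath(dest)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Without the explicit sync, a copied file can look complete to the application and still be truncated or missing after a power failure.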
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!