This Week in Elasticsearch and Apache Lucene - 2015-11-30
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
From the Found vault: Understanding the Memory Pressure Indicator https://t.co/rQiXobeeDr #Elasticsearch
— elastic (@elastic) November 27, 2015
Elasticsearch Core
Changes in 2.1:
- Users were unable to upgrade indices using field datatypes that have been moved to plugins (_size, murmur3, attachment) as the mapping upgrade checks happened before the plugins registered their datatypes.
- All shards logged a nasty (but innocent) warning about a missing translog file when starting 2.1.0.
- Similarly, writing to already closed translogs was causing confusing log messages which should have been handled more gracefully.
- We were not logging the root cause when failing to change a shard's indexing buffer.
- When using field constraints with the field stats API, the response should exclude indices that do not contain the field.
- Added sanity checks that the Lucene version that an Elasticsearch Version object declares is consistent with the indices that we use to test backward compatibility.
Changes in 2.x:
- The mapper attachments plugin will be available again in Elasticsearch 2.2.
- The field stats API should return both the string and numeric values of date fields for consistent sorting.
- Stats for the completion suggester, which often appeared in hot threads output, are now computed more efficiently.
- The delayed allocation time for unassigned shards should be updated on every reroute calculation.
- Locking down of JVM permissions continues, now preventing the JVM from spawning new processes on BSD/OSX and on Windows.
- Most of the BWC tests have been reenabled, bar one for the FunctionScore query which still depends on Groovy.
- Upgraded Lucene 5.4 to a snapshot which includes a fix for the off-by-one issue with sparse doc value fields.
Changes in master:
- All query parsers now use ParseField, making it easier to support future deprecations.
- Allocation IDs are now persisted in the shard state metadata, and the current allocation IDs will soon be persisted in the index metadata. This will make it possible to choose the most recent shard copy when recovering an index.
- Full exception objects are now logged in many places instead of just the output of getMessage.
- More tests have been refactored to not rely on Groovy.
- The following aggregations have been refactored: geobounds, scripted metric, cardinality, filter, missing, nested and reverse_nested, children, cumulative sum, and geo-centroid, leaving 19 still to do.
- Most GeoShape builders are now Writable, with an open PR for the remaining builders.
- CIDR expressions were parsed too leniently.
- Many, many PRs to improve the Gradle integration.
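The CIDR item above does not say which inputs used to be accepted, so the following is only a general illustration of the difference between lenient and strict CIDR parsing, not a description of the Elasticsearch fix. It is a minimal sketch using Python's standard `ipaddress` module; the function names `parse_cidr_lenient` and `parse_cidr_strict` are made up for this example.

```python
import ipaddress


def parse_cidr_lenient(expr):
    """Lenient parsing: host bits are silently masked off."""
    return ipaddress.IPv4Network(expr, strict=False)


def parse_cidr_strict(expr):
    """Strict parsing: reject anything that is not an exact network/prefix pair."""
    if "/" not in expr:
        # Without this check a bare address would be accepted as a /32 network.
        raise ValueError("missing prefix length: %r" % expr)
    # strict=True raises ValueError when host bits are set (e.g. 192.168.1.7/24).
    # Malformed addresses and out-of-range prefixes also raise ValueError subclasses.
    return ipaddress.IPv4Network(expr, strict=True)


if __name__ == "__main__":
    print(parse_cidr_lenient("192.168.1.7/24"))
    for candidate in ["192.168.1.0/24", "192.168.1.7/24", "192.168.1/24"]:
        try:
            print(candidate, "->", parse_cidr_strict(candidate))
        except ValueError as err:
            print(candidate, "rejected:", err)
```

A lenient parser quietly turns a questionable expression into some valid range, which may not be the range the user meant; a strict parser surfaces the problem at request time.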
Ongoing:
- Work has started on the reindex and task manager APIs.
- Query profiler being updated after review.
- Splitting the `string` field type into `text` and `keyword` fields.
- Adding a mechanism for batching up cluster state updates, and using that in the tribe node, and when processing shard started and shard failed events.
- Make the BulkProcessor back off and retry when the bulk queue is full.
- Add detailed response support for the _analyze API.
- The `fields` option should only load stored fields, not from _source. This also allows it to support wildcards.
- Node ingest:
  - Individual items in bulk requests can now fail without failing the whole bulk request.
  - The mutate processor has been split out into a processor per function.
  - Added a meta processor for meta fields like _index. Turns out meta fields are reserved anyway, and so can be handled by standard processors instead.
  - Ingest should be able to access and modify list items.
  - Much discussion on how to handle processors that fail.
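The BulkProcessor item above describes backing off and retrying when the bulk queue is full. As a rough illustration of that general pattern (not the actual BulkProcessor code or API), here is a minimal sketch of exponential backoff with jitter. Every name in it (`QueueFullError`, `send_with_backoff`, the `send` callable, and the default delays) is hypothetical.

```python
import random
import time


class QueueFullError(Exception):
    """Hypothetical stand-in for a 'bulk queue is full' rejection."""


def send_with_backoff(send, batch, max_retries=5, base_delay=0.05, sleep=time.sleep):
    """Call send(batch), retrying with exponential backoff on QueueFullError.

    The sleep function is injectable so the behaviour can be tested without waiting.
    """
    for attempt in range(max_retries + 1):
        try:
            return send(batch)
        except QueueFullError:
            if attempt == max_retries:
                # Retries exhausted: let the caller decide what to do.
                raise
            # Double the delay on every attempt and add random jitter so that
            # many clients do not all retry at the same instant.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))


if __name__ == "__main__":
    attempts = []

    def flaky_send(batch):
        # Simulated sender: rejects the first two attempts, then succeeds.
        attempts.append(batch)
        if len(attempts) < 3:
            raise QueueFullError()
        return "indexed %d docs" % len(batch)

    print(send_with_backoff(flaky_send, [{"id": 1}, {"id": 2}]))
```

Only the queue-full rejection is retried here; any other exception propagates immediately, since retrying a malformed request would not help.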
Apache Lucene
- The 5.4.0 release branch is cut!
- Don't use `null` to represent sorting by relevance!
- `GeoPointDistanceQuery` matches the wrong documents when it has to cross the dateline
- `FacetsConfig.getDimConfig` need not be synchronized since it's using a `ConcurrentHashMap` under-the-hood
- 1D dimensional values are now faster, for indexing and searching, and smaller, for index size on disk and heap used at search time, than numeric fields
- If you search on a massive shape, such that it manages to wrap around and span the entire earth, we should rewrite that to just match all documents with the field
- The Python script to work around the slow Apache git mirror keeps getting better
- Sparsely populated doc values on a large segment tickled an off-by-one bug, discovered by Elasticsearch nightly benchmarks
- `PhraseQuery` does, in fact, allow more than one term at the same position, but it's interpreted differently (conjunction) than `MultiPhraseQuery` (disjunction)
- We can once again sync directory metadata on Java 9, but the OpenJDK issue is still open
- `IOUtils.fsync` should not retry on hitting `IOException` since that means a great disturbance has occurred
- Factor out components of the XML query parser so consumers can inherit from classes without requiring `sandbox` module code
- Optimize stored fields retrieval to avoid skipping the last field in the document if it's not needed, but this will only help if the last field is more than 16 KB at default settings
- A Coverity scan of Lucene found some minor things to fix
- `JapaneseTokenizer` should offer more than two possible tokenizations
- `Equals` methods are tricky
- Should we remove `Scorer.getChildren`?
- Allow `XMLQueryParser`'s TestParser to be extended so Solr test cases can subclass it
- The new sandbox geo APIs struggle with large errors when the shapes are very large
- `JoinUtil` should support joins on numeric doc values fields
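The PhraseQuery item above contrasts conjunction and disjunction when several terms share a query position. To make that distinction concrete, here is a toy model in plain Python. It is not Lucene code and does not reflect how Lucene implements either query; the function `phrase_matches` and the list-of-sets document representation are assumptions made purely for illustration.

```python
def phrase_matches(doc, query, require_all):
    """Toy positional phrase match.

    doc and query are lists of sets of terms, one set per position.
    require_all=True models the conjunction reading (every query term at a
    position must be present in the document at that position);
    require_all=False models the disjunction reading (any one term suffices).
    """
    for start in range(len(doc) - len(query) + 1):
        if all(
            (wanted <= doc[start + offset]) if require_all
            else bool(wanted & doc[start + offset])
            for offset, wanted in enumerate(query)
        ):
            return True
    return False


if __name__ == "__main__":
    # A document indexed without synonyms, and one where "fast" was indexed
    # at the same position as "quick".
    plain_doc = [{"the"}, {"quick"}, {"fox"}]
    synonym_doc = [{"the"}, {"quick", "fast"}, {"fox"}]
    # Query with two terms at the first position, followed by "fox".
    query = [{"quick", "fast"}, {"fox"}]

    for name, doc in [("plain", plain_doc), ("synonym", synonym_doc)]:
        print(name, "conjunction:", phrase_matches(doc, query, require_all=True))
        print(name, "disjunction:", phrase_matches(doc, query, require_all=False))
```

Under the conjunction reading only the document that carries both terms at the same position matches, while the disjunction reading matches either document.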
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!