This Week in Elasticsearch and Apache Lucene - 2016-06-20
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
Sat on this draft for a week, and well, it's Friday. Yolo. ElasticSearch at petabyte scale on AWS: https://t.co/XuEAbfFJsk
— Jamie Alquiza (@jamiealquiza) June 17, 2016
Elasticsearch Core
Changes in 2.x:
- More timezone bug fixes with DST edge cases.
- Admin and diagnostic HTTP requests are excluded from the in-flight circuit breaker.
- Individual bulk items now carry headers and context for security checks.
- The AWS cloud plugin now throttles retries.
Changes in master:
- Upgraded to Lucene 6.1.0.
- Painless now supports regexes with native regex syntax, regex flags, and find() and match() operators, as well as array constructor references, more efficient maths operations, less boxing when types are known, more efficient dynamic operators, and non-capturing lambdas. The syntax for a single-expression lambda is also simpler.
- Fields of type `half_float` use 16 bits to represent a smaller range of floats, ideal for metrics, instead of 32 for `float`.
- Terms aggs automatically choose `breadth_first` mode when it makes sense to reduce memory usage and improve performance.
- The index-rollover API rolls an alias over to a new index when the existing index is too big or too old.
- Long running tasks now persist their results to the `.tasks` index.
- Field stats requests are cached in the results cache.
- Ingest has a script processor.
- Removed support for size=0 in aggregations as it hides the real cost from the user.
- Selected Lucene files can be preloaded into mmap on refresh.
- Search preferences _prefer_node:id and _only_node:id have been removed in favour of _prefer_nodes:spec and _only_nodes:spec.
- Elasticsearch now has infrastructure for microbenchmarks.
- The plugin installer emits a more understandable error message when an incorrect plugin name is specified, and provides did-you-mean suggestions.
- Node join updates to the cluster state are processed in batches, and the first cluster state after a node joins can now include allocations to that node.
- Individual _msearch responses include the HTTP status code.
- Script field entries are now returned even if the value is null.
- We no longer fork Joda.
- Replication tests test replication with real shards, but without the overhead of a full node.
- Groovy scripts are compiled in their own classloader, which can be GC'ed by Java.
- A shard should be able to cancel check index when it is closed.
- The quest to remove Guice from the codebase continues.
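To make the `half_float` trade-off from the list above concrete, here is a minimal Python sketch. It assumes the field follows the IEEE 754 binary16 layout, which the notes above do not spell out, and it uses Python's `struct` module purely for illustration rather than anything from Elasticsearch. The helper names `roundtrip`, `storage_bytes` and `fits_half` are hypothetical.

```python
import struct

HALF = "<e"    # IEEE 754 binary16 (assumed layout for half_float)
SINGLE = "<f"  # IEEE 754 binary32, i.e. a regular float


def roundtrip(value, fmt):
    """Pack a Python float into the given format and unpack it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def storage_bytes(fmt):
    """Number of bytes one value occupies in the given format."""
    return struct.calcsize(fmt)


def fits_half(value):
    """Return True if the value can be packed as a 16-bit float."""
    try:
        struct.pack(HALF, value)
    except (OverflowError, struct.error):
        return False
    return True


sample = 0.1
half_error = abs(roundtrip(sample, HALF) - sample)
single_error = abs(roundtrip(sample, SINGLE) - sample)

print("bytes per value:", storage_bytes(HALF), "vs", storage_bytes(SINGLE))
print("rounding error for 0.1:", half_error, "vs", single_error)
print("70000.0 fits in 16 bits:", fits_half(70000.0))
```

The 16-bit format halves storage but rounds more coarsely and overflows at much smaller magnitudes, which is why it suits metrics with a modest range rather than arbitrary values.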
Ongoing:
- Reindex-from-remote is waiting for the Java HTTP client to be merged.
- Configurable shard weights for better shard balancing is proving tricky.
- Creating an index should not turn the cluster red.
- Get-task with ?wait_for_completion should return the task result.
- Sequence number checkpoints should be persisted to disk when a segment is flushed.
- The analyze API should support configuration of custom tokenizers and filters.
- Mustache should know how to render JSON variables.
- The percolator continues to receive performance improvements.
Apache Lucene
- We now create new files using the `StandardOpenOption.CREATE_NEW` flag, to ensure index files are really write-once
- A new LSH (locality sensitive hashing) `TokenFilter` and query is an alternative to the standard `MoreLikeThisQuery`
- The new Ukrainian lemmatizer uses `MorfologikFilter` with a custom dictionary for efficient dictionary-based Ukrainian analysis
- `MemoryIndex.toString` breaks if payloads are used
- Lucene now catches you if you try to re-open a near-real-time reader after forcefully recreating your index
- `WordDelimiterFilter` should respect the `KeywordAttribute` but not change incoming keyword tokens
- `IndexWriter`'s commit data is now late-binding
- ASM is updated to version 5.1
- How can we fix grouping and early termination to work correctly together?
- Heatmap facets would sometimes produce incorrect counts
- `TermRangeQuery` has outlived its usefulness
- One of our randomized tests discovered that the file "con" is illegal on Windows
- Multi term queries that match no terms rewrite to `MatchNoDocsQuery` instead of an empty `BooleanQuery`, as a possible precursor to adding a helpful reason to `MatchNoDocsQuery`
- `GeoTestUtil` is now gone, replaced by more stressful randomized point and shape generation
- A tricky test failure turned out to be a concurrency bug in `IndexWriter`'s new sequence numbers
- Some link anchors in Lucene's javadocs were broken links
- `CharacterUtils`, containing helpful APIs for working with Unicode, has leftover cruft from the Java 4 days
- Pre-commit now fails on unused imports and we have removed all existing unused imports
- A crazy test bug led to thread starvation which sometimes caused the test to time out at 2 hours
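The write-once guarantee behind the `StandardOpenOption.CREATE_NEW` item above can be sketched outside of Java as well. The following Python example is only an analogy: it uses Python's exclusive-creation mode `"x"`, which, like `CREATE_NEW`, fails if the file already exists. It is not Lucene code, and the function names and the file name `segment_1.dat` are hypothetical.

```python
import os
import tempfile


def create_write_once(path, data):
    """Create `path` and write `data`, failing if the file already exists."""
    # Mode "xb" requests exclusive creation: the open call raises
    # FileExistsError instead of silently truncating an existing file.
    with open(path, "xb") as handle:
        handle.write(data)


def demo():
    """Try to write the same file twice and report what survived."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "segment_1.dat")  # hypothetical name
        create_write_once(path, b"first")
        try:
            create_write_once(path, b"second")
        except FileExistsError:
            overwritten = False
        else:
            overwritten = True
        with open(path, "rb") as handle:
            return overwritten, handle.read()


result = demo()
print(result)
```

Because the second write is rejected at open time, an accidental attempt to reuse a file name surfaces as an error rather than as silent corruption of existing data.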
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!