This Week in Elasticsearch and Apache Lucene - 2016-08-16
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
Powering Transactions Search with Elastic – Learnings from the Field https://t.co/BbcGgmgeGV
— PayPal Developer (@paypaldev) August 11, 2016
Elasticsearch Core
Changes in 2.x:
- The mapper.allow_dots_in_name setting disables the dots-in-fieldname check, which will allow users stuck on 1.x to upgrade to 2.4.0.
- Update Jackson to 2.6.6 Final.
- Prebuild Japanese stopwords token filter.
- The completion suggester now returns documents as results instead of doc_values/payloads.
- Add support for upgrading field mappings which have dots in the fieldname to treat dots as path separators.
- Specifying more than one field name in the short query form now results in an exception instead of being silently ignored.
- max_local_storage_nodes now defaults to 1; it must be overridden to start multiple nodes in the same data directory.
- Snapshot deletions first check whether a restore is already in progress.
- Script compilation is now subject to a circuit breaker to enforce the use of named params.
- Explain with DFS query now uses global term statistics.
- Scroll requests in 5.0 were not being renewed.
- The BoostingQuery didn't work with the fast vector highlighter.
- Netty4 wasn't handling Expect: 100-continue headers correctly.
- Added workarounds because Docker doesn't handle seccomp calls correctly.
- The RoutingNodes interface has been cleaned up and minimised to make it easier to ensure its invariants.
- Keyword fields now use binary doc values instead of string to avoid encoding as UTF8 twice (once for indexing and once for doc values).
- The analyze API shouldn't result in caching tokenizers or filters.
- Analyzer aliases are no longer supported. Old indices using aliases will be upgraded correctly.
- Fatal errors such as OOM should cause the JVM to exit, even if thrown in unprivileged code like the scripting engine, but OOM and StackOverflow errors in Painless are safe to catch.
- Groovy asserts should not cause the JVM to exit.
- Most geo-distance helper methods in scripts have been removed in favour of arcDistance and planeDistance.
- VersionFetchSubPhase was fetching the docId, even though it was already known.
- The jvm.options file wasn't handling spaces correctly on Windows.
- Internet Explorer can't handle multiple CORS headers, but expects comma-separated values instead.
- The query slow log was missing node name and shard ID.
- The lang-javascript plugin now works with reindex.
- NamedWritableRegistry is now immutable and takes all readers at construction time, instead of relying on Guice for injection. Extension points exist for plugins.
- Gradle now checks for // norelease again.
- We should be explicit about which annotation processors should run.
- Benchmarks show that the Java HTTP client performs as well as the TransportClient.
- The rank evaluation framework continues to evolve.
- Should we add an option to not return metadata with hits?
- Apps like Kibana need to be able to reindex their indices when upgrading.
- If we can improve Lucene's caching, we can speed up primary key lookups during indexing.
- Should the _all field be disabled by default?
- Suppressing AlreadyClosedException with mmap can be disastrous.
- Figuring out which shard allocations have been made should be less costly.
- Geo-distance calculations should only use arc, to be consistent with geo-distance queries.
- Log4j is being upgraded to Log4j2.
- Should implicit casting in Painless be implemented as its own phase, instead of piggybacking on the analysis phase?
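The Internet Explorer item above comes down to sending one header with comma-separated values instead of repeating the same header name. As a rough illustration only (this is Python, not the Elasticsearch code, and fold_headers is a hypothetical helper name), the folding could look like this:

```python
def fold_headers(pairs):
    """Merge repeated header names into a single comma-separated value.

    `pairs` is a list of (name, value) tuples; header names are matched
    case-insensitively and the first spelling seen is kept.
    """
    folded = {}  # lower-cased name -> (original name, combined value)
    for name, value in pairs:
        key = name.lower()
        if key in folded:
            first_name, existing = folded[key]
            folded[key] = (first_name, existing + "," + value)
        else:
            folded[key] = (name, value)
    return list(folded.values())


# Illustrative input: the same CORS header emitted twice.
headers = [
    ("Access-Control-Allow-Methods", "GET"),
    ("Access-Control-Allow-Methods", "POST"),
    ("Access-Control-Allow-Headers", "Content-Type"),
]
result = fold_headers(headers)
print(result)
```

The header names and values here are example data, not taken from the actual fix.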
Apache Lucene
- Lucene will soon try harder in its best-effort check to detect when MMapDirectory is being used after being closed, since that can cause a SIGSEGV which terminates the JVM
- A doubt from a user about Lucene's newish query cache leads to adding a clarifying comment to Lucene's sources
- The Lagrangian bounds computation in geo3d had a degenerate corner case
- Switching even numeric doc values to an iterator API, instead of random access, is challenging
- MockDirectoryWrapper, used during Elasticsearch and Lucene tests to ensure the store level APIs are being used correctly, will now detect when a clone of a closed IndexInput is being used
- PrefixQuery and Automaton now make slightly fewer object allocations
- A new regular expression engine using Memory Occurrence Automata may lead to better regular expression queries
- MoreLikeThisTest test keeps failing
- Concordance searching, letting you click through every term hit and its surroundings in your result set, is now available for Lucene via Maven
- Making delete-by-query work with doc-values queries is horribly complex and it may make more sense to remove doc-values queries instead
- The APIs to track external data structures along with Lucene's LeafReaders are trappy
- IntRangeField, FloatRangeField and LongRangeField, letting you index a range and search by ranges overlapping the indexed ranges, are coming soon
Watch This Space
Stay tuned to this blog, where we'll share more on the whole Elastic ecosystem, including news, learning resources and cool use cases!