This Week in Elasticsearch and Apache Lucene - 2016-12-12
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
End-to-end Recommender System with #ApacheSpark and #Elasticsearch https://t.co/g6OmcVaz0D
— Spark Tech Center (@apachespark_tc) December 5, 2016
Elasticsearch Core
Changes in 2.x:
- Add a HostFailureListener to the Transport client to notify client code if a node is disconnected.
- Fix NPE when referencing a non-existent field data type in an attachment sub-field.
Changes in 5.1:
- Field names with dots should not allow intermediate nested fields, only object fields.
- Netty should be able to read the system-wide configuration for socket connection backlog on Linux from /proc/sys/net/core/somaxconn.
- Preserve the original hostname in DiscoveryNode and TransportAddress, and when pinging.
- SearchTemplateRequest should work with wildcard indices when running with Security enabled.
- Reduce memory pressure and garbage collection when sending very large term queries.
- Wildcard and span queries were not working on specialised fields like _all and _type.
- Add an option to skip install-time configuration of vm.max_map_count on systemd distros.
- Set JVM thread stack size on Windows for 64-bit JVMs.
- Fixed handling of multiple spaces in paths on Windows.
- Return correct term statistics when a field is not found in a shard.
- In-sync shard lists should only be trimmed when the list grows, to avoid removing valid shard copies too early.
- Create requests should reject external versioning.
- A shard marked locally as relocated should be allowed to flush and force-merge.
- The slow log should be single line, not pretty-printed.
- The FiltersAggregationBuilder should rewrite filters early to avoid exceptions when using now.
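As a rough illustration of the somaxconn item above, the following sketch reads the backlog value from /proc/sys/net/core/somaxconn and falls back to a default when the file is missing or malformed. It is not the Netty or Elasticsearch implementation: the class name, the method names and the fallback value of 128 are assumptions made for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class SomaxconnReader {
    // Hypothetical fallback, used when the file is missing, unreadable or malformed.
    static final int FALLBACK_BACKLOG = 128;

    /** Parses the file content; returns the fallback for non-numeric or non-positive input. */
    static int parseBacklog(String content, int fallback) {
        try {
            int value = Integer.parseInt(content.trim());
            return value > 0 ? value : fallback;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    /** Reads the first line of the given file and parses it as the backlog size. */
    static int readBacklog(Path path, int fallback) {
        try {
            List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
            return lines.isEmpty() ? fallback : parseBacklog(lines.get(0), fallback);
        } catch (IOException | SecurityException e) {
            // Non-Linux systems and restricted environments end up here.
            return fallback;
        }
    }

    public static void main(String[] args) {
        int backlog = readBacklog(Paths.get("/proc/sys/net/core/somaxconn"), FALLBACK_BACKLOG);
        System.out.println("socket backlog: " + backlog);
    }
}
```

Keeping the parsing separate from the file access makes the fallback behaviour easy to check on systems that have no /proc filesystem.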
Changes in 5.x:
- The synonyms_graph token filter now allows multi-token synonyms to work correctly with phrase and proximity queries.
- Added new numeric and date range fields, along with range query support.
- Synonyms in cross-field multi-match query were not being expanded to all fields.
- Replaced connectToNodeLight with connection profiles, and use profiles to reduce the number of connections needed for non-data nodes. Also support per-node connection timeouts.
- Fail node joins, index restore, open, or upgrade for index versions which are not supported by all nodes in the cluster.
- When reindex and friends have to retry a request, they should do so with the same context as the original request.
- Enabled system call filters which cannot be applied should prevent node startup.
- The task manager now returns human readable descriptions for ongoing snapshot and restore tasks, reindex, delete-by-query, and update-by-query tasks, search tasks, and bulk tasks.
- Scripts should treat ip fields as strings.
- Unindexed fields are now visible in the field-stats API.
- X-Content parsing has been moved from RestAction to be centralised in RestRequest.
- Nodes should complete the handshake process before publishing the connection.
- Fuzzy query is no longer deprecated.
Changes in master:
- Removed the old default store type (hybrid NioFS and MMAPFS).
- Removed the deprecated indices query.
- Do not reply to ping requests from other clusters.
- Do not update nodes list when stepping down as master.
Ongoing changes:
- Work has started on adding a high level Java REST client with Java dev-friendly requests, builders and response parsers.
- When promoting a replica to be primary in a mixed node cluster, choose a replica on a node with the lower version.
- Deprecate shadow replicas.
- Single index and delete operations can be replaced by the bulk API internally.
- Enable strict duplicate checks in JSON.
- Expose disk usage in node stats.
- A low-level protocol handshake would exchange version API info whenever a new transport connection is made, allowing for version changes without cluster restarts.
- The indices-boost query learns to support wildcard indices and aliases.
Apache Lucene
- The latest Java 9 early access build broke Lucene's best-effort attempt to use unmap with MMapDirectory, causing Uwe Schindler to post a stern email to the Jigsaw dev list
- Dimensional points now require substantially less search-time heap in some cases, as seen by the Lehman Brothers magnitude drop here (~59% reduction, annotation R)
- Numeric doc values should not let outliers, such as taxi cabs that can drive faster than the speed of light, blow up index storage of all documents
- UnifiedHighlighter now lets you highlight text from other fields
- Lucene should sort new segments when they are initially written, giving a nice speedup to sorted indexing throughput
- Buffering up small leaf-block writes gives a small boost to dimensional points indexing (annotation S)
- If a merge hits a tragic exception while commit is running it may cause IndexWriter to deadlock
- Lucene should not let you update a doc values field if it is used in the index sort
- A new SpansTreeQuery recurses the tree of span queries to compose the score based on the type of sub queries, implementing a 9 year old idea suggested on Lucene's users list
- FacetQuery and MultiFacetQuery are new sugar classes to make it easier to implement facet drill downs
- Programs like virus checkers can cause IndexWriter's commit to fail when it tries to fsync files
- We should enable more pre-commit checks from the Eclipse ecj compiler we already use in our build
- DrillSideways, letting you see other facet counts even after you've drilled down, should use threads to gain concurrency
- We should use the same string constants for the same parameter names across our analysis factories
- The Terms.intersect API is less trappy now
- The Terms.intersect API is trappy
- IndexWriter's javadocs concerning NFS and IndexDeletionPolicy were confusing
- PrefixCodedTerms should perhaps cache its hash-code, though it's controversial and perhaps only queries should do so
- UnifiedHighlighter should use the SpanCollector API to be more accurate with nested span queries, and it should let you highlight text from other fields too
- If IndexWriter hits a tragic event when too many merges are running it can lead to deadlock
- Lucene's maven build fails to generate the javadocs jar
- The smartcn tokenizer does not handle appended Chinese punctuation marks correctly
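To see why a single outlier can inflate numeric doc values storage, here is a simplified sketch. It assumes values are stored as fixed-width offsets from the minimum of a block, so the widest value sets the bit width for every value in that block. This is an illustration of the general idea only, not Lucene's actual encoding; the class name, the method names, the sample data and the block size of 128 are hypothetical.

```java
public class OutlierBits {
    /** Bits needed to store any value in [min, max] as an offset from min. */
    static int bitsRequired(long min, long max) {
        long range = max - min; // assumes the subtraction does not overflow
        return range == 0 ? 0 : 64 - Long.numberOfLeadingZeros(range);
    }

    /** Bit width needed for values[from, to), taken from the block's own min and max. */
    static int bitsForBlock(long[] values, int from, int to) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (int i = from; i < to; i++) {
            min = Math.min(min, values[i]);
            max = Math.max(max, values[i]);
        }
        return bitsRequired(min, max);
    }

    /** Total bits used when every block of blockSize values picks its own width. */
    static long totalBits(long[] values, int blockSize) {
        long total = 0;
        for (int from = 0; from < values.length; from += blockSize) {
            int to = Math.min(values.length, from + blockSize);
            total += (long) bitsForBlock(values, from, to) * (to - from);
        }
        return total;
    }

    /** 1024 made-up "speed" values in 0..99, plus one absurd outlier at index 0. */
    static long[] sampleValues() {
        long[] values = new long[1024];
        for (int i = 0; i < values.length; i++) {
            values[i] = i % 100;
        }
        values[0] = 1_000_000_000L;
        return values;
    }

    public static void main(String[] args) {
        long[] values = sampleValues();
        System.out.println("one global width: " + totalBits(values, values.length) + " bits");
        System.out.println("per-block widths: " + totalBits(values, 128) + " bits");
    }
}
```

With one global width, the outlier forces every value to use the wide encoding. With per-block widths, only the block that contains the outlier pays for it.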
Watch This Space
Stay tuned to this blog, where we'll share more on the whole Elastic ecosystem, including news, learning resources and cool use cases!