Elasticsearch 7.14 introduces match_only_text, a new field type that can be used as a drop-in replacement for the
text field type in logging use cases with a much lower disk footprint, leading to lower costs.
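As a minimal sketch of what "drop-in replacement" means in practice, the request body below maps a message field as match_only_text where it would otherwise have been text. The Python wrapper is only there to build and print the JSON; the field names follow the examples used later in this post and are assumptions about your data, not requirements.

```python
import json

# Body for a create-index request. Only the "type" of the message field
# changes compared to a mapping that uses the text field type.
index_body = {
    "mappings": {
        "properties": {
            "@timestamp": {"type": "date"},
            # Previously: {"type": "text"}
            "message": {"type": "match_only_text"},
        }
    }
}

print(json.dumps(index_body, indent=2))
```

The printed JSON can be sent as the body of a create-index request on a 7.14 or later cluster.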
Elasticsearch is attractive for log analysis thanks to its ability to index log messages. Want to count how many log messages contain
access denied in the last 24 hours? Elasticsearch can give you the answer in milliseconds thanks to its index structures — but index structures take CPU time to build and need disk space. You could save this CPU and disk space by not indexing your
message fields, but then you would also lose the ability to query your logging data in an interactive way.
To reduce disk space requirements, match_only_text indexes only a subset of the information that text fields index. This brings the following downsides:
- Relevancy scores are computed as the number of matching terms. This typically doesn't matter for logging use cases, as documents are sorted by descending timestamp rather than by relevance score.
- Span queries are unsupported. All other types of queries that are supported on text fields are also supported on match_only_text fields.
- Phrase and intervals queries run slower than with text fields, yet still much faster than a linear scan. Other types of queries run as fast as, if not slightly faster than, on text fields.
We ran a variety of benchmarks with this new field type and observed an average 10% reduction in the size of indices containing application logs. This is the second significant index size decrease we have introduced in recent months: in 7.10, an improvement to stored fields compression reduced index size by around 10% as well.
As of 7.14, application logs indexed with Elastic Agent will use match_only_text instead of text for the message field, and we plan to roll out this change to our integrations starting with 7.15. All you have to do to benefit from these space savings is to upgrade to a new version of the Elastic Stack.
How does it work under the hood?
match_only_text is a new field type that trims everything that text fields compute and index that is not crucial for log analysis:
- length normalization factors
- term frequencies
- positions
Length normalization factors and term frequencies are only used to compute scores, so dropping them was an easy win: relevance scoring rarely matters for logging use cases, where events are sorted by descending timestamp rather than by relevance.
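For intuition, what remains in the index is roughly what a text field would keep with norms disabled and index_options set to docs. The comparison below is a sketch for illustration only, not a description of how match_only_text is implemented. The variable names are hypothetical. Note that a text field configured this way has no positions to run phrase queries with, whereas match_only_text can still answer them as described next.

```python
# Comparison sketch: two field mappings that keep a similar amount of
# information in the inverted index.
text_without_scoring_data = {
    "type": "text",
    "norms": False,           # no length normalization factors
    "index_options": "docs",  # doc IDs only: no term frequencies, no positions
}

match_only_text_field = {"type": "match_only_text"}

for name, mapping in [
    ("text (trimmed by hand)", text_without_scoring_data),
    ("match_only_text", match_only_text_field),
]:
    print(name, "->", mapping)
```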
Positions are more interesting, since the
text field type uses them to run positional queries such as phrase queries or intervals queries. So how does
match_only_text run phrase queries? This new field type borrows ideas from runtime fields. In order to run phrase queries, it will load the value of your field from the
_source of the document to check whether terms actually occur at consecutive positions. But it does so only when strictly necessary to verify whether a document matches. For instance, if your query is
log.level: warn AND message:"node left" and you have a range filter on your
@timestamp field, Elasticsearch will only load the
_source of documents that match all required clauses as well as the terms of the phrase query. So in this case, it will only load the
_source of documents that:
- match the range filter on @timestamp,
- and have warn as their log.level,
- and contain both node and left in their message field.
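The example query above could be written as the following search body. This is a sketch under stated assumptions: the 24-hour window is an illustrative choice, and log.level is assumed to be mapped as a keyword field so that a term query applies.

```python
import json

# Search body equivalent to: log.level: warn AND message:"node left",
# restricted by a range filter on @timestamp (window chosen for illustration).
search_body = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-24h"}}},
                {"term": {"log.level": "warn"}},
                # On a match_only_text field, this clause is verified against
                # _source only for documents that pass the cheaper checks.
                {"match_phrase": {"message": "node left"}},
            ]
        }
    },
    # Logging use cases typically sort by time rather than by score.
    "sort": [{"@timestamp": "desc"}],
}

print(json.dumps(search_body, indent=2))
```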
As a result, while
match_only_text performs slower than
text on phrase and intervals queries, it still performs much better than a linear scan.
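The two-step idea can be illustrated with a toy model in Python. This is not Elasticsearch or Lucene code: the whitespace tokenizer, the in-memory index, and all function names are assumptions made for the sketch. It shows only the principle, which is to use a positions-free index to narrow down candidates and then read the original text of those candidates alone to verify the phrase.

```python
from typing import Dict, List, Optional, Set, Tuple


def tokenize(text: str) -> List[str]:
    # Toy analyzer: lowercase and split on whitespace.
    return text.lower().split()


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    # Docs-only postings: term -> doc IDs, with no frequencies or positions.
    index: Dict[str, Set[int]] = {}
    for doc_id, message in docs.items():
        for term in tokenize(message):
            index.setdefault(term, set()).add(doc_id)
    return index


def contains_phrase(tokens: List[str], terms: List[str]) -> bool:
    # True if the terms occur at consecutive positions in tokens.
    n = len(terms)
    return any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1))


def phrase_search(
    docs: Dict[int, str],
    index: Dict[str, Set[int]],
    phrase: str,
    prefilter: Optional[Set[int]] = None,
) -> Tuple[List[int], List[int]]:
    terms = tokenize(phrase)
    if not terms:
        return [], []
    # Phase 1: index-only check. Keep documents containing every term
    # (and passing any other required clause, modeled here by prefilter).
    candidates = set.intersection(*(index.get(t, set()) for t in terms))
    if prefilter is not None:
        candidates &= prefilter
    # Phase 2: load the original text of candidates only and verify order.
    loaded, matches = [], []
    for doc_id in sorted(candidates):
        loaded.append(doc_id)  # stands in for reading _source
        if contains_phrase(tokenize(docs[doc_id]), terms):
            matches.append(doc_id)
    return matches, loaded


docs = {
    1: "node left the cluster",
    2: "left node untouched",
    3: "disk usage high",
    4: "master node left",
}
index = build_index(docs)
matches, loaded = phrase_search(docs, index, "node left")
print(matches, loaded)  # [1, 4] [1, 2, 4]
```

Document 3 is never loaded because it fails the index-only check, and document 2 is loaded but rejected because its terms appear in the wrong order. The more selective the other clauses are, the fewer documents need the second phase.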
We encourage you to try it out in your existing deployment, or spin up a free trial of Elasticsearch Service on Elastic Cloud, which always has the latest version of Elasticsearch. We’re looking forward to hearing your feedback, so please let us know what you think on Discuss.