Luca Wintergerst

Elasticsearch over the years — how LogsDB cuts index size by up to 75% at no throughput cost

By default, Elasticsearch is optimized for retrieval, not storage. LogsDB changes that. Here's the layered architecture behind a 77% index size reduction.

17 min read

Elasticsearch was built as a search engine. That heritage has a cost for log storage: every event fans out to multiple on-disk structures, each optimized for retrieval rather than compression. LogsDB changes that tradeoff. On our nightly benchmark, Enterprise mode produces a 37.5 GB index from the same data that takes 161.9 GB without LogsDB — a 77% reduction from a single setting.

The write overhead

Lucene, the library underneath, keeps multiple structures for every indexed document:

  • The inverted index maps terms to documents. This is what makes text search fast.
  • _source stores the original JSON blob, returned when you fetch a document.
  • Doc values store field values in columns for sorting and aggregation.
  • Points / BKD trees index numeric and date fields for range queries.

The inverted index earns its keep: it's what lets you search a billion log lines by keyword in milliseconds, and there's no cheaper way to build that capability. _source exists to give you back exactly what you indexed: search results and GET requests return this blob directly. The problem is that it stores the full event even though the same field values are already available through doc values and the other structures.

Take a log event with fields like host.name, @timestamp, http.response.status_code, and duration_ms. The entire event is serialized as JSON in _source. The same field values are also written into doc values columns, indexed into the inverted index, and stored in BKD trees for range queries. Same data, multiple structures, each with its own on-disk footprint.

For a search engine where you need fast retrieval across all dimensions, that overhead is a reasonable tradeoff. For logs, where you rarely need the raw JSON and almost never do relevance-ranked search, much of it is pure waste.

One write, four on-disk structures: _source (the raw JSON blob), the inverted index, doc values columns, and BKD / points trees for numeric range queries. The same field values end up in multiple places.

Why columnar storage matters for compression

Doc values are the key to everything LogsDB does. Unlike _source, which stores entire documents as blobs, doc values store each field as a separate column across all documents in a Lucene segment.

Picture a segment with a million log events. The _source representation is a million JSON blobs, one per event, each containing all fields jumbled together. The doc values representation is a set of columns: one column of a million timestamps, one column of a million host names, one column of a million status codes, and so on.

Row-oriented _source keeps all fields for each document in one blob — doc0 through doc5 each carry host.name, @timestamp, status, duration_ms, and more jumbled together. Column-oriented doc values restructure the same data so all host.name values sit in one column, all timestamps in another, all status codes in another. Compression codecs can then run on each contiguous column independently.

That columnar layout is what makes per-column compression possible. When all values of http.response.status_code sit in a contiguous column, Lucene can apply codecs that exploit patterns in the sequence.

Delta encoding stores differences between adjacent values instead of full values. GCD encoding finds a common factor and divides everything down. Run-length encoding collapses repeats. Lucene picks the codec per segment and re-evaluates when segments merge.

Four sorted @timestamps from the same host, compressed in four stages. RAW: four 32-bit integers, 128 bits total. DELTA: store differences instead of full values — base stays, deltas +100, +200, +300 take 59 bits. GCD: divide out the common factor of 100, leaving 1, 2, 3 at 39 bits. BIT-PACK: pack those three small integers into contiguous bit storage, 9 bits freed.

But here's the catch: these codecs only work well when adjacent documents have correlated values. Consider the @timestamp column.

If logs arrive from dozens of hosts interleaved randomly, the timestamps in the column jump around. The delta between adjacent values might be +3 seconds, then -47 seconds, then +120 seconds. Delta encoding can't do much with that.

Now consider what happens if you sort by host.name and @timestamp before writing to the segment. All logs from host-A land in a contiguous run, followed by all logs from host-B, and so on. Within each host's run, the timestamps are monotonically increasing and the deltas are predictable.

Four timestamps from the same host might look like 1706745600, +100s, +200s, +300s. Delta encoding shrinks those to a base value plus three small integers.

GCD encoding finds that 100, 200, 300 are all divisible by 100 and stores 1, 2, 3 instead. Bit-packing then fits those three values into a handful of bits. The same pattern applies to fields like host.name, service.name, or http.response.status_code: within a sorted run, long stretches of identical values collapse to near nothing under run-length encoding.
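The delta, GCD, and bit-width stages above can be sketched in a few lines of Python. This is a simplified illustration (deltas taken against a base value, as in the figure), not Lucene's actual codec:

```python
import math

def compress_sorted(values):
    """Toy delta -> GCD -> bit-width pipeline for a sorted column.

    Stores a base value plus per-document offsets, divided by their
    greatest common divisor; the minimum bit width is what a real
    bit-packer would use per offset.
    """
    base = values[0]
    deltas = [v - base for v in values[1:]]            # delta encoding
    g = math.gcd(*deltas) or 1 if deltas else 1        # GCD encoding
    reduced = [d // g for d in deltas]
    bits = max(d.bit_length() for d in reduced) if reduced else 0
    return base, g, reduced, bits

def decompress(base, g, reduced):
    """Invert the pipeline: base + offset * gcd per document."""
    return [base] + [base + d * g for d in reduced]

# Four sorted timestamps from one host: base plus +100s, +200s, +300s.
ts = [1706745600, 1706745700, 1706745800, 1706745900]
base, g, reduced, bits = compress_sorted(ts)
print(base, g, reduced, bits)   # deltas 100/200/300 reduce to 1, 2, 3
assert decompress(base, g, reduced) == ts
```

Three 32-bit timestamps shrink to three 2-bit integers plus the shared base and divisor, which is the whole point of sorting first: the codec only wins when the column's values arrive in a predictable sequence.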

Five hosts — api-01, api-02, db-01, web-01, web-02 — scattered randomly in arrival order (left). Sorting by host.name groups them into five contiguous blocks of eight (center). Run-length encoding collapses each block to a single (value, count) pair — 5 pairs stored instead of 40, the remaining slots freed (right).
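The sorted-versus-unsorted effect on run-length encoding is easy to reproduce. A minimal sketch (standard-library only; the hosts and counts mirror the figure, not a real dataset):

```python
import random
from itertools import groupby

def run_length_encode(column):
    """Collapse runs of identical adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

# 40 host names: five hosts, eight events each, interleaved randomly.
random.seed(7)
hosts = ["api-01", "api-02", "db-01", "web-01", "web-02"] * 8
random.shuffle(hosts)

unsorted_runs = run_length_encode(hosts)          # arrival order
sorted_runs = run_length_encode(sorted(hosts))    # host-first sort

print(len(unsorted_runs))   # many short runs in arrival order
print(sorted_runs)          # exactly 5 (value, count) pairs after sorting
assert sorted_runs == [("api-01", 8), ("api-02", 8), ("db-01", 8),
                       ("web-01", 8), ("web-02", 8)]
```

In arrival order RLE barely helps; after sorting, the whole column collapses to five pairs, which is the 40-slots-to-5-pairs reduction the figure describes.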

Historically, Elasticsearch never sorted segments by default. Documents landed in arrival order, with stored fields compressed by general-purpose DEFLATE. We left a lot on the table.

How we got here: 2012–2026

Few of the individual techniques in LogsDB were designed for logs. They were built over more than a decade to solve different problems, and LogsDB is what happens when you stack them.

The foundation (2012–2017). Lucene 4.0 introduced doc values in 2012. By Elasticsearch 5.0 in 2016, they were on by default for all keyword and numeric fields. Lucene 7.0 added sparse doc values, so fields that only appear in some documents don't waste space on every document in the segment. That fixed a significant force-merge bloat problem (up to 10× on sparse fields) and set up the storage model everything else depends on.

Dense encoding reserves an 8-byte slot per document regardless of presence. Sparse encoding stores only documents that have a value at 12 bytes each (value + doc ID). For error_code with 2 of 16 docs populated (12% fill), sparse is 81% smaller: 24 B vs 128 B. For request_path at 88% fill, sparse is larger: 168 B vs 128 B. Lucene picks per field; sparse wins below ~67% fill.
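The dense-versus-sparse arithmetic above works out as follows (the 8-byte slot and 12-byte sparse entry sizes are taken from the figure; real Lucene encodings vary by type):

```python
def dense_bytes(num_docs, slot_bytes=8):
    """Dense encoding: one fixed-width slot per document in the segment."""
    return num_docs * slot_bytes

def sparse_bytes(num_present, value_bytes=8, doc_id_bytes=4):
    """Sparse encoding: (value + doc ID) only for documents with a value."""
    return num_present * (value_bytes + doc_id_bytes)

segment_docs = 16

# error_code: present in 2 of 16 docs (12% fill) -> sparse wins big.
print(dense_bytes(segment_docs))   # 128 bytes
print(sparse_bytes(2))             # 24 bytes, 81% smaller

# request_path: present in 14 of 16 docs (88% fill) -> dense wins.
print(sparse_bytes(14))            # 168 bytes vs 128 dense

# Break-even: sparse is smaller while fill stays under 8/12 = ~67%.
assert sparse_bytes(10) < dense_bytes(segment_docs)   # 62% fill
assert sparse_bytes(11) > dense_bytes(segment_docs)   # 69% fill
```

The crossover at two-thirds fill falls straight out of the slot sizes: a sparse entry costs 12 bytes against a dense slot's 8, so sparse pays off once more than a third of the slots would sit empty.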

Incremental wins (2020–2021). Two smaller changes targeted observability workloads. Dictionary-based stored fields compression deduplicated repetitive string metadata for about a 10% win.

The match_only_text field type dropped term frequencies and positions from the inverted index. Term frequencies are what BM25 uses to score documents by relevance — how often a term appears in a document relative to the rest of the corpus. For log search that signal is meaningless: you don't care whether "timeout" appeared twice or seven times in a log line, you just want to find it. Positions are similar: they're stored so Elasticsearch can do exact phrase matching, but the position data is expensive and phrase queries on logs are rare enough that the tradeoff is worth it. When you do run a phrase query on a match_only_text field, it still works — it just falls back to a slower path that rescores candidates rather than using stored positions directly.

text stores each term with its frequency and every position it appears at. match_only_text keeps only the doc IDs — enough to find the document, nothing more. The timeout term appears twice in this message (positions 1 and 4), which is exactly the kind of data that gets dropped.
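The difference between the two postings shapes can be sketched with plain dictionaries. This is a toy model of the concept, not Lucene's postings format:

```python
from collections import defaultdict

def index_full(docs):
    """text-style postings: term -> list of (doc_id, freq, positions)."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term, plist in positions.items():
            postings[term].append((doc_id, len(plist), plist))
    return postings

def index_match_only(docs):
    """match_only_text-style postings: term -> doc IDs, nothing more."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return postings

docs = ["connection timeout while waiting timeout", "request completed"]
full = index_full(docs)
lean = index_match_only(docs)
print(full["timeout"])   # [(0, 2, [1, 4])] -- freq and positions kept
print(lean["timeout"])   # {0} -- just enough to find the document
```

The frequency 2 and positions [1, 4] are exactly the per-term payload that match_only_text drops; everything a filter query needs, the doc ID, survives.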

Dropping frequencies and positions cuts the inverted index for a text field by roughly 40%. The overall index impact in 2021 was only ~10%, which sounds like a poor return on a 40% field-level reduction. The reason is where storage was going at the time: _source was stored in full for every document as a raw JSON blob, doc values were uncompressed and unsorted, and nothing was using ZSTD. The message field's inverted index was a small slice of a much larger, poorly-compressed whole. As the next five years of work addressed those other structures, the same 40% field-level savings became a meaningful fraction of a much smaller total.

Neither change was decisive on its own, but they established that log-specific storage optimization was worth pursuing.

The TSDB turning point (April 2023). This is where the story really starts. We shipped synthetic _source and index sorting for time series metrics in Elasticsearch 8.7.

Synthetic source changes the write-and-read contract. At write time, we skip storing the raw JSON blob entirely. At read time, when a query needs to return the original document, we reconstruct it by reading each field's value out of doc values and stored fields and assembling them back into JSON. The result is functionally equivalent to the original _source (with minor differences like field ordering), but we never stored the blob.
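Conceptually, the read path looks like the sketch below: one lookup per field column, reassembled into a document. This is a simplification (real Lucene reads per-field doc values and stored fields, and type details differ), but the shape is the same:

```python
def reconstruct_source(doc_id, doc_values):
    """Toy synthetic _source: rebuild one document from columnar storage.

    doc_values maps field name -> column (one slot per document; None
    means the field is absent). Fields come back in sorted order, one
    of the minor differences from the original blob noted above.
    """
    return {field: column[doc_id]
            for field, column in sorted(doc_values.items())
            if column[doc_id] is not None}

# Columns for a three-document segment.
doc_values = {
    "@timestamp": ["2024-01-01T00:00:00Z", "2024-01-01T00:00:01Z",
                   "2024-01-01T00:00:02Z"],
    "host.name": ["api-01", "api-01", "db-01"],
    "http.response.status_code": [200, 500, None],
}

print(reconstruct_source(1, doc_values))
# {'@timestamp': '2024-01-01T00:00:01Z', 'host.name': 'api-01',
#  'http.response.status_code': 500}
```

Document 2 comes back without http.response.status_code at all, which is why sparse doc values matter here: absent fields cost nothing to store and nothing to reconstruct.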

Index sorting groups documents by dimension fields and timestamp before writing to disk. Together, synthetic source and index sorting cut metrics storage by up to 70%.

That result told us something important: the same architecture could work for logs.

Without LogsDB, Elasticsearch writes every log event twice: once as a raw _source blob on disk, once into doc values columns. LogsDB skips the blob entirely. At read time, a GET <index>/_doc/1 request gathers field values from doc values and assembles the document on the fly.

The TSDB codec (2024). In 8.13 and 8.14, we built a custom doc values codec with run-length encoding optimized for sorted consecutive values, PFOR-delta encoding, and cyclic ordinal encoding for multi-valued dimensions. The numbers were striking: kubernetes.pod.name doc values dropped from 110 MB to 7.25 MB in one benchmark. We extended coverage to all numeric and keyword types including ip, scaled_float, and unsigned_long.

LogsDB Tech Preview (August 2024). In 8.15, we combined everything into index.mode: logsdb: host-first sorting, synthetic _source, ZSTD compression, and the TSDB numeric codecs. One decision mattered more than expected: sort order. Sorting by host.name first, then @timestamp, delivers up to ~40% storage reduction. Sorting by timestamp first gives ≤10%. The host-first ordering co-locates documents that share field values, which is exactly what the numeric codecs need.

LogsDB requires auto-generated _id; user-provided IDs disable both synthetic source and the routing optimization.

ZSTD and GA (November–December 2024). In 8.16, we switched best_compression from DEFLATE to ZSTD permanently (level 3, blocks up to 2,048 documents or 240 kB, native bindings via Panama FFI on JDK 21+). ZSTD gave us ~12% smaller stored fields and ~14% higher indexing throughput at the same time, which almost never happens. LogsDB went GA in 8.17.

At GA, we claimed up to 65% storage reduction.

Routing and recovery (April 2025). In 8.18, route_on_sort_fields started routing documents to shards by sort field values instead of _id. Without this optimization, Elasticsearch hashes the _id to pick a shard, so logs from the same host scatter across all shards. With routing on sort fields, logs with similar host.name values land on the same shard. This co-locates similar documents at the shard level, not just within segments, adding ~20% storage reduction at a 1–4% ingest penalty.

Data stream .ds-logs-nginx-default-00001 with six hosts across three shards. STANDARD (hashed by _id): all host colors scattered randomly. ROUTED (route_on_sort_fields): same-host logs land on the same shard, but remain in arrival order within it. ROUTED + SORTED (host-first sort): each shard contains contiguous blocks of a single host — the combination that lets numeric codecs and RLE reach their full potential.

We also switched peer recovery to synthetic source reconstruction, eliminating the duplicate _recovery_source blob. In 9.0, logs-*-* indices default to LogsDB.

Nightly synthetic source benchmark, December 2024. Index size written drops 39% — from ~279 GB to ~171 GB — the day peer recovery switches from copying the raw _recovery_source blob to reconstructing documents from doc values.

Merge and recovery overhaul: 9.1 (July 2025). We fully eliminated the recovery source. Peer recovery uses batched synthetic reconstruction, cutting write I/O by ~50% and boosting median indexing throughput ~19% over the 8.17 baseline. We replaced up to four separate doc values merge passes with a single pass, cutting background merge CPU by up to 40%. And we swapped _seq_no's BKD tree for Lucene doc value skippers, halving _seq_no storage.

pattern_text and Failure Store: 9.2–9.3 (October 2025–February 2026). In 9.2, we shipped pattern_text as a Tech Preview: a new field type that decomposes log messages into static templates and dynamic variable parts. A log line like Session opened for user alice from 10.0.1.42 via TLS gets split into the template Session opened for user {} from {} via TLS (stored once, as a template ID) and the variables alice, 10.0.1.42 (stored per document). For logs with high template repetition, this cuts message field storage by up to 50%. A companion template_id sub-field lets you sort by template, and the LogsDB setting index.logsdb.default_sort_on_message_template enables this automatically. pattern_text went GA in 9.3.
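The template/variable split can be sketched with a regex and a template dictionary. The variable detector below (IPv4 addresses plus the word after "user ") is a hypothetical stand-in for whatever the real pattern_text tokenizer recognizes, and the T0-style IDs are illustrative:

```python
import re

# Toy variable detector: IPv4 addresses, or the token following "user ".
VARIABLE = re.compile(r"\d+(?:\.\d+){3}|(?<=user )\w+")

templates = {}   # template string -> template ID, shared across messages

def index_message(message):
    """Decompose a log line into (template_id, variables).

    The static template is stored once per distinct shape; only the
    variable values are stored per document.
    """
    variables = VARIABLE.findall(message)
    template = VARIABLE.sub("{}", message)
    tid = templates.setdefault(template, f"T{len(templates)}")
    return tid, variables

print(index_message("Session opened for user alice from 10.0.1.42 via TLS"))
# ('T0', ['alice', '10.0.1.42'])
print(index_message("Session opened for user bob from 10.0.1.87 via TLS"))
# ('T0', ['bob', '10.0.1.87'])
```

Both near-identical lines collapse to the same template ID, so per-document cost drops to the variable columns alone — and because the template ID is itself a low-cardinality column, it sorts and run-length-encodes well, which is what index.logsdb.default_sort_on_message_template exploits.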

TEXT stores each log message as a full string per document — eight copies of near-identical blobs. PATTERN_TEXT decomposes them: the shared template Session opened for user {} from {} via TLS is stored once with ID T0, and only the variable columns (user, ip) are stored per document — alice/10.0.1.42, bob/10.0.1.87, carol/10.0.2.11, and so on.

pattern_text does come with an indexing CPU cost: decomposing each message into template and variables takes more work at write time than storing a raw string. Whether that tradeoff makes sense depends on your dataset and your priorities.

If your log messages follow highly repetitive patterns (structured application logs, Kubernetes events, access logs), the storage wins are large and the CPU overhead is bounded. If your messages are free-form or low-repetition, the compression gains shrink while the CPU cost stays roughly the same.

For data you keep for months or years, the cumulative storage reduction usually makes it worthwhile. For high-cardinality, rapidly changing messages where storage isn't the constraint, it may not be.

9.3 also brought compression for binary doc values, making wildcard field types significantly more storage-efficient. Internally, wildcard fields store an inverted index of trigrams in a binary doc values column; that column is now compressed with Zstandard instead of being stored raw. In one benchmark, a URL field dropped from 2.92 GB to 1.12 GB, a reduction of more than 60%. If you use wildcard fields heavily, the gain is automatic, with no mapping changes needed.
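The trigram idea behind wildcard fields can be sketched briefly. This is a simplified model (the sentinel padding and verification step are assumptions, not Lucene's exact scheme):

```python
def trigrams(value):
    """Character trigrams of a padded string, wildcard-index style."""
    padded = f"\0{value}\0"   # sentinels mark start and end of the value
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def candidates(fragment, index):
    """Narrow a wildcard query to docs containing every fragment trigram.

    A real query would then verify each candidate against the full
    pattern; the trigram pass just prunes the search space.
    """
    needed = [fragment[i:i + 3] for i in range(len(fragment) - 2)]
    return set.intersection(*(index.get(t, set()) for t in needed))

urls = ["/api/v1/users", "/api/v2/orders", "/static/logo.png"]
index = {}
for doc_id, url in enumerate(urls):
    for t in trigrams(url):
        index.setdefault(t, set()).add(doc_id)

print(candidates("api", index))      # {0, 1}: both /api/ URLs survive
print(candidates("orders", index))   # {1}: only one doc has all trigrams
```

That trigram index is the binary doc values column the 9.3 change compresses: for repetitive values like URLs, adjacent documents share most trigrams, which is exactly what Zstandard is good at.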

Also in 9.3, skip lists for @timestamp and host.name became available as an opt-in for LogsDB. Skip lists let Elasticsearch jump ahead in a doc values column without reading every entry, which speeds up time-range queries on large segments. Other index modes have skip lists disabled by default; in LogsDB you can enable them selectively for the fields you range-query most.

Also in 9.3, the Failure Store became enabled by default for logs-*-* data streams. Failed documents (mapping conflicts, ingest pipeline errors) now land in dedicated ::failures indices instead of being rejected, which means LogsDB's strict synthetic source requirements are less likely to cause silent data loss during migration.

Performance, not just storage

LogsDB started as a storage optimization, and the early releases came with a throughput cost — sorting, synthetic source reconstruction, and ZSTD all add work at write time. Over two years of releases, we clawed that back. Indexing throughput is now on par with what users had before enabling LogsDB. You get the storage reduction without giving up the ingest rate you were used to.

Throughput (teal) has climbed from ~25k to ~35k docs/s since the Tech Preview. Storage on disk (blue) has dropped from ~65 GB to ~36 GB on the same benchmark dataset. Both curves move in the right direction, driven by the same layered releases: ZSTD in 8.16, routing optimization in 8.18, the merge and recovery overhaul in 9.1. Live numbers at elasticsearch-benchmarks.elastic.co.

The two trends compound each other. Less storage means fewer segments to merge, which frees CPU for indexing. Reconstructing documents from doc values at read time is cheaper than storing and replicating the raw blob at write time. Each release that shrank the index also reduced background I/O, which fed back into throughput.

The practical result: if you were running standard Elasticsearch for log ingestion two years ago, the throughput you had then is roughly what LogsDB delivers now — with a 50–75% smaller index alongside it.

How to enable it

As of 9.0, logs-*-* data streams default to LogsDB automatically. If your data streams match that pattern, you're already using it.

Want a hands-on walkthrough? Cut Elasticsearch log storage costs by 76% with LogsDB walks through creating two indices, reindexing, and measuring the difference with the _stats API — including version-specific enable instructions for 8.x clusters.

For other index patterns, set it in your template:

PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.mode": "logsdb"
    }
  }
}

Synthetic _source turns on automatically with index.mode: logsdb.

For the routing optimization (8.18+), add one more setting:

PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.mode": "logsdb",
      "index.logsdb.route_on_sort_fields": true
    }
  }
}

This routes shards by sort field values instead of _id, adding ~20% storage reduction at a 1–4% ingestion penalty. It requires at least two sort fields beyond @timestamp and auto-generated _id.

Switching an existing index to LogsDB requires a reindex. So does rolling back. There's no in-place conversion, so try it on new data streams first.

Storage improves further as segments merge — freshly written data compresses well, but merged segments compress even better.

What's next

Elasticsearch still carries some structural overhead from its search engine roots. _id and _seq_no are two examples: both consume meaningful disk space (on small documents they can account for more than half the index size), but neither is essential for log analytics workloads.

We've already taken the first step for TSDB: PR #144026 eliminated stored _id bytes from TSDB indices by reconstructing the field on the fly from doc values, the same approach synthetic _source uses. We're exploring the same direction for LogsDB.

9.4 and beyond. The architecture still has room to improve, and we're on it.

For the full reference, see the logs data stream documentation.
