If you've run a Streams pipeline for more than a week, you've probably hit a processing failure. Before Streams, that often meant dropped data or a dead letter queue at the shipper layer: extra infrastructure you had to operate separately. Here's the recovery loop today.
When processing fails, data lands in the failure store
When a Streams pipeline fails (a Grok pattern doesn't match, a field type conflicts with the mapping), the documents that caused the failure are written to the failure store. The failure store is a set of backing indices attached to your data stream. It scales the same way as any other data stream, so it can absorb everything that fails. It's enabled by default for logs as of Elasticsearch 9.2.
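If a stream predates that default, or you're on an earlier version, the failure store can be switched on per data stream. A minimal sketch using the data stream options endpoint, with logs-myapp-default as a hypothetical stream name (the exact request shape may vary by version):

PUT _data_stream/logs-myapp-default/_options
{
  "failure_store": {
    "enabled": true
  }
}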
The Data quality tab shows the health of your stream and the documents sitting in its failure store. When failures are accumulating, you'll see a rising count of failed documents along with the error type and a sample of the messages that triggered it.
The Data quality tab showing a rising failure count, error type, and a sample of the documents that triggered it.
A Grok expression mismatch (illegal_argument_exception) is sending documents to the failure store. The raw log line doesn't match the expected pattern. The documents aren't dropped. They're in the failure store, ready to debug against.
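Each entry in the failure store is a small envelope around the original document, shaped roughly like this (the values here are illustrative, including the stream name):

{
  "@timestamp": "2025-06-12T09:41:00.000Z",
  "document": {
    "index": "logs-myapp-default",
    "source": {
      "message": "cache_hit: Cache hit for key: config"
    }
  },
  "error": {
    "type": "illegal_argument_exception",
    "message": "Provided Grok expressions do not match field value ..."
  }
}

document.source holds the unmodified original; error.type and error.message are what the Data quality tab surfaces.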
Processing: Switch the sample source to the failure store
Start by navigating to the Processing tab.
By default, the editor samples from recent live documents. Switch the sample source to Failure store instead: it loads the exact documents that failed, the unmodified originals before any Streams processing ran. You're iterating against the actual failures.
Change the sample source dropdown from the default to Failure store.
The sample source dropdown with the Failure store option selected.
The editor loads up to 100 documents from the failure store and runs them through the current pipeline. You can see exactly where parsing breaks down.
The pipeline editor loaded with documents from the failure store instead of recent live samples.
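The same sample is reachable outside the UI: in recent Elasticsearch versions you can query a stream's failure store directly with the ::failures selector. A sketch, again with a hypothetical stream name:

GET logs-myapp-default::failures/_search
{
  "size": 100,
  "sort": [{ "@timestamp": "desc" }]
}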
Fix the processor against the actual failures
With the failure store documents loaded as samples, iterate on the processor. The editor shows you the result against the actual failed documents in real time.
In this example, the pipeline was originally built to parse HTTP access logs:
DELETE /api/v1/auth/logout from 26.72.241.177 - Status: 200 - Response time: 38ms - Request ID: req_24363339 - Location: São Paulo, BR - Device: desktop
HEAD /api/v1/notifications from 20.94.145.254 - Status: 202 - Response time: 60ms - Request ID: req_74513322 - Location: Tokyo, JP - Device: mobile
The original Grok pattern matched those:
%{WORD:http.method} %{URIPATH:uri.path}
A second log type started flowing in. Cache hits and external API calls arrived in a different format:
cache_hit: Cache hit for key: config
external_api_call: External API call completed - latency: 1695ms - Duration: 598ms
The original pattern doesn't match these at all. Every one goes straight to the failure store. With the failure store loaded as the sample source, the problem is immediately obvious: the editor shows the parse failing on lines that start with a word followed by a colon, not an HTTP method followed by a path.
The fix is a second pattern to handle the new format:
%{WORD:event.type}: %{GREEDYDATA:message}
Add it to the processor, and the editor immediately shows both log types parsing correctly against the failure store samples.
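Expressed as a plain ingest pipeline, the repaired step amounts to a grok processor that tries each pattern in order and takes the first match. A minimal sketch, assuming the raw line lands in the message field as in the samples above (Streams manages the real pipeline for you):

{
  "grok": {
    "field": "message",
    "patterns": [
      "%{WORD:http.method} %{URIPATH:uri.path}",
      "%{WORD:event.type}: %{GREEDYDATA:message}"
    ]
  }
}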
When the sample view shows all fields extracting correctly and the parse rate hits 100%, the fix is ready.
Both log types parsing correctly after adding the second Grok pattern. Parse rate at 100%.
No guessing. The editor confirms the fix before you save.
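The same check works outside the editor with the simulate API, feeding one line of each format through the sketched processor (this is a sketch, not the exact pipeline Streams saves):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": [
            "%{WORD:http.method} %{URIPATH:uri.path}",
            "%{WORD:event.type}: %{GREEDYDATA:message}"
          ]
        }
      }
    ]
  },
  "docs": [
    { "_source": { "message": "HEAD /api/v1/notifications from 20.94.145.254 - Status: 202 - Response time: 60ms - Request ID: req_74513322 - Location: Tokyo, JP - Device: mobile" } },
    { "_source": { "message": "cache_hit: Cache hit for key: config" } }
  ]
}

Both documents come back with their extracted fields instead of an error.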
Watch the failure count drop
Save the updated pipeline. New documents are now processed with the corrected pipeline. Switch back to the Data quality tab and watch the failure count.
The failure count dropping as new documents are processed by the corrected pipeline.
The count drops as the fixed pipeline handles new incoming data correctly. The remaining documents in the failure store are the pre-fix failures. They'll clear out as retention ages them off.
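If you'd rather track that count outside the UI, the ::failures selector should also work with the count API (hypothetical stream name again):

GET logs-myapp-default::failures/_count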
The fix applies to new documents only. Documents already in the failure store aren't automatically reprocessed; each was processed by the pipeline version active when it arrived. If you need them in your main stream, that's a separate step.
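One way to do that, sketched under the same assumptions as above: pull the originals out of document.source in the failure store, then re-send them to the stream with the bulk API now that the fixed pipeline can parse them. Data streams require the create action, and each document needs an @timestamp; in practice you'd carry it over from the failure record (the values below are illustrative):

GET logs-myapp-default::failures/_search
{
  "_source": ["document.source"],
  "size": 100
}

POST logs-myapp-default/_bulk
{ "create": {} }
{ "@timestamp": "2025-06-12T09:41:00.000Z", "message": "cache_hit: Cache hit for key: config" }
{ "create": {} }
{ "@timestamp": "2025-06-12T09:41:03.000Z", "message": "external_api_call: External API call completed - latency: 1695ms - Duration: 598ms" }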
The recovery loop
Open Data quality, switch to the failure store, fix the processor, save. The whole thing takes a few minutes at most.
No re-ingestion from source. No shipper-level dead letter queue to operate. If you haven't checked the Data quality tab for your streams recently, it's worth a look. There might be failures sitting there that a one-line fix would clear.
For a deeper look at what the Data quality tab shows and how to configure the failure store, see Elastic Observability: Streams Data Quality and Failure Store Insights.