Pipeline Testing Guide
Pipeline tests validate your Elasticsearch ingest pipelines by feeding them test data and comparing the output against expected results. This is essential for verifying that your data transformation logic works correctly before deploying to production. Test inputs are log or JSON files captured at the point where they would be ingested into Elasticsearch, after any agent processors would have run in a real integration. Test outputs are the documents produced by the ingest pipeline, as they would be written to Elasticsearch indices in a real integration.
For more information on pipeline tests, refer to https://github.com/elastic/elastic-package/blob/main/docs/howto/pipeline_testing.md.
# Start Elasticsearch
elastic-package stack up -d --services=elasticsearch
# Run pipeline tests
cd packages/your-package
elastic-package test pipeline
# Generate expected results (first time setup)
elastic-package test pipeline --generate
# Clean up
elastic-package stack down
Pipeline tests verify:
- Field extraction and parsing logic
- Data type conversions and formatting
- ECS field mapping compliance
- Error handling and edge cases
Pipeline tests live in the data stream's test directory:
packages/your-package/
  data_stream/
    your-stream/
      _dev/
        test/
          pipeline/
            test-sample.log                 # Raw log input
            test-sample.log-config.yml      # Test configuration (optional)
            test-sample.log-expected.json   # Expected output
            test-events.json                # JSON event input
            test-events.json-expected.json  # Expected output
There are two input types for pipeline tests: raw log files and JSON event files. They are differentiated by their extension: raw log files use .log and JSON event files use .json.
Best for testing log-based integrations. Use actual log samples from your application.
Example: test-access.log
127.0.0.1 - - [07/Dec/2016:11:04:37 +0100] "GET /test1 HTTP/1.1" 404 571 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36"
127.0.0.1 - - [07/Dec/2016:11:04:58 +0100] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:49.0) Gecko/20100101 Firefox/49.0"
Advantages:
- Use real application logs
- Natural multiline handling
- Easy to collect samples from production
- Good for regression testing
Best for testing structured data inputs or when you need precise control over input fields.
Example: test-metrics.json
{
  "events": [
    {
      "@timestamp": "2024-01-15T10:30:00.000Z",
      "message": "{\"cpu_usage\": 85.2, \"memory_usage\": 1024}",
      "agent": {
        "hostname": "web-server-01"
      }
    },
    {
      "@timestamp": "2024-01-15T10:31:00.000Z",
      "message": "{\"cpu_usage\": 72.8, \"memory_usage\": 896}",
      "agent": {
        "hostname": "web-server-01"
      }
    }
  ]
}
Advantages:
- Precise control over input data
- Perfect for metrics and structured data
- Easy to test edge cases
- Good for mocking complex scenarios
Configure test behavior with optional -config.yml files:
Example: test-access.log-config.yml
# Add static fields to all events
fields:
  "@timestamp": "2020-04-28T11:07:58.223Z"
  ecs.version: "8.0.0"
  event.dataset: "nginx.access"
  event.category: ["web"]

# Handle dynamic/variable fields
dynamic_fields:
  url.original: "^/.*$"                     # Regex pattern matching
  user_agent.original: ".*"                 # Any user agent
  source.ip: "^\\d+\\.\\d+\\.\\d+\\.\\d+$"  # IP addresses

# Fields that should be keywords despite numeric values
numeric_keyword_fields:
  - http.response.status_code
  - network.iana_number
The fields section defines fields that will be added to all events before the ingest pipeline is run on the test data.
The dynamic_fields section allows pipeline tests to handle dynamically changing results by comparing the actual value of each listed field against the specified pattern, rather than against a static expected value.
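The pattern-versus-static comparison can be sketched roughly as follows. This is a simplified illustration of the idea, not elastic-package's actual implementation; the function name and structure are hypothetical:

```python
import re

def field_matches(field, actual_value, expected_value, dynamic_fields):
    """Illustrate how a dynamic field is compared against a regex
    while every other field requires exact equality."""
    pattern = dynamic_fields.get(field)
    if pattern is not None:
        # Dynamic field: the actual value only has to match the pattern.
        return re.match(pattern, str(actual_value)) is not None
    # Static field: the actual value must equal the expected value exactly.
    return actual_value == expected_value

dynamic = {"source.ip": r"^\d+\.\d+\.\d+\.\d+$"}
# A different IP still passes, because only the pattern is checked:
print(field_matches("source.ip", "10.0.0.7", "127.0.0.1", dynamic))           # True
# A static field must match exactly:
print(field_matches("event.dataset", "nginx.error", "nginx.access", dynamic))  # False
```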
The numeric_keyword_fields section identifies fields whose values are numbers but are expected to be stored in Elasticsearch as keyword fields.
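Such fields are typically declared as keyword in the data stream's mappings. An illustrative fields.yml fragment (the path and description are examples, not taken from a real package):

```yaml
# data_stream/your-stream/fields/fields.yml (illustrative fragment)
- name: network.iana_number
  type: keyword
  description: IANA protocol number; numeric-looking but stored as a keyword.
```

Because the mapping stores the value as a string, the test harness must also compare it as a string, which is what listing it under numeric_keyword_fields accomplishes.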
For logs that span multiple lines:
Example: test-java-stacktrace.log-config.yml
multiline:
  first_line_pattern: "^\\d{4}-\\d{2}-\\d{2}"  # Date at start of new entry
fields:
  "@timestamp": "2024-01-15T10:30:00.000Z"
  log.level: "ERROR"
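An input file matching that first-line pattern might look like the following. This is a hypothetical sample; the class names and message are illustrative:

```
2024-01-15 10:29:58 ERROR Unhandled exception in request handler
java.lang.NullPointerException: user was null
    at com.example.app.UserService.load(UserService.java:42)
    at com.example.app.ApiController.handle(ApiController.java:17)
```

The first line starts with a date and opens a new event; the continuation lines do not match the pattern, so they are appended to the same event.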
Example: test-complex.log-config.yml
# Static fields
fields:
  "@timestamp": "2024-01-15T10:30:00.000Z"
  event.dataset: "myapp.logs"
  tags: ["test", "development"]

# Dynamic patterns
dynamic_fields:
  # Match any UUID format
  user.id: "^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
  # Match any session ID
  session.id: "^[A-Za-z0-9]{32}$"
  # Match timestamps in different formats
  "@timestamp": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}"

# Convert these numeric values to keywords
numeric_keyword_fields:
  - process.pid
  - http.response.status_code

# Multiline Java stack traces
multiline:
  first_line_pattern: "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}"
  max_lines: 50
Define the expected output in -expected.json files:
Example: test-access.log-expected.json
{
  "expected": [
    {
      "@timestamp": "2016-12-07T10:04:37.000Z",
      "event": {
        "category": ["web"],
        "dataset": "nginx.access",
        "outcome": "failure"
      },
      "http": {
        "request": {
          "method": "GET"
        },
        "response": {
          "status_code": 404,
          "body": {
            "bytes": 571
          }
        },
        "version": "1.1"
      },
      "source": {
        "ip": "127.0.0.1"
      },
      "url": {
        "original": "/test1"
      },
      "user_agent": {
        "original": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36"
      }
    }
  ]
}
# Start only Elasticsearch (faster than full stack)
elastic-package stack up -d --services=elasticsearch
# Verify Elasticsearch is running
curl -X GET "https://localhost:9200/_cluster/health"
# Run all pipeline tests in current package
elastic-package test pipeline
# Run tests for specific data streams
elastic-package test pipeline --data-streams access,error
# Run with verbose output
elastic-package test pipeline -v
# Run tests and show detailed diff on failure
elastic-package test pipeline --report-format human
Use this for initial test setup or when updating pipelines. --generate will write (or overwrite) the expected files with the output from the current ingest pipelines.
# Generate expected results for all tests
elastic-package test pipeline --generate
# Generate for specific data streams
elastic-package test pipeline --data-streams access --generate
# Review generated files before committing
git diff _dev/test/pipeline/
Verify the correctness of the generated expected files. elastic-package creates them from the output of the current ingest pipeline; it cannot know whether that output is actually correct, so you need to verify it yourself.
If the expected files are not correct, iterate: update the ingest pipeline and regenerate the expected files until they are.
Workflow tip:
- Create test input files first
- Run with --generate to create expected results
- Review generated output for correctness
- Commit both input and expected files
- Future runs will validate against these expectations
# 1. Create test input
echo 'error log entry here' > _dev/test/pipeline/test-error.log
# 2. Generate expected results
elastic-package test pipeline --data-streams your-stream --generate
# 3. Review generated output
cat _dev/test/pipeline/test-error.log-expected.json
# 4. Run tests to validate
elastic-package test pipeline --data-streams your-stream
# 5. Iterate on pipeline, then regenerate when needed
elastic-package test pipeline --data-streams your-stream --generate
Common issues and solutions:
Test failures with field value mismatches:
# Run with verbose output to see detailed diffs
elastic-package test pipeline -v --report-format human
# Check for dynamic fields that need configuration
# Add patterns to dynamic_fields in config file
Pipeline not found errors:
# Verify pipeline files exist
ls -la data_stream/*/elasticsearch/ingest_pipeline/
# Check pipeline syntax
elastic-package lint
# Manually test pipeline upload
curl -X PUT "https://localhost:9200/_ingest/pipeline/your-pipeline" \
-H "Content-Type: application/json" \
-d @data_stream/your-stream/elasticsearch/ingest_pipeline/default.yml
If using curl on localhost, the --insecure flag may be required, or the CA certificate can be specified with --cacert ~/.elastic-package/profiles/default/stack/certs/ca-cert.pem.
Multiline parsing issues:
# Test multiline patterns separately
echo -e "line1\nline2\nline3" | grep -P "^your-pattern"
# Validate regex patterns
python3 -c "import re; print(re.match(r'^your-pattern', 'test-line'))"
Field type mismatches:
# Check mapping definitions
cat data_stream/*/fields/fields.yml
# Add numeric fields to config if needed
# numeric_keyword_fields: [field.name]
- Test real data: Use actual log samples from production. Be sure to sanitize any sensitive data before committing to source control.
- Cover edge cases: Include malformed, empty, and unusual inputs
- Test error conditions: Verify graceful handling of bad data
- Keep tests focused: One test file per scenario
- Use descriptive names: test-successful-login.log vs test1.log
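The sanitization tip above can be sketched with a quick shell snippet. The file names are placeholders, and the sed pattern only handles IPv4 addresses; adapt it to whatever sensitive data your logs actually contain:

```shell
# Stand-in for a real production log line (placeholder content).
printf '10.1.2.3 - - "GET /login HTTP/1.1" 200 512\n' > production-access.log

# Replace real IPv4 addresses with a documentation address (RFC 5737)
# before committing the sample as a test input.
sed -E 's/[0-9]{1,3}(\.[0-9]{1,3}){3}/203.0.113.1/g' \
  production-access.log > test-access.log

cat test-access.log
```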
Ensure comprehensive coverage by writing tests that cover as many different scenarios and types of data as possible:
# Test different log levels
test-debug.log
test-info.log
test-warn.log
test-error.log
# Test different formats
test-json-format.log
test-plain-format.log
test-multiline-stacktrace.log
# Test edge cases
test-empty-lines.log
test-malformed.log
test-unicode.log
- Minimize static fields: Only add what's necessary
- Use dynamic patterns carefully: Overly broad patterns may hide real issues
- Document regex patterns: Add comments explaining complex patterns
# Test individual pipeline components
curl -X POST "https://localhost:9200/_ingest/pipeline/_simulate" \
-H "Content-Type: application/json" \
-d '{
"pipeline": {"processors": [{"grok": {"field": "message", "patterns": ["your-pattern"]}}]},
"docs": [{"_source": {"message": "test log line"}}]
}'
# Check what fields are actually generated
elastic-package test pipeline --generate
jq '.expected[0] | keys' test-sample.log-expected.json
Pipeline tests work by uploading the ingest pipelines under test to the configured Elasticsearch instance. The Simulate API is used to process the logs/metrics from the test data files, and the actual results are then compared against the expected results defined in the test files.
For more information, refer to https://github.com/elastic/elastic-package/blob/main/docs/howto/pipeline_testing.md.