Sherry Ger, Alexander Wert

Bridging the Gap: End-to-End Observability from Cloud Native to Mainframe

Achieving end-to-end observability in hybrid enterprise environments, where modern cloud-native applications interact with critical yet often opaque IBM mainframe systems, is a challenge. Pairing IBM Z Observability Connect, which emits OpenTelemetry (OTel) data, with Elastic Observability solves it, transforming your mainframe from a black box into a fully observable component of your deployment.


Introduction

OpenTelemetry is emerging as the standard for modern observability. As a highly active project within the Cloud Native Computing Foundation (CNCF)—second only to Kubernetes—it has become the monitoring solution of choice for cloud-native applications. OpenTelemetry provides a unified method for collecting traces, metrics, and logs across Kubernetes, microservices, and infrastructure.

However, for many enterprises—especially in banking, insurance, healthcare, and government—the reality is more complex than just “cloud native.” Although most organizations have deployed mobile apps and adopted microservices architectures, much of their critical core processing still relies on IBM mainframe applications. These systems process credit card swipes, financial transactions, patient records, and premium calculations.

This creates a dilemma: while the modern distributed systems of the hybrid environment are well-observed, the critical backend remains a black box.

The “Broken Trace”

A common challenge we see with customers involves a request that originates from a modern mobile application. The request hits microservices running on Kubernetes, initiates a service call to the mainframe, and suddenly, visibility stops.

When latency spikes or a transaction fails, Site Reliability Engineers (SREs) are left guessing. Is it the network? The API gateway? Or underlying mainframe applications like CICS? Without a unified, end-to-end view of the services involved—from the frontend Node.js microservices to the backend CICS service—mean time to resolution (MTTR) becomes “mean time to innocence,” with teams simply proving it wasn't their microservice rather than fixing root causes.

We need a unified view where a trace flows seamlessly from a cloud-native frontend (like React) all the way into mainframe transactions.

IBM Z Observability Connect

With the recent release of Z Observability Connect, IBM has introduced OpenTelemetry-native instrumentation into mainframe applications. This creates a bridge between modern cloud-native services and mainframe transactions.

This means the mainframe is no longer a special case; it acts just like any other microservice in a mesh. It functions as an OpenTelemetry data producer, emitting traces, metrics, and logs to OpenTelemetry-compliant backends like Elastic.

The Architecture

The architecture is straightforward:

  • The Collector: IBM Z Observability Connect runs on z/OS. It collects logs, metrics, or traces and converts them into the OTLP (OpenTelemetry Protocol) format.
  • The Processor: The Elastic Cloud Managed OTLP Endpoint acts as a gateway collector, providing fully hosted, scalable, and reliable native OTLP ingestion.
  • The Consumer: Elastic APM enables OpenTelemetry-native application performance monitoring, making it easy to pinpoint and fix performance problems quickly.

Putting it all together in Kubernetes

We deploy an OpenTelemetry Collector within our Kubernetes cluster. This collector acts as a specialized gateway: it is configured to receive OTLP traffic directly from IBM Z Observability Connect on the mainframe and to forward it securely to our observability backend, Elastic APM, using the otlp/elastic exporter.
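Because the mainframe sits outside the cluster, the collector's OTLP ports also need to be reachable from z/OS. As a minimal sketch (the names, namespace, and the choice of a LoadBalancer Service are assumptions for illustration; network and TLS requirements will dictate the right exposure in practice), the gateway can be exposed like this:

apiVersion: v1
kind: Service
metadata:
  name: otel-gateway           # illustrative name
  namespace: observability     # illustrative namespace
spec:
  type: LoadBalancer           # assumed exposure; NodePort or an ingress may fit better
  selector:
    app: otel-collector        # must match the labels on your collector Pods
  ports:
    - name: otlp-grpc
      port: 4317               # default OTLP/gRPC port
      targetPort: 4317
    - name: otlp-http
      port: 4318               # default OTLP/HTTP port
      targetPort: 4318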

Here is the configuration for the OpenTelemetry Collector. Note the exporters section, which handles the authentication and batched transmission to Elastic:

exporters:
  # Exporter to print the first 5 items in detail and then every 1000th
  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 1000

  # Exporter to send the telemetry to the Elastic Cloud Managed OTLP Endpoint
  otlp/elastic:
    endpoint: ${env:ELASTIC_OTLP_ENDPOINT}
    headers:
      Authorization: ApiKey ${env:ELASTIC_API_KEY}
    sending_queue:
      enabled: true
      sizer: bytes
      queue_size: 50000000 # 50MB uncompressed
      block_on_overflow: true
    batch:
      flush_timeout: 1s
      min_size: 1000000 # 1MB uncompressed
      max_size: 4000000 # 4MB uncompressed

# Standard definitions for the components referenced in the service pipeline below;
# adjust endpoints to your environment.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

extensions:
  pprof:
  zpages:
  health_check:

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/elastic, debug]

Note: We strongly recommend using environment variables for your endpoints and API keys to keep your manifest secure.
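For example, in Kubernetes these variables can be populated from a Secret referenced by the collector Deployment. This is a minimal sketch under assumed names (elastic-otlp-credentials, plus placeholders you would replace with your own endpoint and API key):

apiVersion: v1
kind: Secret
metadata:
  name: elastic-otlp-credentials   # illustrative name
type: Opaque
stringData:
  ELASTIC_OTLP_ENDPOINT: https://<your-managed-otlp-endpoint>:443   # placeholder
  ELASTIC_API_KEY: <your-api-key>                                   # placeholder

The collector container can then pick both values up through an envFrom/secretRef entry in the Deployment, so neither the endpoint nor the key appears in the collector manifest itself.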

Why the OTel specification matters

Elastic’s managed OTLP endpoint and observability solution are built with native OTel support and adhere to the OTel specification and semantic conventions. Once we wired everything up and the data started to flow, we noticed that some of the traces in Elastic APM were not being represented correctly.

Most observability solutions derive the so-called RED metrics (rate, error, and duration) for the most important spans in a trace—i.e., incoming and outgoing spans of each individual service. This allows for an efficient indication of a service’s performance without the need to comb through all of the tracing data to show something as simple as the latency of a service’s endpoint or the error rate on outgoing requests.

For an efficient calculation of such derived metrics for incoming spans on a service, the OTel community introduced the SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK and SPAN_FLAGS_CONTEXT_IS_REMOTE_MASK flags on the span entities within the OTLP protocol. These flags provide an unambiguous indication of whether an individual span is an entry span and, thus, allow observability backends to efficiently calculate metrics for entry-level spans.
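To make this concrete, here is a minimal sketch in Python (not Elastic's actual implementation) of how a backend can combine these bit masks with the parent span ID to decide whether a span is an entry span:

# Bit masks as defined in the SpanFlags enum of the OTLP protocol
# (opentelemetry-proto).
SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK = 0x00000100
SPAN_FLAGS_CONTEXT_IS_REMOTE_MASK = 0x00000200


def is_entry_span(flags: int, parent_span_id: bytes) -> bool:
    """Return True if the span is an entry (local root) span of its service."""
    if not parent_span_id:
        # No parent at all: this span starts the trace and is an entry span.
        return True
    has_remote_info = bool(flags & SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK)
    parent_is_remote = bool(flags & SPAN_FLAGS_CONTEXT_IS_REMOTE_MASK)
    # The parent context must be explicitly flagged as remote, i.e. received
    # over the wire from another service, for this to count as an entry span.
    return has_remote_info and parent_is_remote

If a producer never sets SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK, or sets SPAN_FLAGS_CONTEXT_IS_REMOTE_MASK on the wrong spans, a check like this silently misclassifies them, which is exactly the symptom described next.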

If these flags are set incorrectly for an entry span, the span cannot be recognized as an entry span, and metrics are not derived properly—leading to a broken experience. This is what we initially experienced with the ingested OTel data from the IBM mainframe instrumentation.

In a proprietary world, this might have been a dead end or a months-long troubleshooting exercise. However, since OpenTelemetry is an open standard, we were able to debug the issue rapidly and share our findings with IBM engineers, who quickly developed a fix.

Streamline observability

We now have end-to-end visibility that spans from modern mobile or web applications deep into the IBM mainframe. This unlocks significant value:

  • Unified Service Maps: You can visually see the dependency between the cloud-native cart service and the backend inventory system on z/OS.
  • Single Pane of Glass: SREs no longer need to switch between modern observability tools and separate mainframe monitoring tools to view service health.
  • Operational Efficiency: By eliminating the “blind spot” in the trace, you reduce the time spent on coordinating between cloud and mainframe teams, making issue resolution faster.

Conclusion

If you are running hybrid workloads, it is time to stop treating your mainframe as a black box. With IBM Z Observability Connect, the Elastic Managed OTLP Endpoint, and Elastic APM, your entire stack can finally speak a single language: OpenTelemetry.
