Introduction
OpenTelemetry is emerging as the standard for modern observability. It is one of the most active projects in the Cloud Native Computing Foundation (CNCF), second only to Kubernetes, and has become the instrumentation framework of choice for cloud-native applications. OpenTelemetry provides a unified way to collect traces, metrics, and logs across Kubernetes, microservices, and infrastructure.
However, for many enterprises—especially in banking, insurance, healthcare, and government—the reality is more complex than just “cloud native.” Although most organizations have deployed mobile apps and adopted microservices architectures, much of their critical core processing still relies on IBM mainframe applications. These systems process credit card swipes, financial transactions, patient records, and premium calculations.
This creates a dilemma: while the modern distributed systems of the hybrid environment are well-observed, the critical backend remains a black box.
The “Broken Trace”
A common challenge we see with customers involves a request that originates from a modern mobile application. The request hits microservices running on Kubernetes, initiates a service call to the mainframe, and suddenly, visibility stops.
When latency spikes or a transaction fails, Site Reliability Engineers (SREs) are left guessing. Is it the network? The API gateway? Or an underlying mainframe subsystem like CICS? Without a unified, end-to-end view of the services involved, from the frontend Node.js microservices to the backend CICS service, mean time to resolution (MTTR) becomes “mean time to innocence”: teams simply prove it wasn't their microservice rather than fixing root causes.
We need a unified view where a trace flows seamlessly from a cloud-native frontend (like React) all the way into mainframe transactions.
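To make "a trace flows seamlessly" concrete, here is a minimal sketch of how trace context is typically carried from one service to the next. It assumes W3C Trace Context propagation (the `traceparent` header, OpenTelemetry's default propagator); this article does not state which propagation mechanism reaches the mainframe, and the function names `parse_traceparent` and `child_traceparent` are hypothetical helpers written for illustration.

```python
import re
import secrets

# W3C Trace Context "traceparent": version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields, rejecting malformed values."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        raise ValueError(f"invalid traceparent: {header!r}")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming: str) -> str:
    """Build the header a service sends downstream: same trace ID, new span ID."""
    ctx = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{new_span_id}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(child_traceparent(incoming))
```

The trace ID stays constant across every hop while each service contributes its own span ID. A trace "breaks" when a hop drops or fails to honor this context, which is why the downstream system has to participate in propagation.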
IBM Z Observability Connect
With the recent release of Z Observability Connect, IBM has introduced OpenTelemetry-native instrumentation into mainframe applications. This creates a bridge between modern cloud-native services and mainframe transactions.
This means the mainframe is no longer a special case; it acts just like any other microservice in a mesh. It functions as an OpenTelemetry data producer, emitting traces, metrics, and logs to OpenTelemetry-compliant backends like Elastic.
The Architecture
The architecture is straightforward:
- The Collector: IBM Z Observability Connect runs on z/OS. It collects logs, metrics, and traces and converts them into the OpenTelemetry Protocol (OTLP) format.
- The Processor: The Elastic Cloud Managed OTLP Endpoint acts as a gateway collector, providing fully hosted, scalable, and reliable native OTLP ingestion.
- The Consumer: Elastic APM enables OpenTelemetry-native application performance monitoring, making it easy to pinpoint and fix performance problems quickly.
Putting it all together in Kubernetes
We deploy an OpenTelemetry Collector within our Kubernetes cluster. This collector acts as a specialized gateway. It is configured to receive OTLP traffic directly from IBM Z Observability Connect on the mainframe and forward it securely to our observability backend, Elastic APM, by using the
Here is the configuration for the OpenTelemetry Collector. Note the
exporters:
  # Exporter to print the first 5 logs/metrics and then every 1000th
  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 1000
  # Exporter to send logs and metrics to Elasticsearch Managed OTLP Input
  otlp/elastic:
    endpoint: ${env:ELASTIC_OTLP_ENDPOINT}
    headers:
      Authorization: ApiKey ${env:ELASTIC_API_KEY}
    sending_queue:
      enabled: true
      sizer: bytes
      queue_size: 50000000 # 50MB uncompressed
      block_on_overflow: true
      batch:
        flush_timeout: 1s
        min_size: 1_000_000 # 1MB uncompressed
        max_size: 4_000_000 # 4MB uncompressed
service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/elastic, debug]
Note: We strongly recommend using environment variables for your endpoints and API keys so that credentials never appear in the manifest itself.
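A forgotten environment variable is an easy mistake to make with this approach. The following is a small, hypothetical preflight check (not part of the Collector or of any Elastic tooling) that scans a configuration file for `${env:NAME}` references and reports the ones that are unset. The helper name `missing_env_vars` is an assumption made for illustration.

```python
import os
import re

# Matches Collector-style references such as ${env:ELASTIC_API_KEY}.
# References that carry a default value (${env:NAME:-fallback}) are
# deliberately not matched, since they cannot be "missing".
_ENV_REF = re.compile(r"\$\{env:([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_env_vars(config_text: str, environ=None) -> list:
    """Return the referenced variable names that are unset or empty, sorted."""
    env = os.environ if environ is None else environ
    referenced = set(_ENV_REF.findall(config_text))
    return sorted(name for name in referenced if not env.get(name))


if __name__ == "__main__":
    sample = (
        "endpoint: ${env:ELASTIC_OTLP_ENDPOINT}\n"
        "Authorization: ApiKey ${env:ELASTIC_API_KEY}\n"
    )
    missing = missing_env_vars(sample)
    if missing:
        print("Unset variables:", ", ".join(missing))
```

Running a check like this in a deployment pipeline surfaces a missing key before the Collector starts, rather than as an authentication error at export time.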
Why the OTel specification matters
Elastic’s managed OTLP endpoint and observability solution are built with native OTel support and adhere to the OTel specification and semantic conventions. Even so, once we wired everything up and the data started to flow, we noticed that some of the traces in Elastic APM were not being represented correctly.
Most observability solutions derive the so-called RED metrics (rate, errors, and duration) for the most important spans in a trace, that is, the incoming and outgoing spans of each individual service. These derived metrics indicate a service’s performance efficiently: there is no need to comb through all of the tracing data to show something as simple as the latency of a service’s endpoint or the error rate on outgoing requests.
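As a rough illustration of what "deriving RED metrics" means, the sketch below aggregates rate, error ratio, and mean duration from a list of spans. It is a simplified model, not Elastic's implementation: the `Span` class and its fields are hypothetical, only entry spans are considered, and a real backend would typically use histograms or percentiles rather than a mean.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    name: str
    duration_ms: float
    is_error: bool
    is_entry: bool  # True when the span handles a request arriving from another service


def red_metrics(spans, window_seconds: float) -> dict:
    """Aggregate rate, error ratio, and mean duration per (service, span name).

    Only entry spans are counted; internal spans are skipped.
    """
    groups = defaultdict(list)
    for span in spans:
        if span.is_entry:
            groups[(span.service, span.name)].append(span)

    result = {}
    for key, group in groups.items():
        count = len(group)
        result[key] = {
            "rate_per_s": count / window_seconds,
            "error_ratio": sum(s.is_error for s in group) / count,
            "mean_duration_ms": sum(s.duration_ms for s in group) / count,
        }
    return result


if __name__ == "__main__":
    spans = [
        Span("inventory", "GET /stock", 10.0, False, True),
        Span("inventory", "GET /stock", 30.0, True, True),
        Span("inventory", "db.query", 5.0, False, False),  # internal, ignored
    ]
    print(red_metrics(spans, window_seconds=2.0))
```

The important detail is the `is_entry` filter: if a backend cannot tell which spans are entry spans, the aggregation has nothing to group, and the service-level metrics come out empty or wrong.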
For an efficient calculation of such derived metrics for incoming spans on a service, the OTel community introduced the
If these flags are set incorrectly for an entry span, the span cannot be recognized as an entry span, and metrics are not derived properly—leading to a broken experience. This is what we initially experienced with the ingested OTel data from the IBM mainframe instrumentation.
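The text above does not spell out which flags it refers to. One candidate, offered here purely as an assumption, is the pair of bits in the OTLP `Span.flags` field that record whether a span's parent context is remote. The mask values below follow the OTLP trace protobuf definition; the `looks_like_entry_span` heuristic is hypothetical and is not a description of how Elastic APM classifies spans.

```python
# Bit masks for the OTLP Span.flags field (opentelemetry-proto, trace.proto).
TRACE_FLAGS_MASK = 0x000000FF            # W3C trace flags (bit 0 = sampled)
CONTEXT_HAS_IS_REMOTE_MASK = 0x00000100  # set when the is-remote bit is meaningful
CONTEXT_IS_REMOTE_MASK = 0x00000200      # set when the parent context is remote


def parent_is_remote(flags: int):
    """Return True or False when known, or None when the producer did not say."""
    if not flags & CONTEXT_HAS_IS_REMOTE_MASK:
        return None
    return bool(flags & CONTEXT_IS_REMOTE_MASK)


def looks_like_entry_span(flags: int, has_parent: bool) -> bool:
    """Hypothetical heuristic: a root span, or a span whose parent is known to be remote."""
    return (not has_parent) or parent_is_remote(flags) is True


if __name__ == "__main__":
    for flags in (0x001, 0x101, 0x301):
        print(hex(flags), parent_is_remote(flags), looks_like_entry_span(flags, True))
```

Under this assumption, a producer that leaves both bits clear, or sets the "is remote" bit without the "has is remote" bit, gives the backend no reliable signal, which would be consistent with the symptom described above.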
In a proprietary world, this might have been a dead end or a months-long troubleshooting exercise. However, since OpenTelemetry is an open standard, we were able to debug the issue rapidly and share our findings with IBM engineers, who quickly developed a fix.
Streamline observability
We now have end-to-end visibility that spans from modern mobile or web applications deep into the IBM mainframe. This unlocks significant value:
- Unified Service Maps: You can visually see the dependency between the cloud-native cart service and the backend inventory system on z/OS.
- Single Pane of Glass: SREs no longer need to switch between modern observability tools and separate mainframe monitoring tools to view service health.
- Operational Efficiency: By eliminating the “blind spot” in the trace, you reduce the time spent coordinating between cloud and mainframe teams and resolve issues faster.
Conclusion
If you are running hybrid workloads, it is time to stop treating your mainframe as a black box. With IBM Z Observability Connect, the Elastic Managed OTLP Endpoint, and Elastic APM, your entire stack can finally speak a single language: OpenTelemetry.
