Elasticsearch now supports PromQL natively, and you can run PromQL queries in Kibana through the PROMQL source command in ES|QL.
That means you can use PromQL to query your Kubernetes metrics stored in Elasticsearch. You can run those queries directly in Discover, Dashboards or alerting rules.
When cluster CPU spikes and you need to find which workload is responsible, narrow from fleet to cluster to namespace to pod, one step at a time.
What you need
- An Observability serverless project or a self-managed or Elastic Cloud Hosted stack at version 9.4 or later, where PromQL is available as a preview query language for metrics.
- Kubernetes metrics flowing into Elasticsearch. This walkthrough uses metrics collected with OpenTelemetry.
- One or more clusters with workloads running, so group by queries have something to compare.
The scenario
You manage a fleet of Kubernetes clusters:
| Cluster | Region | Role |
|---|---|---|
| prod-us-east-1 | US East | Production: services, ML training |
| prod-eu-west-1 | EU West | Production: regional web tier, cache |
| staging-us-east-1 | US East | Staging: QA, integration tests |
| dev-sandbox | US East | Developer sandbox |
The production cluster in US East runs a mix of services and ML training jobs across several namespaces.
An alert fires: cluster-wide CPU is elevated, but only one team is complaining about slower response times.
You are triaging which cluster, then which namespace, then which pod.
You are not after a full root-cause proof in one query, just enough to name the suspect and hand off.
Your data
The OpenTelemetry Collector's Kubelet Stats Receiver populates data streams like metrics-kubeletstatsreceiver.otel-default.
Metrics follow the k8s.* naming convention (for example k8s.pod.cpu.usage) and labels like k8s.cluster.name or k8s.namespace.name let you slice by cluster, namespace, or pod.
To verify the data is there, open Discover, switch to ES|QL mode, run TS metrics-*, and scope the query with WHERE data_stream.dataset == "kubeletstatsreceiver.otel".
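A minimal verification query, assuming the default data stream names from the Collector, looks something like this (the LIMIT just keeps the sample small):

```
// Verify kubeletstats metrics are arriving; LIMIT keeps the sample small
TS metrics-*
| WHERE data_stream.dataset == "kubeletstatsreceiver.otel"
| LIMIT 10
```

If rows come back with k8s.* fields populated, the investigation below will have data to work with.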
Investigation: find the noisy neighbor
Step 1: Which cluster is hot?
When you manage multiple clusters, start at the fleet level.
```
PROMQL sum by (k8s.cluster.name) (k8s.pod.cpu.usage)
```
This groups total pod CPU by cluster.
prod-us-east-1 immediately stands out: total pod CPU is an order of magnitude higher than the other clusters.
The EU production cluster, staging, and dev-sandbox are all quiet.
Now that you know where the problem is, it's time to zoom in.
Step 2: Overall CPU in the hot cluster
Filter to prod-us-east-1 and look at total CPU:
```
PROMQL sum(k8s.pod.cpu.usage{k8s.cluster.name="prod-us-east-1"})
```
This gives you the cluster-wide pod CPU footprint over time.
If the total is climbing or spiking, something changed, but you don't yet know what.
Step 3: Break down by namespace
The fastest way to isolate which team is responsible: group by namespace.
```
PROMQL sum by (k8s.namespace.name) (k8s.pod.cpu.usage{k8s.cluster.name="prod-us-east-1"})
```
Set the time picker in Kibana to cover your incident window.
ml-training dominates at ~2.0 cores while every other namespace stays well below 0.2 cores.
Step 4: Drill down to the pod
Now that you know the namespace, identify the specific pod:
```
PROMQL sum by (k8s.pod.name) (k8s.pod.cpu.usage{k8s.cluster.name="prod-us-east-1", k8s.namespace.name="ml-training"})
```
That ranks pods in the namespace by total CPU.
The chart should make the outlier obvious.
Pod model-train-v2-run-47-d9j67 is consuming the full 2.0 cores.
It is a training job saturating its allocation.
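If the namespace runs many pods, a topk variant keeps the chart to the biggest consumers. This is only a sketch using standard PromQL topk, assuming the preview supports it:

```
// topk is standard PromQL; assumes the preview supports it here
PROMQL topk(5, sum by (k8s.pod.name) (k8s.pod.cpu.usage{k8s.cluster.name="prod-us-east-1", k8s.namespace.name="ml-training"}))
```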
Step 5: Check resource utilization ratios
Raw CPU cores tell you how much. Utilization ratios tell you how close to limits.
A pod hitting 100% of its CPU limit is being throttled, and it is both the noisy neighbor and a victim of its own limits.
```
PROMQL sum by (k8s.namespace.name) (k8s.container.cpu_limit_utilization{k8s.cluster.name="prod-us-east-1"})
```
ml-training shows ~100% CPU limit utilization (pegged at the 2-core limit), while the other namespaces stay under 20%.
This confirms the training job is saturating its allocation and likely causing scheduling pressure on the shared node.
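The same ratio view works for memory, assuming your pipeline also collects the Kubelet Stats Receiver's k8s.container.memory_limit_utilization metric (it is optional and may need to be enabled):

```
// Assumes k8s.container.memory_limit_utilization is enabled in the receiver
PROMQL sum by (k8s.namespace.name) (k8s.container.memory_limit_utilization{k8s.cluster.name="prod-us-east-1"})
```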
What happens next
The PromQL query named the suspect: the training job model-train-v2-run-47 in ml-training.
From here:
- Logs: Filter by the pod name in Discover to see what the training job was doing and whether it logged errors or warnings (see the sketch after this list).
- Kube events: Check for OOMKilled, throttling, or eviction events in the same time window.
- Resource policies: Review whether the training job's requests and limits match its actual usage. A large gap between request and limit lets a pod burst past what the scheduler planned for. Consider ResourceQuota or LimitRange on the namespace.
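As a sketch of the Logs step, an ES|QL query along these lines pulls the pod's recent log lines in Discover. The k8s.pod.name field is an assumption; adjust it to how your OpenTelemetry logs are actually mapped:

```
// Field name is an assumption; adjust to your OTel log mapping
FROM logs-*
| WHERE k8s.pod.name == "model-train-v2-run-47-d9j67"
| SORT @timestamp DESC
| LIMIT 100
```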