Troubleshooting Kibana health

You can access the Elasticsearch® database through Elastic’s UI product: Kibana®. While most Elastic® services run from Elasticsearch, a handful of key features run from within Kibana’s servers: Telemetry, Rules (with associated Actions), Reporting, and Sessions. We’ll walk through troubleshooting overall Kibana health and its most popular plugins. This content is generally true for 7.x and 8.x, but it is tested directly against v8.7.1.

Dependencies

Kibana’s overall availability is dependent upon its ability to:

  1. Connect to and authenticate against Elasticsearch

  2. Access healthy .kibana* System Indices to load its Saved Objects (Some plugins require their own system indices, e.g., Reporting also relies on .reporting*, so Kibana can be performant overall while specific plugins experience errors.)

  3. Initialize its Task Manager and other plugins (Some plugins, like Alerting and Reporting, depend upon others, so cascading errors can often be rooted in a single plugin or a handful of plugins rather than in every degraded plugin. Most commonly, if the Task Manager plugin is experiencing issues, this will cascade into multiple plugins reporting issues.)

  4. Task Manager health, which is dependent upon its config, software-vs-hardware capacity estimation, runtime, and workload

If Kibana experiences issues on step 1 or step 2, it will report the error Kibana Server is not ready yet, which can be investigated via our Troubleshooting Kibana Access doc.

Check overall health

As long as Kibana is available (that is, it isn't erroring Kibana Server is not ready yet), you can quickly check Kibana's Status API and the Task Manager's Health API using Dev Tools:

# undocumented, internal API
## ui
> GET KIBANA_URL/status
## api
> GET kbn:/api/status

# https://www.elastic.co/guide/en/kibana/current/task-manager-api-health.html
> GET kbn:/api/task_manager/_health

Once you have these responses stored as kibana_status.json and kibana_task_manager_health.json respectively, you have the two core APIs that Support inspects from a Kibana Support Diagnostic pull to check Kibana's overall health.
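
If you are pulling these outside of Dev Tools, a minimal shell sketch like the following captures both responses into those filenames (the URL and credentials are placeholders; swap in your own endpoint and authentication, and add certificate flags if your deployment needs them):

# Placeholder endpoint and credentials: substitute your own.
KIBANA_URL="https://localhost:5601"
AUTH="elastic:changeme"

# Save the Status API and Task Manager Health API responses for the jq parsing below.
curl -s -u "$AUTH" "$KIBANA_URL/api/status" > kibana_status.json
curl -s -u "$AUTH" "$KIBANA_URL/api/task_manager/_health" > kibana_task_manager_health.json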

You can inspect the answers to our step-wise dependencies manually or use the third-party JQ tool to automate the JSON parsing. In this blog, I’ll share the JQ commands, but the language’s expressions are simple enough to read for you to manually check outputs if desired.

You can start by confirming if Kibana is currently available or degraded:

$ cat kibana_status.json | jq '{ overall: .status.overall.level }'

As needed, you can check the Kibana logs if you suspect status health flapping after initial startup.
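
For example, a rough grep sketch like the one below surfaces status transitions; the log path and exact message wording are assumptions that vary by version and install method, so adjust the pattern to what your logs actually show:

# Look for status transitions (flapping) in the Kibana server log.
# The path and message keywords are assumptions; adjust for your install.
grep -iE "status.*(available|degraded|unavailable|critical)" /var/log/kibana/kibana.log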

Check dependencies

Kibana’s overall status is a roll-up of the discussed dependencies, so let’s check them:

1+2) Elasticsearch Connectivity + Saved Objects. To confirm Kibana can authenticate to Elasticsearch and pull its Saved Objects from .kibana* system indices, we run:

$ cat kibana_status.json | jq -r '.status.core|{ elasticsearch: .elasticsearch.level, savedObjects: .savedObjects.level }'

We’ll cover Task Manager health (dependency 4) in its own section below to give it the space it deserves.

3) Plugins. To check the health of all plugins, you can run the following JQ:

$ cat kibana_status.json | jq -rc '.status.plugins|to_entries[]|{ plugin: .key, status: .value.level }'

As desired, you can filter to problematic plugins by appending | grep -vw 'available'
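
Put together, the filtered one-liner looks like this:

$ cat kibana_status.json | jq -rc '.status.plugins|to_entries[]|{ plugin: .key, status: .value.level }' | grep -vw 'available'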

If you see multiple plugins in degraded status, it might be worth running the following more targeted query. This query highlights the plugins most likely to cascade into errors for the other plugins or surface as errors to your users:

$ cat kibana_status.json | jq -rc '.status.plugins|to_entries[]|select(.key=="taskManager" or .key=="savedObjects" or .key=="security" or .key=="reporting" or .key=="ruleRegistry" or .key=="alerting") |{plugin:.key, status:.value.level, reason:.value.summary}'

I’ve included reason in this latter query; its output can be quite verbose, but it helps show how plugin problems can be rooted in other unhealthy plugins. Let’s look at an example output:

{"plugin":"reporting","status":"degraded","reason":"4 services are degraded: data, taskManager, embeddable and 1 other(s)"}
{"plugin":"ruleRegistry","status":"degraded","reason":"3 services are degraded: data, triggersActionsUi, security"}
{"plugin":"savedObjects","status":"degraded","reason":"1 service is degraded: data"}
{"plugin":"security","status":"degraded","reason":"1 service is degraded: taskManager"}
{"plugin":"taskManager","status":"degraded","reason":"Task Manager is unhealthy"}

Reading this output from the root up: the Task Manager is unhealthy, which causes security, savedObjects, and reporting to degrade. That cascade in turn causes ruleRegistry (and reporting again) to report degraded. So for this output, the only plugin to focus on resolving is taskManager; once it is healthy, we can revisit the others and confirm they were only degraded by the cascade.

I’ll note these plugins can potentially impact users’ experience (e.g., loading Discover; see our troubleshooting blog).

Check Task Manager

The Task Manager is the backbone of Kibana, so it has its own Health API. While it performs well the vast majority of the time for Elastic’s customers, it is the most common source of performance issues and has its own Troubleshooting doc, which is why we’ve split it into its own section.

To be clear: you probably don’t need to check this section unless the taskManager plugin was flagged in the previous section. Assuming it was, let’s first confirm the Task Manager agrees with its reported status:

$ cat kibana_task_manager_health.json | jq -r '{ overall:.status }'

The Task Manager itself is a roll-up of four subsections, so let’s next check how its child objects are doing:

$ cat kibana_task_manager_health.json | jq -r '{ capacity:.stats.capacity_estimation.status, config:.stats.configuration.status, runtime:.stats.runtime.status, workload:.stats.workload.status }'

For most performance issues, these subsections will flag obvious problems via their status. You can then confirm what is driving a subsection’s roll-up via its own child data: capacity, config, runtime, workload.
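
For example, if capacity_estimation is the subsection flagging, you can dump its underlying value object; the same pattern applies to configuration, runtime, and workload:

$ cat kibana_task_manager_health.json | jq '.stats.capacity_estimation.value'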

Example overall

Now that we’ve outlined troubleshooting overall Kibana health, let’s go through a literal example. This example reflects the most common Kibana health issue escalated to Support: Alerting Rules and/or Security Rules impacting Kibana and Elasticsearch resources, either through their sheer quantity or through individually expensive Rules.

So when I was going through this analysis, I ran an automated report, which was the basis of the earlier JQ commands:

$ cat kibana_status.json | jq '{ overall: .status.overall.level }'
{
  overall: "degraded"
}

# core
$ cat kibana_status.json | jq -r '.status.core|{ elasticsearch: .elasticsearch.level, savedObjects: .savedObjects.level }'
{
  elasticsearch: "available",
  savedObjects:  "available"
}

# main cascading plugins
$ cat kibana_status.json | jq -r '.status.plugins|{ taskManager:.taskManager.level, savedObjects:.savedObjects.level, security:.security.level, ruleRegistry:.ruleRegistry.level, reporting:.reporting.level }'
{
  taskManager:  "degraded",
  savedObjects: "degraded",
  security:     "degraded",
  ruleRegistry: "degraded",
  reporting:    "degraded"
}

$ cat kibana_status.json | jq -r '.status.plugins.taskManager'
{
  "level": "degraded",
  "summary": "Task Manager is unhealthy"
}

$ cat kibana_task_manager_health.json | jq -r '{ overall:.status }'
{
  overall: "warn"
}

$ cat kibana_task_manager_health.json | jq -r '{ capacity:.stats.capacity_estimation.status, config:.stats.configuration.status, runtime:.stats.runtime.status, workload:.stats.workload.status }'
{
  capacity:  "warn",
  config:    "OK",
  runtime:   "OK",
  workload:  "OK"
}

So at this point, the story I have rolling in my head is that Kibana is experiencing task-duration performance issues because the Task Manager is reporting capacity warnings. So, following the capacity troubleshooting doc, I’ll compare how many servers Kibana observes versus how many it proposes.

$ cat kibana_task_manager_health.json | jq -r '.stats.capacity_estimation.value|{observes:.observed.observed_kibana_instances, proposes:.proposed.provisioned_kibana}'
{
  "observes": 1,
  "proposes": 2
}

So, we have one observed server and Kibana estimates that it needs two. Kindly note a couple of troubleshooting caveats:

  • Observed counts recently seen Kibana server IDs, which can rotate upon Kibana restart, so it may report higher than the actual number of instances.
  • Task Manager’s proposed servers stat only considers horizontal scaling; however, vertical scaling may also resolve the issue, so kindly confirm via our scaling guidance.

In this particular case, I wanted to dig into why Kibana thought it needed more resources before scaling up. So, cross-comparing the Task Manager Health’s runtime section (even though it wasn’t reporting errors), I determined I had pretty significant drift going on.

$ cat kibana_task_manager_health.json | jq '.stats.runtime.value|{drift, load}'
{
  "drift": {
	"p50": 64544,
	"p90": 80640,
	"p95": 80640,
	"p99": 80640
  },
  "load": {
	"p50": 70,
	"p90": 100,
	"p95": 100,
	"p99": 100
  }
}

Since drift reports in milliseconds, this indicates my p50 drift is a bit over a minute. That’s pretty backed up. So I’m going to look at drift_by_type in order to see if Kibana features are backed up evenly across the board (which is very uncommon) or which features are backing up the others:

$ cat kibana_task_manager_health.json | jq -rc '.stats.runtime.value.drift_by_type|to_entries[]|{type:.key, p99:.value.p99}|select(.p99>80000)'
{"type":"Fleet-Usage-Logger","p99":108317}
{"type":"actions:.webhook","p99":80640}
{"type":"alerting:siem.mlRule","p99":108616}
{"type":"alerting:siem.queryRule","p99":107950}
{"type":"reports:monitor","p99":92879}

You’ll note I filtered on .p99>80000, a figure based on the output of my earlier drift query (plus a bit of laziness: rounding 80640 down to 80000 rather than typing out the full number).
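
If you’d rather not hard-code the threshold, a small variation can pull it from the overall drift stats first (same file and fields as above; note the strict greater-than drops anything sitting exactly at the p90):

$ P90=$(cat kibana_task_manager_health.json | jq '.stats.runtime.value.drift.p90')
$ cat kibana_task_manager_health.json | jq -rc --argjson p90 "$P90" '.stats.runtime.value.drift_by_type|to_entries[]|{type:.key, p99:.value.p99}|select(.p99>$p90)'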

The drift_by_type output tells us some one-off features (e.g., Fleet-Usage-Logger and reports:monitor) are elevated, but I care less about those since they aren’t recurring, user-impacting tasks. Instead, alerting:siem.mlRule and alerting:siem.queryRule appear to be my drift inducers, maybe with a secondary impact from actions:.webhook.

Example Rules

Historically, users are accustomed to Kibana being only a UI interface to Elasticsearch, with Reporting as its only expensive tasks. With the addition of the Alerting framework in 7.7 and especially Security’s Prebuilt Detection Rules in 7.9, users started enabling Rules to test them and leaving them enabled without realizing each one carries a degree of overhead against the Task Manager. If a Rule is applicable to your use case, then I’m all for it; but Support sees many users who haven’t realized the overhead (e.g., enabling 2,000 random Rules, each polling every 1–5 minutes, directly implies that Kibana needs to scale beyond its historically minimal sizing).

Boxing in this particular Alerting problem usually goes one of two ways: quantity or quality. Either the user (or their customers) has enabled a high quantity of Rules, and/or one or a handful of Rules are much more expensive than the rest.

Quantity. To quickly check the number of enabled Rules, I prefer to run the following against Elasticsearch’s Count API:

# all rules
GET .kibana*/_count?q=alert.enabled:true&filter_path=count

# security rules
GET .kibana*/_count?q=alert.enabled:true AND alert.consumer:siem&filter_path=count

Kindly note this targets system index data, which admin users can read but should never directly update via this backend view. As Support, I prefer to query this way because the index is Kibana space-agnostic, whereas the frontend APIs are space-aware.
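
If quantity does look elevated, a follow-up I like is breaking the enabled Rules down by rule type to see what dominates. This is a sketch under the assumption that alert.alertTypeId is keyword-mapped in your .kibana* indices (the endpoint and credentials are placeholders), and again it is read-only:

# Placeholder endpoint and credentials: substitute your own Elasticsearch values.
ES_URL="https://localhost:9200"
AUTH="elastic:changeme"

# Read-only breakdown of enabled Rules by rule type (assumes alert.alertTypeId is a keyword field).
curl -s -u "$AUTH" -H 'Content-Type: application/json' \
  "$ES_URL/.kibana*/_search?filter_path=aggregations" -d '
{
  "size": 0,
  "query": { "bool": { "filter": [
    { "term": { "type": "alert" } },
    { "term": { "alert.enabled": true } }
  ] } },
  "aggs": { "by_rule_type": { "terms": { "field": "alert.alertTypeId", "size": 20 } } }
}'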

Expensive. For the sake of our example, let’s assume quantity wasn’t our problem. Instead, we’ll want to check whether a subset of Rules is much more expensive than the rest. In my experience, this is quite common (e.g., 1 Rule out of 3,000 takes 1 minute while the rest take <7 seconds).

In order to troubleshoot Expensive Rules, we have a couple of options: 

  1. (Space-agnostic) use Kibana’s Event Log to run the Expensive Rule query 

  2. (Alerting) use Stack Management > Rules “Duration” table

  3. (Security) use Security > Rule Monitoring tab

Options 2 and 3 are very user friendly, and I highly recommend them. Option 1 is Support’s go-to since we don’t get to see user UIs and screenshots aren’t the same. So much so, in fact, that I frequently use a variation of Option 1 to distill the Event Log response into a table of problematic Rules to investigate:

# event.duration in nanoseconds
GET .kibana-event-log-*/_search
{
  "aggs": {
    "rule_id": {
      "aggs": {
        "avg": {
          "avg": { "field": "event.duration" }
        },
        "rule_name": {
          "terms": { "field": "rule.name", "size": 1 }
        }
      },
      "terms": {
        "field": "rule.id",
        "order": { "avg": "desc" },
        "size": 20
      }
    }
  },
  "query": {
    "bool": {
      "filter": [{ "range": { "@timestamp": { "gte": "now-24h", "lte": "now" } } }],
      "must": [{ "query_string": { "analyze_wildcard": true, "query": "event.provider:alerting AND event.action:execute" } }]
    }
  },
  "size": 0
}

You’ll note now-24h can be adjusted (e.g., now-7d) to see longer or shorter historical time frames.

The baseline duration expectations I try to set with users for Rules impacting overall performance are:

  • A Rule taking 10 seconds on a 1-minute interval may or may not be impactful, depending on resources
  • Anything taking longer than 1 minute is likely impacting (a query to count these follows below)
  • It seems egregious to have to say, but I run into it weekly: Rules taking more than 5 minutes to run are definitely problematic
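
To put a rough number on those longer-running buckets, you can count recent Rule executions that exceeded one minute directly against the Event Log. This is a sketch reusing the same event.provider/event.action filter as the search above and the fact that event.duration is in nanoseconds (endpoint and credentials are placeholders; adjust the threshold and time range as you see fit):

# Placeholder endpoint and credentials: substitute your own Elasticsearch values.
ES_URL="https://localhost:9200"
AUTH="elastic:changeme"

# Count Rule executions in the last 24 hours that ran longer than 1 minute (60,000,000,000 ns).
curl -s -u "$AUTH" -H 'Content-Type: application/json' \
  "$ES_URL/.kibana-event-log-*/_count" -d '
{
  "query": { "bool": {
    "filter": [
      { "range": { "@timestamp": { "gte": "now-24h", "lte": "now" } } },
      { "range": { "event.duration": { "gte": 60000000000 } } }
    ],
    "must": [
      { "query_string": { "query": "event.provider:alerting AND event.action:execute" } }
    ]
  } }
}'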

Stay healthy!

I’m a visual person, so Kibana is my favorite Elastic product. It’s so powerful. I use it every day to discover patterns and bring my data’s story to life. Therefore, its health is very important to me. We’ve now seen how to use two Kibana APIs to introspect its health and even expanded our example to look at Kibana’s Rule health as well. I recommend you check out Kibana’s Alerting to automate your monitoring today — just remember to disable any unneeded Rules when you’re done testing!
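
If you’d rather do that cleanup over the API than the UI, here’s a minimal sketch using the Alerting disable-rule endpoint (the endpoint path is documented; the URL, credentials, and rule ID are placeholders):

# Placeholder values: substitute your own Kibana endpoint, credentials, and rule ID.
KIBANA_URL="https://localhost:5601"
AUTH="elastic:changeme"
RULE_ID="<rule-id>"

# Disable a single Rule; Kibana write APIs require the kbn-xsrf header.
curl -s -u "$AUTH" -X POST -H 'kbn-xsrf: true' \
  "$KIBANA_URL/api/alerting/rule/$RULE_ID/_disable"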

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.