Troubleshoot Kibana Task Manager

Kibana instances do not form a cluster. Instead, a Kibana instance manages its state by syncing it down to Elasticsearch under the kibana feature state. This is why Kibana becomes unavailable when Elasticsearch or Kibana-specific underlying indices are unavailable. Part of this synced state includes Kibana's Task Manager.

The Task Manager is used by a wide range of services in Kibana, such as Alerting, Actions, Reporting, and Telemetry. Unexpected behavior in these services can be related to issues in the Task Manager, which you can check from the Kibana server status.

This page describes how to troubleshoot the Task Manager health and resolve common problems you might encounter. If your problem isn’t described on this page, check open issues in the elastic/Kibana GitHub repository or share a Kibana diagnostic when you contact Elastic support for assistance.

Check Task Manager health

The following steps demonstrate a typical investigation flow. It assumes the Task Manager Health is saved locally as task_manager_health.json, and uses third-party tool JQ as a JSON processor. It does not require that the taskManager plugin is flagged as unhealthy under the Kibana server status.

Check the overall status.

cat kibana_task_manager_health.json | jq -rc '{ overall: .status }'

Check the subsections' statuses.

		cat kibana_task_manager_health.json | jq -r '{ capacity:. stats.capacity_estimation.status, config: .stats.configuration.status, runtime: .stats.runtime.status, workload: .stats.workload.status }'
		
	

Possible health statuses are OK, Warning, and Error.

It is possible that subsections show as healthy but that the Task Manager still reports high load or drift.

cat kibana_task_manager_health.json | jq '.stats.runtime.value|{drift, load}'

For most performance issues, the subsections indicate explicit health issues through their status. You can confirm the subsection status’s rollup from its own child data:

Section	Description
Configuration	This section summarizes the current configuration of the Task Manager. This includes dynamic configurations that change over time, such as `poll_interval` and `max_workers`, which can vary when the load on the system changes.
Workload	This section summarizes the workload across the cluster, including the tasks in the system, their type, and current status.
Runtime	This section tracks execution performance of the Task Manager, tracking task drift, worker load, and execution stats broken down by type, including duration and execution results.
Capacity Estimation	This section provides a rough estimate about the sufficiency of its capacity. These are estimates based on historical data and should not be used as predictions. Use these estimates when following the Task Manager scaling guidance.

Important

Some tasks (such as connector tasks) can incorrectly report their status as successful even when the task failed. The runtime and workload block return data about successes and failures and don't take this into consideration.

To get a better sense of action failures, refer to the Event log index for more accurate context about failures and successes.

Diagnose a root cause for drift

The following guide helps you identify a root cause for drift by making sense of the output from the Health monitoring endpoint.

By analyzing the different sections of the output, you can evaluate different theories that explain the drift in a deployment.

Evaluate the Configuration
- Kibana is configured to poll for tasks at a reduced rate
Evaluate the Runtime
Evaluate the Workload
Evaluate the Capacity Estimation

Retrieve the latest monitored health stats of a Kibana instance Task Manager:

$ curl -X GET api/task_manager/_health

The API returns the following:

		{
  "id": "15415ecf-cdb0-4fef-950a-f824bd277fe4",
  "timestamp": "2021-02-16T11:38:10.077Z",
  "status": "OK",
  "last_update": "2021-02-16T11:38:09.934Z",
  "stats": {
    "configuration": {
      "timestamp": "2021-02-16T11:29:05.055Z",
      "value": {
        "request_capacity": 1000,
        "monitored_aggregated_stats_refresh_rate": 60000,
        "monitored_stats_running_average_window": 50,
        "monitored_task_execution_thresholds": {
          "default": {
            "error_threshold": 90,
            "warn_threshold": 80
          },
          "custom": {}
        },
        "poll_interval": 3000,
        "max_workers": 10
      },
      "status": "OK"
    },
    "runtime": {
      "timestamp": "2021-02-16T11:38:09.934Z",
      "value": {
        "polling": {
          "last_successful_poll": "2021-02-16T11:38:09.934Z",
          "last_polling_delay": "2021-02-16T11:29:05.053Z",
          "duration": {
            "p50": 13,
            "p90": 128,
            "p95": 143,
            "p99": 168
          },
          "claim_conflicts": {
            "p50": 0,
            "p90": 0,
            "p95": 0,
            "p99": 0
          },
          "claim_mismatches": {
            "p50": 0,
            "p90": 0,
            "p95": 0,
            "p99": 0
          },
          "result_frequency_percent_as_number": {
            "Failed": 0,
            "NoAvailableWorkers": 0,
            "NoTasksClaimed": 80,
            "RanOutOfCapacity": 0,
            "RunningAtCapacity": 0,
            "PoolFilled": 20
          }
        },
        "drift": {
          "p50": 99,
          "p90": 1245,
          "p95": 1845,
          "p99": 2878
        },
        "load": {
          "p50": 0,
          "p90": 0,
          "p95": 10,
          "p99": 20
        },
        "execution": {
          "duration": {
            "alerting:.index-threshold": {
              "p50": 95,
              "p90": 1725,
              "p95": 2761,
              "p99": 2761
            },
            "alerting:xpack.uptime.alerts.monitorStatus": {
              "p50": 149,
              "p90": 1071,
              "p95": 1171,
              "p99": 1171
            },
            "actions:.index": {
              "p50": 166,
              "p90": 166,
              "p95": 166,
              "p99": 166
            }
          },
          "persistence": {
            "recurring": 88,
            "non_recurring": 4,
          },
          "result_frequency_percent_as_number": {
            "alerting:.index-threshold": {
              "Success": 100,
              "RetryScheduled": 0,
              "Failed": 0,
              "status": "OK"
            },
            "alerting:xpack.uptime.alerts.monitorStatus": {
              "Success": 100,
              "RetryScheduled": 0,
              "Failed": 0,
              "status": "OK"
            },
            "actions:.index": {
              "Success": 10,
              "RetryScheduled": 0,
              "Failed": 90,
              "status": "error"
            }
          }
        }
      },
      "status": "OK"
    },
    "workload": {
      "timestamp": "2021-02-16T11:38:05.826Z",
      "value": {
        "count": 26,
        "task_types": {
          "alerting:.index-threshold": {
            "count": 2,
            "status": {
              "idle": 2
            }
          },
          "actions:.index": {
            "count": 14,
            "status": {
              "idle": 2,
              "running": 2,
              "failed": 10
            }
          },
          "alerting:xpack.uptime.alerts.monitorStatus": {
            "count": 10,
            "status": {
              "idle": 10
            }
          },
        },
        "schedule": [
          ["10s", 2],
          ["1m", 2],
          ["60s", 2],
          ["5m", 2],
          ["60m", 4],
          ["3600s", 1],
          ["720m", 1]
        ],
        "non_recurring": 18,
        "owner_ids": 0,
        "overdue": 10,
        "overdue_non_recurring": 10,
        "estimated_schedule_density": [0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 3, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0],
        "capacity_requirements": {
          "per_minute": 6,
          "per_hour": 28,
          "per_day": 2
        }
      },
      "status": "OK"
    },
    "capacity_estimation": {
      "timestamp": "2021-02-16T11:38:06.826Z",
      "value": {
        "observed": {
          "observed_kibana_instances": 1,
          "max_throughput_per_minute_per_kibana": 200,
          "max_throughput_per_minute": 200,
          "minutes_to_drain_overdue": 1,
          "avg_recurring_required_throughput_per_minute": 28,
          "avg_recurring_required_throughput_per_minute_per_kibana": 28,
          "avg_required_throughput_per_minute": 28,
          "avg_required_throughput_per_minute_per_kibana": 28
        },
        "proposed": {
          "min_required_kibana": 1,
          "provisioned_kibana": 1,
          "avg_recurring_required_throughput_per_minute_per_kibana": 28,
          "avg_required_throughput_per_minute_per_kibana": 28
        }
      }
      "status": "OK"
    }
  }
}
		
	

Evaluate the Configuration

Theory: Kibana is configured to poll for tasks at a reduced rate.

Diagnosis: Evaluating the health stats, you can see the following output under stats.configuration.value:

		{
  "request_capacity": 1000,
  "monitored_aggregated_stats_refresh_rate": 60000,
  "monitored_stats_running_average_window": 50,
  "monitored_task_execution_thresholds": {
    "default": {
      "error_threshold": 90,
      "warn_threshold": 80
    },
    "custom": {}
  },
  "poll_interval": 3000,
  "max_workers": 10
}
		
	

poll_interval is set to the default value of 3000 milliseconds
max_workers is set to the default value of 10 workers

You can infer from this output that the Kibana instance polls for work every 3 seconds and can run 10 concurrent tasks.

Now suppose the output under stats.configuration.value is the following:

		{
  "request_capacity": 1000,
  "monitored_aggregated_stats_refresh_rate": 60000,
  "monitored_stats_running_average_window": 50,
  "monitored_task_execution_thresholds": {
    "default": {
      "error_threshold": 90,
      "warn_threshold": 80
    },
    "custom": {}
  },
  "poll_interval": 60000,
  "max_workers": 1
}
		
	

poll_interval is set to 60000 milliseconds, far higher than the default
max_workers is set to 1 worker, far lower than the default

You can infer from this output that the Kibana instance only polls for work once a minute and only picks up one task at a time. This throughput is unlikely to support mission critical services, such as Alerting or Reporting, and tasks will usually run late.

There are two possible reasons for such a configuration:

These settings have been configured manually, which can be resolved by reconfiguring these settings. For details, see Task Manager Settings.
Kibana has reduced its own throughput in reaction to excessive load on the Elasticsearch cluster.

Task Manager is equipped with a reactive self-healing mechanism in response to an increase in load related errors in Elasticsearch. This mechanism will increase the poll_interval setting (reducing the rate at which it queries Elasticsearch), and decrease the max_workers (reducing the amount of operations it executes against Elasticsearch). Once the error rate reduces, these settings are incrementally dialed up again, returning them to the configured settings.

This scenario can be identified by searching the Kibana Server Log for messages such as:
```
Max workers configuration is temporarily reduced after Elasticsearch returned 25 "too many request" error(s).
		
```
Deeper investigation into the high error rate experienced by the Elasticsearch cluster is required.

Evaluate the Runtime

Theory: Kibana is not polling as frequently as it should

Diagnosis: Evaluating the health stats, you see the following output under stats.runtime.value.polling:

		{
  "last_successful_poll": "2021-02-16T11:38:09.934Z",
  "last_polling_delay": "2021-02-14T11:29:05.053Z",
  "duration": {
    "p50": 13,
    "p90": 128,
    "p95": 143,
    "p99": 168
  },
  "claim_conflicts": {
    "p50": 0,
    "p90": 0,
    "p95": 0,
    "p99": 2
  },
  "claim_mismatches": {
    "p50": 0,
    "p90": 0,
    "p95": 0,
    "p99": 0
  },
  "result_frequency_percent_as_number": {
    "Failed": 0,
    "NoAvailableWorkers": 0,
    "NoTasksClaimed": 80,
    "RanOutOfCapacity": 0,
    "RunningAtCapacity": 0,
    "PoolFilled": 20
  }
}
		
	

Ensure the last successful polling cycle was completed no more than a couple of multiples of poll_interval in the past.
Ensure the duration of polling cycles is usually below 100ms. Longer durations are possible, but unexpected.
Ensure Kibana instances in the cluster are not encountering a high rate of version conflicts.
Ensure the majority of polling cycles result in positive outcomes, such as RunningAtCapacity or PoolFilled.

You can infer from this output that the Kibana instance is polling regularly. This assessment is based on the following:

Comparing the last_successful_poll to the timestamp (value of 2021-02-16T11:38:10.077Z) at the root, where you can see the last polling cycle took place 1 second before the monitoring stats were exposed by the health monitoring API.
Comparing the last_polling_delay to the timestamp (value of 2021-02-16T11:38:10.077Z) at the root, where you can see the last polling cycle delay took place 2 days ago, suggesting Kibana instances are not conflicting often.
The p50 of the duration shows that at least 50% of polling cycles take, at most, 13 milliseconds to complete.
Evaluating the result_frequency_percent_as_number:
- 80% of the polling cycles completed without claiming any tasks (suggesting that there aren’t any overdue tasks).
- 20% completed with Task Manager claiming tasks that were then executed.
- None of the polling cycles ended up occupying all of the available workers, as RunningAtCapacity has a frequency of 0%, suggesting there is enough capacity in Task Manager to handle the workload.

All of these stats are tracked as a running average, which means that they give a snapshot of a period of time (by default Kibana tracks up to 50 cycles), rather than giving a complete history.

Suppose the output under stats.runtime.value.polling.result_frequency_percent_as_number was the following:

		{
  "Failed": 30,
  "NoAvailableWorkers": 20,
  "NoTasksClaimed": 10,
  "RanOutOfCapacity": 10,
  "RunningAtCapacity": 10,
  "PoolFilled": 20
}
		
	

30% of polling cycles failed, which is a high rate.
20% of polling cycles are skipped as Task Manager has no capacity left to run tasks.
10% of polling cycles result in Task Manager claiming more tasks than it has capacity to run.
10% of polling cycles result in Task Manager claiming precisely as many tasks as it has capacity to run.

You can infer from this output that Task Manager is not healthy, as the failure rate is high, and Task Manager is fetching tasks it has no capacity to run. Analyzing the Kibana Server Log should reveal the underlying issue causing the high error rate and capacity issues.

The high NoAvailableWorkers rate of 20% suggests that there are many tasks running for durations longer than the poll_interval. For details on analyzing long task execution durations, see the long running tasks theory.

Theory: Kibana is polling as frequently as it should, but that isn’t often enough to keep up with the workload

Diagnosis: Evaluating the health stats, you can see the following output of drift and load under stats.runtime.value:

		{
  "drift": {
    "p50": 99,
    "p90": 1245,
    "p95": 1845,
    "p99": 2878
  },
  "load": {
    "p50": 0,
    "p90": 0,
    "p95": 10,
    "p99": 20
  },
}
		
	

drift shows us that at least 95% of tasks are running within 2 seconds of their scheduled time.
load shows us that Task Manager is idle at least 90% of the time, and never uses more than 20% of its available workers.

You can infer from these stats that this Kibana has plenty of capacity, and any delays you might be experiencing are unlikely to be addressed by expanding the throughput.

Suppose the output of drift and load was the following:

		{
  "drift": {
    "p50": 2999,
    "p90": 3845,
    "p95": 3845.75,
    "p99": 4078
  },
  "load": {
    "p50": 80,
    "p90": 100,
    "p95": 100,
    "p99": 100
  }
}
		
	

drift shows us that all tasks are running 3 to 4 seconds after their scheduled time.
load shows us that at least half of the time Task Manager is running at a load of 80%.

You can infer from these stats that this Kibana is using most of its capacity, but seems to keep up with the work most of the time. This assessment is based on the following:

The p90 of load is at 100%, and p50 is also quite high at 80%. This means that there is little to no room for maneuvering, and a spike of work might cause Task Manager to exceed its capacity.
Tasks run soon after their scheduled time, which is to be expected. A poll_interval of 3000 milliseconds would often experience a consistent drift of somewhere between 0 and 3000 milliseconds. A p50 drift of 2999 suggests that there is room for improvement, and you could benefit from a higher throughput.

For details on achieving higher throughput by adjusting your scaling strategy, see Scaling guidance.

Theory: Tasks run for too long, overrunning their schedule

Diagnosis: The Insufficient throughput to handle the scheduled workload theory analyzed a hypothetical scenario where both drift and load were unusually high.

Suppose an alternate scenario, where drift is high, but load is not, such as the following:

		{
    "drift": {
        "p50": 9799,
        "p90": 83845,
        "p95": 90328,
        "p99": 123845
    },
    "load": {
        "p50": 40,
        "p90": 75,
        "p95": 80,
        "p99": 100
    }
}
		
	

drift shows that most (if not all) tasks are running at least 32 seconds too late.
load shows that, for the most part, you have capacity to run more concurrent tasks.

In the preceding scenario, the tasks are running far too late, but you have sufficient capacity to run more concurrent tasks. A high capacity allows Kibana to run multiple different tasks concurrently. If a task is already running when its next schedule run is due, Kibana will avoid running it a second time, and instead wait for the first execution to complete.

If a task takes longer to execute than the cadence of its schedule, then that task will always overrun and experience a high drift. For example, suppose a task is scheduled to execute every 3 seconds, but takes 6 seconds to complete. It will consistently suffer from a drift of, at least, 3 seconds.

Evaluating the health stats in this hypothetical scenario, you see the following output under stats.runtime.value.execution.duration:

		{
  "alerting:.index-threshold": {
    "p50": 95,
    "p90": 1725,
    "p95": 2761,
    "p99": 2761
  },
  "alerting:.es-query": {
    "p50": 7149,
    "p90": 40071,
    "p95": 45282,
    "p99": 121845
  },
  "actions:.index": {
    "p50": 166,
    "p90": 166,
    "p95": 166,
    "p99": 166
  }
}
		
	

50% of the tasks backing index threshold alerts complete in less than 100 milliseconds.
50% of the tasks backing Elasticsearch query alerts complete in 7 seconds, but at least 10% take longer than 40 seconds.

You can infer from these stats that the high drift the Task Manager is experiencing is most likely due to Elasticsearch query alerts that are running for a long time.

Resolving this issue is context dependent and changes from case to case. In the preceding example, this would be resolved by modifying the queries in these alerts to make them faster, or improving the Elasticsearch throughput to speed up the exiting query.

Theory: Tasks take multiple attempts to succeed

Diagnosis: A high error rate could cause a task to appear to run late, when in fact it runs on time, but experiences a high failure rate.

Evaluating the preceding health stats, you see the following output under stats.runtime.value.execution.result_frequency_percent_as_number:

		{
  "alerting:.index-threshold": {
    "Success": 100,
    "RetryScheduled": 0,
    "Failed": 0,
    "status": "OK"
  },
  "alerting:xpack.uptime.alerts.monitorStatus": {
    "Success": 100,
    "RetryScheduled": 0,
    "Failed": 0,
    "status": "OK"
  },
  "actions:.index": {
    "Success": 8,
    "RetryScheduled": 0,
    "Failed": 92,
    "status": "error"
  }
}
		
	

100% of the tasks backing index threshold alerts successfully complete.
92% of the tasks backing ES index actions fail to complete.
The tasks backing ES index actions have exceeded the default monitored_task_execution_thresholds error configuration.

You can infer from these stats that most actions:.index tasks, which back the ES Index Kibana action, fail. Resolving that would require deeper investigation into the Kibana Server Log, where the exact errors are logged, and addressing these specific errors.

Theory: Spikes in non-recurring tasks are consuming a high percentage of the available capacity

Diagnosis: Task Manager uses ad-hoc non-recurring tasks to load balance operations across multiple Kibana instances.

Evaluating the preceding health stats, you see the following output under stats.runtime.value.execution.persistence:

		{
  "recurring": 88,
  "non_recurring": 12,
},
		
	

88% of executed tasks are recurring tasks
12% of executed tasks are non-recurring tasks

You can infer from these stats that the majority of executions consist of recurring tasks at 88%. You can use the execution.persistence stats to evaluate the ratio of consumed capacity, but on their own, you should not make assumptions about the sufficiency of the available capacity.

To assess the capacity, you should evaluate these stats against the load under stats.runtime.value:

		{
    "load": {
        "p50": 40,
        "p90": 40,
        "p95": 60,
        "p99": 80
    }
}
		
	

You can infer from these stats that it is very unusual for Task Manager to run out of capacity, so the capacity is likely sufficient to handle the amount of non-recurring tasks.

Suppose you have an alternate scenario, where you see the following output under stats.runtime.value.execution.persistence:

		{
  "recurring": 60,
  "non_recurring": 40,
},
		
	

60% of executed tasks are recurring tasks
40% of executed tasks are non-recurring tasks

You can infer from these stats that even though most executions are recurring tasks, a substantial percentage of executions are non-recurring tasks at 40%.

Evaluating the load under stats.runtime.value, you see the following:

		{
    "load": {
        "p50": 70,
        "p90": 100,
        "p95": 100,
        "p99": 100
    }
}
		
	

You can infer from these stats that it is quite common for this Kibana instance to run out of capacity. Given the high rate of non-recurring tasks, it would be reasonable to assess that there is insufficient capacity in the Kibana cluster to handle the amount of tasks.

Keep in mind that these stats give you a glimpse at a moment in time, and even though there has been insufficient capacity in recent minutes, this might not be true in other times where fewer non-recurring tasks are used. We recommend tracking these stats over time and identifying the source of these tasks before making sweeping changes to your infrastructure.

Evaluate the Workload

Predicting the required throughput a deployment might need to support Task Manager is difficult, as features can schedule an unpredictable number of tasks at a variety of scheduled cadences.

Health monitoring provides statistics that make it easier to monitor the adequacy of the existing throughput. By evaluating the workload, the required throughput can be estimated, which is used when following the Task Manager Scaling guidance.

Evaluating the preceding health stats in the previous example, you see the following output under stats.workload.value:

		{
  "count": 26,
  "task_types": {
    "alerting:.index-threshold": {
      "count": 2,
      "status": {
        "idle": 2
      }
    },
    "actions:.index": {
      "count": 14,
      "status": {
        "idle": 2,
        "running": 2,
        "failed": 10
      }
    },
    "alerting:xpack.uptime.alerts.monitorStatus": {
      "count": 10,
      "status": {
        "idle": 10
      }
    },
  },
  "non_recurring": 0,
  "owner_ids": 1,
  "schedule": [
    ["10s", 2],
    ["1m", 2],
    ["90s", 2],
    ["5m", 8]
  ],
  "overdue_non_recurring": 0,
  "overdue": 0,
  "estimated_schedule_density": [
    0, 1, 0, 0, 0, 1, 0, 1, 0, 1,
    0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
    0, 3, 0, 0, 0, 1, 0, 1, 0, 1,
    0, 0, 0, 1, 0, 0, 1, 1, 1, 0
  ],
  "capacity_requirements": {
    "per_minute": 14,
    "per_hour": 240,
    "per_day": 0
  }
}
		
	

There are 26 tasks in the system, including regular tasks, recurring tasks, and failed tasks.
There are 2 idle index threshold alert tasks, meaning they are scheduled to run at some point in the future.
Of the 14 tasks backing the ES index action, 10 have failed and 2 are running.
There are no non-recurring tasks in the queue.
There is one Task Manager actively executing tasks. There might be additional idle Task Managers, but they aren’t actively executing tasks at this moment in time.
A histogram of all scheduled recurring tasks shows that 2 tasks are scheduled to run every 10 seconds, 2 tasks are scheduled to run once a minute, and so on.
There are no overdue non-recurring tasks. Non-recurring tasks are usually scheduled to execute immediately, so overdue non-recurring tasks are often a symptom of a congested system.
There are no overdue tasks, which means that all tasks that should have run by now have run.
This histogram shows the tasks scheduled to run throughout the upcoming 20 polling cycles. The histogram represents the entire deployment, rather than just this Kibana instance.
The capacity required to handle the recurring tasks in the system. These are buckets, rather than aggregated sums, and we recommend evaluating the Capacity Estimation section, rather than evaluating these buckets yourself.

The workload section summarizes the work load across the cluster, listing the tasks in the system, their types, schedules, and current status.

You can infer from these stats that a default deployment should suffice. This assessment is based on the following:

The estimated schedule density is low.
There aren’t many tasks in the system relative to the default capacity.

Suppose the output of stats.workload.value looked something like this:

		{
  "count": 2191,
  "task_types": {
    "alerting:.index-threshold": {
      "count": 202,
      "status": {
        "idle": 183,
        "claiming": 2,
        "running": 19
      }
    },
    "alerting:.es-query": {
      "count": 225,
      "status": {
        "idle": 225,
      }
    },
    "actions:.index": {
      "count": 89,
      "status": {
        "idle": 24,
        "running": 2,
        "failed": 63
      }
    },
    "alerting:xpack.uptime.alerts.monitorStatus": {
      "count": 87,
      "status": {
        "idle": 74,
        "running": 13
      }
    },
  },
  "non_recurring": 0,
  "owner_ids": 1,
  "schedule": [
    ["10s", 38],
    ["1m", 101],
    ["90s", 55],
    ["5m", 89],
    ["20m", 62],
    ["60m", 106],
    ["1d", 61]
  ],
  "overdue_non_recurring": 0,
  "overdue": 0,
  "estimated_schedule_density": [
    10, 1, 0, 10, 0, 20, 0, 1, 0, 1,
    9, 0, 3, 10, 0, 0, 10, 10, 7, 0,
    0, 31, 0, 12, 16, 31, 0, 10, 0, 10,
    3, 22, 0, 10, 0, 2, 10, 10, 1, 0
  ],
  "capacity_requirements": {
    "per_minute": 329,
    "per_hour": 4272,
    "per_day": 61
  }
}
		
	

There are 2,191 tasks in the system.
The scheduled tasks are distributed across a variety of cadences.
The schedule density shows that you expect to exceed the default 10 concurrent tasks.
There are 329 task executions that recur within the space of every minute.
There are 4,273 task executions that recur within the space of every hour.
There are 61 task executions that recur within the space of every day.

You can infer several important attributes of your workload from this output:

There are many tasks in your system and ensuring these tasks run on their scheduled cadence will require attention to the Task Manager throughput.
Assessing the high frequency tasks (tasks that recur at a cadence of a couple of minutes or less), you must support a throughput of approximately 330 task executions per minute (38 every 10 seconds + 101 every minute).
Assessing the medium frequency tasks (tasks that recur at a cadence of an hour or less), you must support an additional throughput of over 4,272 task executions per hour (55 every 90 seconds + 89 every 5 minutes, + 62 every 20 minutes + 106 each hour). You can average the needed throughput for the hour by counting these tasks as an additional 70 - 80 tasks per minute.
Assessing the estimated schedule density, there are cycles that are due to run upwards of 31 tasks concurrently, and along side these cycles, there are empty cycles. You can expect Task Manager to load balance these tasks throughout the empty cycles, but this won’t leave much capacity to handle spikes in fresh tasks that might be scheduled in the future.

These rough calculations give you a lower bound to the required throughput, which is at least 410 tasks per minute to ensure recurring tasks are executed, at their scheduled time. This throughput doesn’t account for nonrecurring tasks that might have been scheduled, nor does it account for tasks (recurring or otherwise) that might be scheduled in the future.

Given these inferred attributes, it would be safe to assume that a single Kibana instance with default settings would not provide the required throughput. It is possible that scaling horizontally by adding a couple more Kibana instances will.

For details on scaling Task Manager, see Scaling guidance.

Evaluate the Capacity Estimation

Task Manager is constantly evaluating its runtime operations and workload. This enables Task Manager to make rough estimates about the sufficiency of its capacity.

As the name suggests, these are estimates based on historical data and should not be used as predictions. These estimations should be evaluated alongside the detailed Health monitoring stats before making changes to infrastructure. These estimations assume all Kibana instances are configured identically.

We recommend using these estimations when following the Task Manager Scaling guidance.

Evaluating the health stats in the previous example, you can see the following output under stats.capacity_estimation.value:

		{
  "observed": {
    "observed_kibana_instances": 1,
    "minutes_to_drain_overdue": 1,
    "max_throughput_per_minute_per_kibana": 200,
    "max_throughput_per_minute": 200,
    "avg_recurring_required_throughput_per_minute": 28,
    "avg_recurring_required_throughput_per_minute_per_kibana": 28,
    "avg_required_throughput_per_minute": 28,
    "avg_required_throughput_per_minute_per_kibana": 28
  },
  "proposed": {
    "min_required_kibana": 1,
    "provisioned_kibana": 1,
    "avg_recurring_required_throughput_per_minute_per_kibana": 28,
    "avg_required_throughput_per_minute_per_kibana": 28
  }
}
		
	

These estimates assume that there is one Kibana instance actively executing tasks.
Based on past throughput the overdue tasks in the system could be executed within 1 minute.
Assuming all Kibana instances in the cluster are configured the same as this instance, the maximum available throughput is 200 tasks per minute.
On average, the recurring tasks in the system have historically required a throughput of 28 tasks per minute.
On average, regardless of whether they are recurring or otherwise, the tasks in the system have historically required a throughput of 28 tasks per minute.
One Kibana instance should be sufficient to run the current recurring workload.
We propose waiting for the workload to change before additional Kibana instances are provisioned.

The capacity_estimation section is made up of two subsections:

observed estimates the current capacity by observing historical runtime and workload statistics
proposed estimates the baseline Kibana cluster size and the expected throughput under such a deployment strategy

You can infer from these estimates that the current system is under-utilized and has enough capacity to handle many more tasks than it currently does.

Suppose an alternate scenario, where you see the following output under stats.capacity_estimation.value:

		{
  "observed": {
    "observed_kibana_instances": 2,
    "max_throughput_per_minute_per_kibana": 200,
    "max_throughput_per_minute": 400,
    "minutes_to_drain_overdue": 12,
    "avg_recurring_required_throughput_per_minute": 354,
    "avg_recurring_required_throughput_per_minute_per_kibana": 177,
    "avg_required_throughput_per_minute": 434,
    "avg_required_throughput_per_minute_per_kibana": 217
  },
  "proposed": {
    "min_required_kibana": 2,
    "provisioned_kibana": 3,
    "avg_recurring_required_throughput_per_minute_per_kibana": 118,
    "avg_required_throughput_per_minute_per_kibana": 145
  }
}
		
	

These estimates assume that there are two Kibana instance actively executing tasks.
The maximum available throughput in the system currently is 400 tasks per minute.
Based on past throughput the overdue tasks in the system should be executed within 12 minutes.
On average, the recurring tasks in the system have historically required a throughput of 354 tasks per minute.
On average, each Kibana instance utilizes 177 tasks per minute of its capacity to execute recurring tasks.
On average the tasks in the system have historically required a throughput of 434 tasks per minute.
The system estimates that at least two Kibana instances are required to run the current recurring workload.
The system recommends provisioning three Kibana instances to handle the workload.
Once a third Kibana instance is provisioned, the capacity utilized by each instance to execute recurring tasks should drop from 177 to 118 tasks per minute.
Taking into account historical ad-hoc task execution, we estimate the throughput required of each Kibana instance will drop from 217 task per minute to 145, once a third Kibana instance is provisioned.

Evaluating by these estimates, we can infer some interesting attributes of our system:

These estimates are produced based on the assumption that there are two Kibana instances in the cluster. This number is based on the number of Kibana instances actively executing tasks in recent minutes. At times this number might fluctuate if Kibana instances remain idle, so validating these estimates against what you know about the system is recommended.
There appear to be so many overdue tasks that it would take 12 minutes of executions to catch up with that backlog. This does not take into account tasks that might become overdue during those 12 minutes. Although this congestion might be temporary, the system could also remain consistently under provisioned and might never drain the backlog entirely.
Evaluating the recurring tasks in the workload, the system requires a throughput of 354 tasks per minute on average to execute tasks on time, which is lower then the estimated maximum throughput of 400 tasks per minute. Once we take into account historical throughput though, we estimate the required throughput at 434 tasks per minute. This suggests that, historically, approximately 20% of tasks have been ad-hoc non-recurring tasks, the scale of which are harder to predict than recurring tasks.

You can infer from these estimates that the capacity in the current system is insufficient and at least one additional Kibana instance is required to keep up with the workload.

For details on scaling Task Manager, see Scaling guidance.

Diagnose a root cause for late runs

Tasks with small schedule intervals run late

Problem:

Tasks are scheduled to run every 2 seconds, but seem to be running late.

Solution:

The Task Manager polls for tasks at the cadence specified by the xpack.task_manager.poll_interval setting, which is 3 seconds by default. This means that a task can run late if it uses a schedule that is smaller than this setting.

You can adjust the xpack.task_manager.poll_interval setting. However, this adds some load to both Kibana and Elasticsearch instances in the cluster, as they have to perform more queries.

Tasks run late

Problem:

The most common symptom of an underlying problem in the Task Manager is that tasks appear to run late. For instance, recurring tasks might run at an inconsistent cadence, or long after their scheduled time.

Solution:

By default, Kibana polls for tasks at a rate of 10 tasks every 3 seconds.

If many tasks are scheduled to run at the same time, pending tasks queue in Elasticsearch. Each Kibana instance then polls for pending tasks at a rate of up to 10 tasks at a time, at 3 second intervals. It is possible for pending tasks in the queue to exceed this capacity and run late as a result.

This type of delay is known as drift.The root cause for drift depends on the specific usage, and there are no hard and fast rules for addressing drift.

For example:

If drift is caused by an excess of concurrent tasks relative to the available capacity of Kibana instances in the cluster, you can expand the throughput of the cluster.
If drift is caused by long running tasks that overrun their scheduled cadence, you can reconfigure the tasks in question.

Refer to Diagnose a root cause for drift for step-by-step instructions on identifying the correct resolution.

Drift is often addressed by adjusting the scaling of the deployment to better suit your usage. For details on scaling the Task Manager, refer to Scaling guidance.

What do I do if the Task’s `runAt` is in the past?

Problem:

Tasks' property runAt is in the past.

Solution:

Wait a bit before declaring it as a lost cause, as Task Manager might just be falling behind on its work. You should take a look at the Kibana log and see what you can find that relates to Task Manager. In a healthy environment you should see a log line that indicates that Task Manager was successfully started when Kibana was:

		server log [12:41:33.672] [info][plugins][taskManager][taskManager] TaskManager is identified by the Kibana UUID: 5b2de169-2785-441b-ae8c-186a1936b17d
		
	

If you see that message and no other errors that relate to Task Manager, it’s most likely that Task Manager is running fine and has simply not had the chance to pick the task up yet. If, on the other hand, the runAt is severely overdue, then it’s worth looking for other Task Manager or alerting-related errors, as something else may have gone wrong. It’s worth looking at the status field, as it might have failed, which would explain why it hasn’t been picked up or it might be running which means the task might simply be a very long running one.

Frequently asked questions

Inline scripts are disabled in Elasticsearch

Problem:

Tasks are not running, and the server logs contain the following error message:

		[warning][plugins][taskManager] Task Manager cannot operate when inline scripts are disabled in Elasticsearch
		
	

Solution:

Inline scripts are a hard requirement for the Task Manager to function. To enable inline scripting, check the Elasticsearch documentation for configuring allowed script types setting.

What do I do if the Task is marked as failed?

Problem:

Tasks marked as failed.

Solution:

Broadly speaking the Alerting framework is meant to gracefully handle the cases where a task is failing by rescheduling a fresh run in the future. If this fails to happen, then that means something has gone wrong in the underlying implementation and this isn’t expected. Ideally you should try and find any log lines that relate to this rule and its task, and use these to help us investigate further.

Task Manager Kibana Log

Task manager will write log lines to the Kibana Log on certain occasions. Below are some common log lines and what they mean.

Task Manager has run out of Available Workers:

		server log [12:41:33.672] [info][plugins][taskManager][taskManager] [Task Ownership]: Task Manager has skipped Claiming Ownership of available tasks at it has ran out Available Workers.
server log [12:41:33.672] [warn][plugins][taskManager][taskManager] taskManager plugin is now degraded: Task Manager is unhealthy - Reason: setting HealthStatus.Error because of expired hot timestamps

This log message tells us that Task Manager is not managing to keep up with the sheer amount of work it has been tasked with completing. This might mean that rules are not running at the frequency that was expected (instead of running every 5 minutes, it runs every 7-8 minutes, just as an example).

By default Task Manager is limited to 10 tasks and this can be bumped up by setting a higher number in the kibana.yml file using the xpack.task_manager.capacity configuration. It is important to keep in mind that a higher number of tasks running at any given time means more load on both Kibana and Elasticsearch; only change this setting if increasing load in your environment makes sense.

Another approach to addressing this might be to tell workers to run at a higher rate, rather than adding more of them, which would be configured using xpack.task_manager.poll_interval. This value dictates how often Task Manager checks to see if there’s more work to be done and uses milliseconds (by default it is 3000, which means an interval of 3 seconds).

Before changing either of these numbers it’s highly recommended to investigate what Task Manager can’t keep up - Are there an unusually high number of rules in the system? Are rules failing often, forcing Task Manager to re-run them constantly? Is Kibana under heavy load? There could be a variety of issues, none of which should be solved by simply changing these configurations.

Task TaskType failed in attempt to run:

		server log [12:41:33.672] [info][plugins][taskManager][taskManager] Task TaskType "alerting:example.always-firing" failed in attempt to run: Unable to load resource ‘/api/something’
		
	

This log message tells us that when Task Manager was running one of our rules, it’s task errored and, as a result, failed. In this case we can tell that the rule that failed was of type alerting:example.always-firing and that the reason it failed was Unable to load resource ‘/api/something’ . This is a contrived example, but broadly, if you see a message with this kind of format, then this tells you a lot about where the problem might be.

For example, in this case, we’d expect to see a corresponding log line from the Alerting framework itself, saying that the rule failed. You should look in the Kibana log for a line similar to the log line below (probably shortly before the Task Manager log line):

		Executing Rule "27559295-44e4-4983-aa1b-94fe043ab4f9" has resulted in Error: Unable to load resource ‘/api/something’
		
	

This would confirm that the error did in fact happen in the rule itself (rather than the Task Manager) and it would help us pin-point the specific ID of the rule which failed: 27559295-44e4-4983-aa1b-94fe043ab4f9

We can now use the ID to find out more about that rule by using the http endpoint to find that rule’s configuration and current state to help investigate what might have caused the issue.