Common problems

This section describes common problems you might encounter with APM Server.

503: Queue is full

APM Server has an internal queue that buffers documents until they can be delivered to Elasticsearch. The internal queue helps to:

  • alleviate problems that might occur if Elasticsearch is intermittently unavailable
  • handle large spikes of data arriving at the APM Server at the same time
  • send documents to Elasticsearch in bulk, instead of individually

When the internal queue is full, APM Server returns an HTTP 503 status with the message "Queue is full".

A full queue generally means that the agents collect more data than APM Server is able to deliver to Elasticsearch. This might happen when APM Server is not configured properly for the size of your Elasticsearch cluster, or when your Elasticsearch cluster is underpowered or not configured properly for the given workload.

The queue can also fill up if Elasticsearch is unavailable for a prolonged period, if it runs out of disk space, or if a sudden spike of data arrives at the APM Server.

If the APM Server only returns 503 responses, it might indicate that an Elasticsearch disk is full. If the APM Server returns interleaved 503 and 202 responses, it might indicate that the APM Server can’t process that much data.

To solve the "503: Queue is full" problem, you have a few options:

  • Tune APM Server output parameters for your Elasticsearch cluster
  • Tune Elasticsearch for higher ingestion
  • Reduce the amount of data collected
  • Increase internal queue size
  • Delete old data

503: Request timed out waiting to be processed

There is a limit to the number of requests that the APM Server can process concurrently. When that limit is reached and a request from an agent is blocked, the APM Server returns an HTTP 503 status with the message "Request timed out waiting to be processed". The limit is determined by the apm-server.concurrent_requests configuration parameter.
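As a sketch only, the limit is raised in apm-server.yml; the value below is illustrative rather than a recommendation, and the right number depends on your APM Server version and hardware:

# apm-server.yml (sketch; the value is an assumption to tune, not a default)
apm-server.concurrent_requests: 40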

To alleviate this problem, you can try to:

  • Reduce the payload size
  • Add APM Server nodes

Solutions

Tune APM Server output parameters for your Elasticsearch cluster

If your Elasticsearch cluster is sized properly, but not ingesting the amount of data you expect, you can tweak APM Server options to make better use of the cluster:

  • Increase output.elasticsearch.workers
  • Make sure that output.elasticsearch.bulk_max_size is high, for example 5120. The default of 50 is very conservative.

    If you increase bulk_max_size, make sure to also increase queue.mem.events. A good rule of thumb is that queue.mem.events should equal output.elasticsearch.workers multiplied by output.elasticsearch.bulk_max_size, as illustrated in the sketch after this list.
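Purely as an illustration of how these settings relate, an apm-server.yml fragment might look like the following; the hosts value and all numbers are placeholders, not tuned recommendations:

# apm-server.yml (illustrative values only)
output.elasticsearch:
  hosts: ["localhost:9200"]   # placeholder
  workers: 4                  # more workers send more bulk requests in parallel
  bulk_max_size: 5120         # documents per bulk request

# Rule of thumb from above: queue.mem.events = workers * bulk_max_size
queue.mem.events: 20480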

Tune Elasticsearch for higher ingestion

Increasing the Elasticsearch ingestion rate is a large topic. See Tune for indexing speed, particularly the sections about how to increase the refresh interval, disable swapping, use faster hardware, and set the indexing buffer size.
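As a minimal node-level sketch of that tuning (the values are examples only, bootstrap.memory_lock additionally requires memlock limits to be raised at the operating-system level, and index.refresh_interval is a per-index setting changed through the index settings API or an index template rather than in this file):

# elasticsearch.yml (illustrative)
bootstrap.memory_lock: true             # keep the Elasticsearch heap from being swapped out
indices.memory.index_buffer_size: 20%   # give the indexing buffer a larger share of the heap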

Reduce the amount of data collected

The most obvious way to reduce the number of documents to be indexed is to reduce the transaction sample rate. The transaction sample rate is controlled in the agent configuration (for example, in the Python and Node.js agents).

Reducing the transaction sample rate does not affect the collection of metrics such as Requests Per Second.
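As one possible sketch, if the instrumented services are deployed with Docker Compose, the sample rate can be set through the agents' environment variables. ELASTIC_APM_TRANSACTION_SAMPLE_RATE is the usual variable name for the Python and Node.js agents, but verify it against the configuration reference for your agent version; the service name and value below are assumptions:

# docker-compose.yml fragment (sketch)
services:
  my-instrumented-service:    # hypothetical service name
    environment:
      ELASTIC_APM_TRANSACTION_SAMPLE_RATE: "0.2"   # keep roughly 20% of transactions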

Increase internal queue size

A larger internal queue allows Elasticsearch to be unavailable for longer periods, and it alleviates problems that might result from sudden spikes of data. You can increase the queue size by overriding queue.mem.events. Be aware that increasing queue.mem.events can significantly affect APM Server memory usage.
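For example, a queue sized for longer outages and spikes might be sketched in apm-server.yml like this; the flush options shown are the standard Beats memory-queue settings, and every value here is a placeholder rather than a recommendation:

# apm-server.yml (illustrative)
queue.mem:
  events: 40960            # bigger buffer, at the cost of more APM Server memory
  flush.min_events: 512    # publish once at least this many events are buffered
  flush.timeout: 1s        # ...or after this long, whichever comes first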

Delete old data

The internal queue might be full because Elasticsearch ran out of disk space and started rejecting insertions. If this happens, you will need to remove old indices to make room for new data. To do this you can use a tool like Curator and set up a cron job to run it periodically.

By default APM indices have the pattern apm-%{[beat.version]}-{type}-%{+yyyy.MM.dd}. With the curator command line interface you can, for instance, see all your existing indices:

curator_cli --host localhost show_indices --filter_list '[{"filtertype":"pattern","kind":"prefix","value":"apm-"}]'

apm-6.3.0-error-2018.05.10
apm-6.3.0-error-2018.05.11
apm-6.3.0-error-2018.05.12
apm-6.3.0-sourcemap
apm-6.3.0-span-2018.05.10
apm-6.3.0-span-2018.05.11
apm-6.3.0-span-2018.05.12
apm-6.3.0-transaction-2018.05.10
apm-6.3.0-transaction-2018.05.11
apm-6.3.0-transaction-2018.05.12

And then delete any span indices older than 1 day:

curator_cli --host localhost delete_indices --filter_list '[{"filtertype":"pattern","kind":"prefix","value":"apm-6.3.0-span-"}, {"filtertype":"age","source":"name","timestring":"%Y.%m.%d","unit":"days","unit_count":1,"direction":"older"}]'

INFO      Deleting selected indices: [apm-6.3.0-span-2018.05.10, apm-6.3.0-span-2018.05.11]
INFO      ---deleting index apm-6.3.0-span-2018.05.10
INFO      ---deleting index apm-6.3.0-span-2018.05.11
INFO      "delete_indices" action completed.

You can read more on Curator here: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/index.html
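For the cron-driven cleanup mentioned above, it can be more convenient to express the same filters as a Curator action file and run curator against it on a schedule (for example, curator --config /path/to/curator.yml delete-old-apm-spans.yml). The file name and unit_count below mirror the command-line example and are otherwise up to you:

# delete-old-apm-spans.yml -- Curator action file (sketch)
actions:
  1:
    action: delete_indices
    description: Delete APM span indices older than 1 day
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: apm-6.3.0-span-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 1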

Reduce the payload size

Large payloads coming from agents may result in "503: Request timed out waiting to be processed" messages. You can reduce the payload size by decreasing the max_queue_size setting in the agents. This will result in agents sending smaller payloads to the APM Server, but the requests will be more frequent. See the documentation for the Python and Node.js agents for more information.

Add APM Server nodes

When the APM Server cannot process incoming requests quickly enough, you will see "503: Request timed out waiting to be processed" messages.

The best way to avoid this problem is to add more processing power to your APM Server cluster. You can either migrate your APM Server processes to more powerful machines or add more machines.