Troubleshooting and error handling

Troubleshooting deployment errors

You can view the status of deployment actions and get additional information on events, including why a particular event failed, for example due to misconfiguration.

  1. On the Applications page for serverlessrepo-elastic-serverless-forwarder, click Deployments.
  2. You can view the Deployment history here and refresh the page for updates as the application deploys. It should take around 5 minutes to deploy. If the deployment fails for any reason, the create events are rolled back and you can see an explanation of which event failed.

For example, if you don’t increase the visibility timeout for an SQS queue as described in Amazon S3 (via SQS event notifications), you will see a CREATE_FAILED status for the event, and the Status reason provides additional detail.
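If you prefer to check from a script, the same failure details are available from the CloudFormation stack that the deployment creates. The following is a minimal sketch using boto3; the stack name is an assumption and may differ in your account.

```python
import boto3

# Assumed stack name; check the CloudFormation console for the exact value.
STACK_NAME = "serverlessrepo-elastic-serverless-forwarder"

cloudformation = boto3.client("cloudformation")

# List recent stack events and print any failed resources together with
# the reason CloudFormation reports for the failure.
events = cloudformation.describe_stack_events(StackName=STACK_NAME)["StackEvents"]
for event in events:
    if event["ResourceStatus"].endswith("FAILED"):
        print(
            event["LogicalResourceId"],
            event["ResourceStatus"],
            event.get("ResourceStatusReason", "no reason reported"),
        )
```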

Increase debug information

To help with debugging, you can increase the amount of logging detail by adding an environment variable, either from the console as follows or programmatically (see the sketch after these steps):

  1. Select the serverless forwarder application from Lambda > Functions.
  2. Click Configuration, select Environment Variables, and choose Edit.
  3. Click Add environment variable, enter LOG_LEVEL as the Key and DEBUG as the Value, and click Save.
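If you manage the function with the AWS SDK instead of the console, the sketch below sets the same variable using boto3. The function name is an assumption; note that the Environment parameter replaces the whole variable set, so the existing variables are fetched and merged first.

```python
import boto3

# Assumed function name; use the name shown under Lambda > Functions.
FUNCTION_NAME = "serverlessrepo-elastic-serverless-forwarder"

lambda_client = boto3.client("lambda")

# update_function_configuration replaces all environment variables,
# so read the current set and merge LOG_LEVEL into it.
current = lambda_client.get_function_configuration(FunctionName=FUNCTION_NAME)
variables = current.get("Environment", {}).get("Variables", {})
variables["LOG_LEVEL"] = "DEBUG"

lambda_client.update_function_configuration(
    FunctionName=FUNCTION_NAME,
    Environment={"Variables": variables},
)
```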

Error handling

There are two kinds of errors that can occur during execution of the forwarder:

  1. Errors before the ingestion phase begins
  2. Errors during the ingestion phase

Errors before ingestion

For errors that occur before ingestion begins (1), the function will return a failure. These errors are mostly due to misconfiguration, incorrect permissions for AWS resources, etc. Most importantly, when an error occurs at this stage we don’t have any status for events that are ingested, so there’s no requirement to keep data, and the function can fail safely. In the case of SQS messages and Kinesis data stream records, both will go back into the queue and trigger the function again with the same payload.

Errors during ingestion

For errors that occur during ingestion (2), the situation is different. We do have a status for N failed events out of X total events; if we fail the whole function then all X events will be processed again. While the N failed ones could potentially succeed, the remaining X-N will definitely fail, because the data streams are append-only (the function would attempt to recreate already ingested documents using the same document ID).

So the forwarder won’t return a failure for errors during ingestion; instead, the payload of the event that failed to be ingested will be sent to a replay SQS queue, which is automatically set up during the deployment. The replay SQS queue is not set as an Event Source Mapping for the function by default, which means you can investigate and consume (or not) the message as preferred.
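One way to investigate the failed payloads is to read them from the replay queue without removing them, for example with a short boto3 script such as the sketch below. The queue URL is an assumption; copy the real one from the SQS console.

```python
import json

import boto3

# Assumed queue URL; the replay queue is created automatically during deployment.
REPLAY_QUEUE_URL = (
    "https://sqs.eu-central-1.amazonaws.com/123456789012/"
    "elastic-serverless-forwarder-replay-queue"
)

sqs = boto3.client("sqs")

# Peek at up to 10 messages. They become visible again once the
# visibility timeout expires, so nothing is deleted from the queue.
response = sqs.receive_message(
    QueueUrl=REPLAY_QUEUE_URL,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=5,
)

for message in response.get("Messages", []):
    print(json.dumps(json.loads(message["Body"]), indent=2))
```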

You can temporarily set the replay SQS queue as an Event Source Mapping for the forwarder, which means messages in the queue will be consumed by the function and ingestion retried for transient failures. If the failure persists, the affected log entry will be moved to a DLQ after three retries.
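Attaching the replay queue as a trigger can be done from the console, or with a call such as the one sketched below; the function name and queue ARN are assumptions. Remove or disable the mapping again once the transient failure is resolved.

```python
import boto3

# Assumed identifiers; use the real function name and replay queue ARN.
FUNCTION_NAME = "serverlessrepo-elastic-serverless-forwarder"
REPLAY_QUEUE_ARN = (
    "arn:aws:sqs:eu-central-1:123456789012:elastic-serverless-forwarder-replay-queue"
)

lambda_client = boto3.client("lambda")

# Temporarily let the forwarder consume the replay queue so that
# ingestion is retried for the failed payloads.
mapping = lambda_client.create_event_source_mapping(
    EventSourceArn=REPLAY_QUEUE_ARN,
    FunctionName=FUNCTION_NAME,
    Enabled=True,
    BatchSize=10,
)
print("Created event source mapping:", mapping["UUID"])

# Once ingestion succeeds again, delete the mapping:
# lambda_client.delete_event_source_mapping(UUID=mapping["UUID"])
```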

Every other error that occurs during the execution of the forwarder is silently ignored, and reported to the APM server if instrumentation is enabled.

Execution timeout

There is a grace period of 2 minutes before the Lambda function times out, during which no more ingestion will occur. Instead, during this grace period the forwarder collects and handles any unprocessed payloads in the batch of the input used as a trigger.

For CloudWatch Logs event, Kinesis data stream, S3 SQS event notification, and direct SQS message payload inputs, the unprocessed batch will be sent to the SQS continuing queue.