This guide shows you how to gracefully handle failed workflow runs. This involves best practices on resolving runtime errors, logging and manually retrying runs that have failed multiple times.

Why a workflow might fail

  • A step in your workflow throws a database error that causes your code to fail at runtime.
  • QStash calls your workflow URL, but the URL is not reachable - for example, because of a temporary outage of your deployment platform.
  • A single step takes longer than your platform’s function execution limit.

QStash automatically retries a failed step three times with exponential backoff to allow temporary outages to resolve.

A failed step is automatically retried three times

If, even after all retries, your step does not succeed, we’ll move the failed run into your Dead Letter Queue (DLQ). That way, you can always manually retry it again and debug the issue.

Manually retry from the Dead-Letter-Queue (DLQ)

If you want to take an action (a cleanup/log), you can configure either failureFunction or a failureUrl on the serve method of your workflow. These options allow you to define custom logic or an external endpoint that will be triggered when a failure occurs.

This feature is not yet available in workflow-py. See our Roadmap for feature parity plans and Changelog for updates.

The serve function you use to create a workflow endpoint accepts a failureFunction parameter - an easy way to gracefully handle errors (i.e. logging them to Sentry) or your custom handling logic.

api/workflow/route.ts
export const { POST } = serve<string>(
  async (context) => {
    // Your workflow logic...
  },
  {
    failureFunction: async ({
      context,
      failStatus,
      failResponse,
      failHeaders,
    }) => {
      // Handle error, i.e. log to Sentry
    },
  }
);

Note: If you use a custom authorization method to secure your workflow endpoint, add authorization to the failureFunction too. Otherwise, anyone could invoke your failure function. Read more here: securing your workflow endpoint.

Using a failureUrl

This feature is not yet available in workflow-py. See our Roadmap for feature parity plans and Changelog for updates.

The failureUrl handles cases where the service hosting your workflow URL is unavailable. In this case, a workflow failure notification is sent to another reachable endpoint.

export const { POST } = serve<string>(
  async (context) => {
    // Your workflow logic...
  },
  {
    failureUrl: "https://<YOUR_FAILURE_URL>/workflow-failure",
  }
);

Debugging failed runs

In your DLQ, filter messages via the Workflow URL or Workflow Run ID to search for a particular failure. We include all request and response headers and bodies to simplify debugging failed runs.

For example, let’s debug the following failed run. Judging by the status code 404, the Ngrok-Error-Code header of ERR_NGROK_3200 and the returned HTML body, we know that the URL our workflow called does not exist.