Workflow connection errors after recent upgrade

I recently upgraded our self-hosted Retool setup from 3.52.12 to 3.114.8. After the upgrade, however, I'm running into connection errors with the workflow service. Further details include:

  • Our setup is deployed on AWS using ECS Fargate
  • We are deploying with Terraform using the Retool managed Terraform module
  • I've upgraded our Terraform module to use the latest version
  • The observed errors are as follows:
    • One of our workflows that runs at a specified time each day is failing
    • When I trigger a manual run of the workflow, the workflow fails, but with no logs to indicate why it failed (see screenshot below)
  • The logs for the retool-workflows-worker-service are showing an ECONNREFUSED error on a couple of endpoints on the workflow backend at the time of workflow runs
  • I don't see any errors in the logs for the retool-workflows-backend-service. However, at the time of investigation the task had been restarted ~7 hours earlier, and some logs appear to be missing from the period where I would expect to see them. Is the backend service intended to be ephemeral? If not, perhaps it is crashing, and that is the cause of the error?
  • Our Retool applications are working as expected without any errors; it appears that only Retool workflows are affected
  • As far as I can tell, we're using the default settings for the workflows setup as provided by the Retool Terraform module
  • Both the workflows-worker service and the workflows-backend service are in the same security group with access to all ports enabled for inter-security group communications (as provided by the default settings)

Does anyone know what might be going on here, or what I should look into next?

Observed error from the workflows UI:

(Edits: Adding more details as I continue debugging)

I figured out what was going on: it looks like running a workflow needs more memory than it did in previous versions, and the workflows-backend service was running out of memory. We previously had it limited to 2 GiB; I increased the memory allocation for the workflows-backend containers to 4 GiB, and it looks like the workflows are running properly again!
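
For anyone hitting something similar, here is a rough sketch of how one might check whether recently stopped ECS tasks were killed for memory reasons. It is not what I actually ran. It assumes boto3 is installed and AWS credentials are configured; the cluster name in the usage note is a placeholder, and the helper names (`find_oom_kills`, `fetch_stopped_tasks`) are made up for illustration. Exit code 137 only means the container received SIGKILL, so treat it as a hint rather than proof of an out-of-memory kill.

```python
from typing import Any


def find_oom_kills(describe_tasks_response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one record per container that looks like it was killed for memory.

    Expects the dict shape returned by the ECS DescribeTasks API.
    """
    hits = []
    for task in describe_tasks_response.get("tasks", []):
        for container in task.get("containers", []):
            reason = container.get("reason") or ""
            exit_code = container.get("exitCode")
            # ECS reports memory kills with a reason starting "OutOfMemoryError".
            # Exit code 137 (SIGKILL) is a weaker hint, not proof of an OOM kill.
            if "OutOfMemory" in reason or exit_code == 137:
                hits.append(
                    {
                        "task": task.get("taskArn"),
                        "container": container.get("name"),
                        "exit_code": exit_code,
                        "reason": reason,
                        "task_stopped_reason": task.get("stoppedReason"),
                    }
                )
    return hits


def fetch_stopped_tasks(cluster: str, service: str) -> dict[str, Any]:
    """Fetch recently stopped tasks for one ECS service (hypothetical helper)."""
    import boto3  # imported lazily so find_oom_kills works without boto3

    ecs = boto3.client("ecs")
    arns = ecs.list_tasks(
        cluster=cluster, serviceName=service, desiredStatus="STOPPED"
    )["taskArns"]
    if not arns:
        return {"tasks": []}
    return ecs.describe_tasks(cluster=cluster, tasks=arns)
```

Usage would look something like `find_oom_kills(fetch_stopped_tasks("my-retool-cluster", "retool-workflows-backend-service"))`, where `my-retool-cluster` stands in for your own cluster name. ECS only keeps stopped tasks visible for a limited time, so this is most useful shortly after a failed run.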
