I recently upgraded our self-hosted Retool setup from 3.52.12 to 3.114.8. After the upgrade, however, I'm running into connection errors with the workflow service. Further details include:
- Our setup is deployed on AWS using ECS Fargate
- We are deploying with Terraform using the Retool managed Terraform module
- I've upgraded our Terraform module to use the latest
- The observed errors are as follows:
- One of our workflows that runs at a specified time each day is failing
- When I trigger a manual run of the workflow, the workflow fails, but with no logs to indicate why it failed (see screenshot below)
- The logs for the
retool-workflows-worker-service
are showing anECONNREFUSED
error on a couple of endpoints on the workflow backend at the time of workflow runs - I don't see any errors in the logs for the
retool-workflows-backend-service
, however at the current time of investigation I'm noticing that the task was restarted ~7 hours ago, and there appear to be some logs missing from when I would expect to see them. Is the backend service intended to be ephemeral? If not, perhaps it is crashing, and that is the cause of the error? - Our Retool applications are working as expected without any errors, it appears that only Retool workflows are affected
- As far as I can tell, we're using the default settings for the workflows setup as provided by the Retool Terraform module
- Both the workflows-worker service and the workflows-backend service are in the same security group with access to all ports enabled for inter-security group communications (as provided by the default settings)
Does anyone know what might be going on here, or what I should look into next?
Observed error from the workflows UI:
(Edits: Adding more details as I continue debugging)