Workflow connection errors after recent upgrade

edahlseng · February 5, 2025, 7:00pm

I recently upgraded our self-hosted Retool setup from 3.52.12 to 3.114.8. After the upgrade, however, I'm running into connection errors with the workflow service. Further details include:

- Our setup is deployed on AWS using ECS Fargate
- We are deploying with Terraform using the Retool-managed Terraform module
- I've upgraded our Terraform module to the latest version
- The observed errors are as follows:
  - One of our workflows that runs at a specified time each day is failing
  - When I trigger a manual run of the workflow, the workflow fails, but with no logs to indicate why it failed (see screenshot below)
- The logs for the retool-workflows-worker-service show an ECONNREFUSED error on a couple of endpoints on the workflow backend at the time of workflow runs
- I don't see any errors in the logs for the retool-workflows-backend-service. However, at the time of investigation, I noticed that the task was restarted ~7 hours ago, and some logs appear to be missing from when I would expect to see them. Is the backend service intended to be ephemeral? If not, perhaps it is crashing, and that is the cause of the error?
- Our Retool applications are working as expected without any errors; it appears that only Retool workflows are affected
- As far as I can tell, we're using the default settings for the workflows setup as provided by the Retool Terraform module
- Both the workflows-worker service and the workflows-backend service are in the same security group, with access to all ports enabled for inter-security-group communication (as provided by the default settings)

Does anyone know what might be going on here, or what I should look into next?

Observed error from the workflows UI: (screenshot not preserved)

(Edits: Adding more details as I continue debugging)

edahlseng · February 5, 2025, 8:13pm

I figured out what was going on: running a workflow appears to need more memory in this version than in previous versions, and the workflows-backend service was running out of memory. We previously had its memory limited to 2 GiB; I increased the memory allocation for the workflows-backend containers to 4 GiB, and the workflows are running properly again!
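
For anyone hitting the same symptoms (ECONNREFUSED from the worker, backend task restarted with gaps in its logs), here is a rough sketch of how one might confirm an out-of-memory kill from the stopped-task details. It assumes you have already fetched task descriptions in the shape ECS returns from `describe_tasks` (for example via `boto3.client("ecs").describe_tasks(...)`). The function name, the sample ARN, and the container name below are hypothetical, and an exit code of 137 on its own only suggests an OOM kill rather than proving one.

```python
from typing import Any

# ECS typically reports memory kills with a container "reason" that contains
# "OutOfMemoryError". Exit code 137 (128 + SIGKILL) is a weaker hint, since a
# container can be SIGKILLed for other reasons too.
OOM_MARKER = "OutOfMemory"
SIGKILL_EXIT_CODE = 137


def find_oom_containers(tasks: list[dict[str, Any]]) -> list[tuple[str, str]]:
    """Return (task ARN, container name) pairs that look like OOM kills."""
    suspects = []
    for task in tasks:
        for container in task.get("containers", []):
            reason = container.get("reason", "")
            exit_code = container.get("exitCode")
            if OOM_MARKER in reason or exit_code == SIGKILL_EXIT_CODE:
                suspects.append(
                    (task.get("taskArn", "?"), container.get("name", "?"))
                )
    return suspects


# Hypothetical sample data mimicking the "tasks" list of a describe_tasks
# response; real ARNs and container names will differ in your deployment.
sample_tasks = [
    {
        "taskArn": "arn:aws:ecs:example:task/abc",
        "containers": [
            {
                "name": "retool-workflows-backend",
                "exitCode": 137,
                "reason": "OutOfMemoryError: Container killed due to memory usage",
            }
        ],
    },
    {
        "taskArn": "arn:aws:ecs:example:task/def",
        "containers": [{"name": "retool-workflows-worker", "exitCode": 0}],
    },
]

print(find_oom_containers(sample_tasks))
```

If this flags the backend container, raising the task's memory allocation (as in the fix above, from 2 GiB to 4 GiB) is the obvious next thing to try.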
