- My goal: View and edit a Workflow
- Issue: Page spins forever and the entire Retool instance is non-responsive
- Retool version & hosting setup (Docker, K8s, cloud provider, etc.): Docker, 3.284.5
- Database is a Postgres server hosted on AWS RDS in the same VPC
After doing an upgrade to 3.284.5, and trying to view a workflow, my entire Retool instance slows down and eventually times out on every request.
Using the RDS tools, I was able to track down a single query that never completes but eats up 100% of my RDS server capacity:
SELECT * FROM workflow_run AS wr1 WHERE "workflowId" IN ('d591ecb0-d9f3-4e2d-8632-13718db9be91', '6316b538-9bd2-4daf-8baf-998d528ec9ac') AND id IN ( SELECT id FROM workflow_run AS wr2 WHERE wr1."workflowId" = wr2."workflowId" ORDER BY "createdAt" DESC LIMIT 10 ) ORDER BY "workflowId", "createdAt" DESC LIMIT 20
Connecting to the DB, I can’t even run EXPLAIN ANALYZE on that, it just sits and takes forever until I cancel it.
workflow_run has 201k rows in it. My retool instance is set to have a 30 day retention window of logs, but I’m not sure if that is related to this particular table.
Slightly adjusting the query by limiting the results in the first SELECT and running it in PSQL shows me a big performance issue:
EXPLAIN ANALYZE with wr1 as ( SELECT * FROM workflow_run AS wr1 WHERE "workflowId" IN ('d591ecb0-d9f3-4e2d-8632-13718db9be91', '6316b538-9bd2-4daf-8baf-998d528ec9ac') LIMIT 10 ) select * from wr1 WHERE id IN ( SELECT id FROM workflow_run AS wr2 WHERE wr1."workflowId" = wr2."workflowId" ORDER BY "createdAt" DESC LIMIT 10 ) ORDER BY "workflowId", "createdAt" DESC LIMIT 20
That takes about 400ms to even show me the plan. When I change the first “LIMIT 10” to “LIMIT 500”, it takes 8 seconds to show me the plan.
Without the limit I imposed for testing, that first SELECT returns 20k rows.
EXPLAIN ANALYZE tells me the table isn’t indexed properly.
Copying the database and testing it, I was able to repro the issue, and fix it by dropping one index and adding a new index:
Do not do this on prod:
DROP INDEX CONCURRENTLY workflow_run_createdAt_index;
CREATE INDEX CONCURRENTLY idx_workflowid_createdat_desc
ON workflow_run("workflowId", "createdAt" DESC);
After doing this and rerunning my EXPLAIN ANALYZE, everything is super snappy and runs in less than 15 milliseconds.
I’m concerned that during my upgrade, a database migration failed to run that would have that index.
I followed these upgrade steps, one by one, making sure everything looked normal before completing the next one.
