Solving for a knex error

eman31 · July 15, 2024, 2:16pm

I've been starting to get timeouts on a workflow for some reason and need help troubleshooting the issue.

The workflow is set up to:
(1) select a number of rows from a Retool-hosted database,
(2) use an ID from those rows to query an API and retrieve data, then
(3) update those same rows from (1) with the API data using a "Bulk update via a primary key" database query.

Everything has been running smoothly until last week, when the workflow started timing out on the (3) step. I've been getting vague "timeout" errors, like this one:

{"data":null,"metadata":{},"error":"Internal Error running a block: TypeError: Failed to fetch"}

After playing around with different (3) queries, I finally got something more detailed with a batched Loop block, which is giving me the following error:

Error evaluating updateViewcount (iteration 16): Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?

Based on other research I was doing to try to fix the issue (this and this), the error seems to suggest that the size of the update is the issue, but I'm unsure how to proceed from here if the Loop block isn't chunking the update properly.

I haven't been able to find any other information about the Knex error as it pertains to Retool. It seems like a general node.js issue that's documented elsewhere (here, for example), but I'm unsure how to apply this to Retool.

See below for my Loop settings:

Some other context that might be relevant: the amount of rows did not change from the day the workflow was working to the next day when it didn't, so I initially ruled out the issue as "too many rows being updated." I was able to align the timing to when I logged into my Retool instance on a slower computer than usual, so I thought maybe that was causing the timeout issue somehow.

That seemed far-fetched, although when I check the rows in the database, the data is being updated despite the workflow saying it failed, so I'm not sure what to make of that.

I haven't been able to log into a faster machine yet to test this theory definitively. It seems strange that a backend workflow would fail because of a slower local machine that is logged into the Retool instance... right?

Appreciate the help!

eman31 · July 24, 2024, 6:25pm

Update: this is definitely not the issue. Still getting the errors on a new, faster machine.

the data is being updated despite the workflow saying it failed, so I'm not sure what to make of that.

Verified this is the case, so seems like a phantom error? Not sure what's going on but would appreciate some help to figure out if these errors are to be trusted going forward.

Darren · July 30, 2024, 9:50pm

Hey @eman31! Thanks for reaching out. It's been a while since we've seen similar behavior but it's not unprecedented.

In broad strokes, each resource in your workflow is provisioned with a limited "pool" of connections and it seems that we're maxing it out. The goal, then, is to limit the number of simultaneous connections. For the sake of testing, you might want to use the "Sequential" execution mode just to see if we experience the same error. If that works, we can then adjust the batch size and delay to improve the overall performance of your query.

It would also be super helpful to know how much data you're working with - approximately how many records are there in transformViewData.data?

eman31 · August 2, 2024, 4:15pm

Thanks for the reply @Darren !

Just tested out the "Sequential" mode with a 5ms iteration delay and getting a "Internal Error running a block: TypeError: Failed to fetch" error.

The array that is the loop input has 762 objects to update with 4 keys/columns, including the primary id key

Darren · August 2, 2024, 5:05pm

All right, that's a much more standard - if not generic - error. It doesn't give us a lot of information, though.

I did realize the other day that it doesn't make a ton of sense to perform a bulk update operation in combination with a a loop block; we probably want to choose one or the other. For the sake of testing, can you change your logic to either update a single record within the loop or do a bulk update without the loop? That should help us to further narrow down what's happening here.

Also, is your data in RetoolDB still updating correctly? If that's the case, we could create a new global error handling block that would allow you to continue executing logic even if your query does throw this error. It's not ideal but would unblock you.

If you do go down this route, I'd definitely recommend checking the error message inside the handler in order to ensure you're still throwing all other errors correctly.

I'm going to spend a bit of time today trying to reproduce this behavior, as I have a hunch that there may be an underlying bug at play.

eman31 · August 5, 2024, 8:39pm

Ahh good shout. The original workflow had a single bulk update, which then started erroring out and prompted me to shift to the loop. But I didn't think to switch the type of update to a single row.

Just retested the bulk update and got a timeout error: "Timeout Error: The block run exceeded the timeout of 10000 ms. This timeout can be increased in the block's 'Settings'."

I then followed that instruction and updated the timeout to 30k, got the same error, then 60k, and that went through! Seems like it's taking just over 30secs to complete. Strange, because before it started failing, I didn't have many less than the same 762 records to update. So I'm not sure why the timeout would need to be 3x'd in order for it to run correctly.

I did the sequential loop with the single update afterwards just to see what happened and if it was a faster query. It's been about 5+ minutes since I ran it and it still seems to be running, so I'm going to stick with the bulk update and a high timeout setting for now. But will report back if anything happens with the loop block.

Re the error handling block: this update block was the end of the workflow, so I didn't need to run anything else after it. But I did have a global error block set up to catch issues with the update, which is how I caught this issue.

Darren · August 5, 2024, 9:26pm

Glad to hear it! Just for the sake of testing you can try the loop again with batching enabled - I'd start with a batch size of 10 or so - but if the single bulk update works and isn't taking too crazy long, feel free to stick with that.

eman31 · August 5, 2024, 9:39pm

The single update took 14 minutes so I'll play around. Thanks @Darren!