Workflow fails for no visible reason

We have a workflow that runs once per day and brings data from a source to a destination.
We chose to loop over the entities and upsert them to the destination, since the same entity can be brought up multiple days in a row.

At some point we started noticing that on some days the workflow fails, but we're not sure why, since no error messages are printed anywhere.

We set up 3 retry attempts per iteration and 3 retry attempts for the entire loop and that still didn't resolve it.

Would appreciate any thoughts or ideas around how we can find out the reason & prevent it from failing again.

Thanks!

Something similar is happening to me! I am triggering a Workflow from an App. The workflow runs fine, but the app shows me this:
[image]
And it's generating failures in the Mobile interface, even though the workflow is running with no problems.

Update
Solved for me: the workflow had a branch that sends a response to the App. The problem was that when the workflow ran a branch that didn't have a response, the App would wait for one and end with a timeout. Solved by adding a response for all cases.
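For anyone hitting the same thing, here is a rough Python sketch of the idea, not the actual workflow. The function name, the `action` field, and the response shape are all made up for illustration; the only point is that every branch ends in a response, so the caller never waits until it times out.

```python
def handle_trigger(payload: dict) -> dict:
    """Hypothetical handler: every branch returns a response to the caller."""
    if payload.get("action") == "sync":
        # Branch that already had a response.
        return {"status": "ok", "message": "sync started"}
    # A path like this used to return nothing, so the caller kept waiting
    # and eventually timed out. Now it also responds.
    return {"status": "ignored", "message": "no action taken"}
```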

Is there a chance that using a 3rd party service might help get some more logging info? Session replay integration

That's an example of how something similar could be added to a custom component for logging reasons. I'm still on vacation though, and am unable to see if there's a way to implement something similar in a workflow instead.

Hopefully this might give you a starting point.

This issue still happens to us unfortunately. We tried many different approaches but we can't understand what's the reason for failure, as it's not a timeout or an error in the process.
It just "drops" the process in the middle, kind of.

These are our recent runs. As you can see, it failed 3 times in a row, so we went ahead and triggered it manually, where it again failed twice, but on the third attempt it suddenly succeeded.
The data itself is identical in the last 3 triggers so we can't see any reason why it would fail the first 2 times.
The first run failed after 56.87 seconds, the second failed after 43.256 seconds, and the successful run took 101 seconds in total, so we can clearly see that the runs are dropped in the middle of the process...

[image]

I'm currently facing the same issue with workflows (specifically when I have AI blocks).
When I run blocks individually, everything works fine and nothing times out.
However, when I try running the whole workflow (from the app and the workflow bar), it just fails. I tried adding global error handling to debug the issue, but nothing shows up.

Any ideas?

Yep, this is exactly what I'm facing, and I found no possible way of debugging it or understanding why it fails.
The fact that the workflow is unstable is bad on its own; having no way to debug it or understand what makes it fail is terrible.

I hope someone from the support team somehow gets to this topic, since I had no luck when contacting them by mail.

@Shai-D I noticed that AI blocks are the ones failing without reason. I was able to partially achieve my goal by using pure Python along with the OpenAI library. The downside is that you won't be able to leverage the Vector store.

it could be a pain in the butt, but you could use the OpenAI Assistants API to create an Assistant w access to the 'Knowledge Retrieval' and 'Function Calling' tools to achieve the same thing. since in the background Retool Vectors uses the Embeddings API ur results shouldn't be much different, if at all, so u can easily switch back to the Retool AI resource if u ever decide to.

Hey @Shai-D, sorry to hear your workflow isn't working as expected! I'm an engineer on the team, I will follow up with you via email.

@nickfind @bobthebear

Unfortunately I am not even using any AI blocks or anything like that in my workflow. It's a purely deterministic workflow, and it's not even that complex :sweat:

Yuke has reached out to me by email. We will try to debug and understand what's going on there, and I guess one of us will update here if we have any smart things to add :slight_smile:

@nickfind @pedrowach @bobthebear

So apparently the issue was that the workflow ran out of memory and crashed :frowning:
We optimized the workflow to use less memory and generate fewer logs, and it now works.

Basically, in the past we were iterating over ~10k records and upserting them in a loop one by one.
Now we are batching them, 10 batches of 500 records at once, which makes it more efficient.
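In case it helps someone, the batching idea looks roughly like this in Python. This is only a sketch under assumptions, not our actual workflow code: `upsert_batch` is a hypothetical stand-in for whatever bulk upsert your destination supports, and the batch size of 500 is just the number mentioned above.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(records: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def upsert_batch(batch: Sequence[T]) -> int:
    """Hypothetical placeholder for one bulk upsert call to the destination.

    Here it only reports how many records it was given.
    """
    return len(batch)


def upsert_all(records: Sequence[T], batch_size: int = 500) -> int:
    """Upsert records batch by batch and return the total number processed."""
    total = 0
    for batch in chunked(records, batch_size):
        # One call per batch instead of one call per record.
        total += upsert_batch(batch)
    return total
```

With ~10k records this makes a few dozen calls instead of ~10k, and only one batch of results needs to be held per iteration.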

Well that's interesting and rather nice to know about. Just out of curiosity, were y'all able to find out how much memory is available?

Courtesy of the Retool Engineering team