Been using Retool workflows since late last year and have encountered many errors, most of which I understand the context of. The exception is this apparently new(?) one, which I've hit 3 times, all just today, all on Python code blocks. It does not happen on every run of the workflow, though. I just need some help understanding what it is, and how to reproduce and prevent it.
This is not an error that is typically expected, nor one that can necessarily be guarded against. Could you provide more information about the workflow runs, such as the times it occurred and whether you are using any experimental workflow features?
Not using anything experimental, as far as I know.
One occurred last night at 2024-01-29 10:40PM ET and another 2 in another workflow occurred this morning, both at 2024-01-30 11:06AM ET.
Both code blocks were Python, using the pandas, numpy and datetime packages, and were for cleaning data and extracting what was needed for the rest of the workflow.
the underlying problem isn't something I can help you fix, especially since @tanay-retool is here XD, but I think I can answer the question you asked. It's probably worth adding that I'm not really a server/web guy so the actual workings might differ.
let's look at NGINX and how it pertains to your situation. NGINX is a web server used for a bunch of stuff like reverse proxying, load balancing and serving web content. NGINX doesn't run Python itself; it usually sits in front of a WSGI server (Web Server Gateway Interface, e.g. Gunicorn or uWSGI), whose job is to run the Python application and hand its response back to NGINX.
so we now have the following flow of how NGINX works:
python resources -> WSGI Server -> NGINX Server -> your browser
next, let's get the official definition of a '502 Bad Gateway' response from IETF RFC 7231:
The 502 (Bad Gateway) status code indicates that the server, while acting as a gateway or proxy, received an invalid response from an inbound server it accessed while attempting to fulfill the request.
in a nutshell, NGINX is telling us the error happened somewhere AFTER it received the request. The error is not technically from NGINX but from something upstream of it, and NGINX can only report that something went wrong without knowing the specifics. Since you said you're using Python and the error message refers to a gateway, and the WSGI server is the gateway between NGINX and the Python code, we can guess the problem is coming from there... or at minimum, it's the best place to start looking
you might be able to implement a 'request retry' in the Python code blocks using Python's odd and unique for/else:
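to make the "which errors are worth retrying" part a bit more concrete, here's a tiny sketch. The set of retryable codes and the `describe` helper are my own assumptions for illustration (not anything Retool or NGINX defines); the status phrases come from Python's standard `http.HTTPStatus`:

```python
from http import HTTPStatus

# Assumption: gateway-style failures are usually transient and worth a retry.
RETRYABLE_STATUSES = frozenset({502, 503, 504})


def describe(status_code):
    """Return the standard phrase for a status code plus a retry hint."""
    phrase = HTTPStatus(status_code).phrase
    hint = "retry" if status_code in RETRYABLE_STATUSES else "do not retry"
    return f"{status_code} {phrase}: {hint}"


print(describe(502))
print(describe(404))
```

the idea is just that a 502 points at something behind the proxy being briefly unhappy, whereas something like a 404 won't fix itself no matter how many times you ask.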
for _ in range(MAX_RETRIES):
    try:
        # do questionable stuff here
        ...
    except SomeTransientError:
        # you probably want to sleep here for a bit first to pause between retries
        continue
    else:
        # no exception was raised, so stop retrying
        break
else:
    # the loop never hit 'break', i.e. every attempt failed
    raise PermanentError()
I think workflows have a max run time and if they do you'll wanna be extra careful with how long it sleeps and the max retries or you could end up erroring out for a completely different reason
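here's a rough sketch of the retry idea with a capped retry count, exponential backoff, and an overall time budget so the block doesn't blow past whatever run limit the workflow has. All the names (`TransientError`, `PermanentError`, `retry_with_budget`) and the default numbers are made up for illustration, not something Retool provides, and I haven't run this inside a workflow:

```python
import time


class TransientError(Exception):
    """Hypothetical: an error worth retrying (e.g. a flaky upstream call)."""


class PermanentError(Exception):
    """Hypothetical: raised once retries or the time budget are exhausted."""


def retry_with_budget(func, max_retries=3, base_delay=1.0, budget_seconds=30.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call func() until it succeeds, retries run out, or the budget is spent."""
    deadline = clock() + budget_seconds
    attempts = 0
    for attempt in range(max_retries):
        attempts = attempt + 1
        try:
            return func()
        except TransientError:
            # Exponential backoff: base_delay, 2x, 4x, ...
            delay = base_delay * (2 ** attempt)
            # Give up if this was the last try or sleeping would exceed the budget.
            if attempt == max_retries - 1 or clock() + delay > deadline:
                break
            sleep(delay)
    raise PermanentError(f"gave up after {attempts} attempt(s)")
```

you'd wrap whatever step is flaky, something like `result = retry_with_budget(lambda: load_rows())`, where `load_rows` is a stand-in for your own call. Keeping `budget_seconds` well under the workflow's time limit is the important bit.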
unfortunately this isn't really something I can test. I can't answer your other question (how to reproduce this) without spinning up a bunch of stuff myself, and even then all I could do is emulate the setup, which would only put more emphasis on where the problem could be rather than confirm it... which is also why I can't really test the workaround
Thank you so much for this! Makes a lot of sense, and I will test the workaround you proposed. Really appreciate it!