Significant Variation in Query Performance

pod2 · November 21, 2024, 6:23pm

Hi all,

We recently noticed that two of our more important queries were timing out during morning business hours (EST). Typically, the issue will resolve by 10:30 am and the queries will begin performing significantly faster. The cutoff seems to vary day to day, with some days taking longer than others.

We've done a good bit of testing on our end and have comfirmed there are no concurrent processes running on our end (PostgreSQL database on AWS) that might interfere.

I'm curious if anyone else has experienced anything like this, and if you eventually found a root cause.

Take a look at these stats from this morning:

Darren · December 5, 2024, 12:43am

Hi @pod2! Thanks for reaching out.

This is something that we've heard from a few different customers and identified in our own logs, as well. We recently updated the status page in order to reflect this finding, have rolled out a potential fix, and will continue monitoring everything!

pod2 · December 13, 2024, 3:57pm

Thanks @Darren - appreciate the attention to the issue!

We still see significant fluctuations in query times. I'll continue logging some instances, but we're ranging from "blink of an eye" to "40s timeout" on the same query, sometimes within a 15 minute window or less.

Darren · December 14, 2024, 2:05am

Appreciate the update! In general, do you still tend to see these fluctuations during morning business hours? I'll take a closer look at your org's backend logs to see if there's any discernible pattern!

pod2 · December 19, 2024, 4:20pm

Hi Darren,

That's what I have observed, but this may be biased as most of the flags come in earlier in the morning based on users logging in at start-of-business and attempting to use the insights. After a few failed attempts throughout the morning I think usage tapers off.

Is it possible for us to access similar logs to put together our own reporting on the issue?

Darren · December 20, 2024, 1:57am

You won't have access to the full backend logs, due to the fact that you're running a Cloud instance, but you can see essential user activity - including query runs and their responseTime - at https://your-org.retool.com/audit.

On my end, it looks like there might be a slight tendency for the timeouts you're seeing to fall during the first couple hours of the work day over there, but it's not perfectly consistent. The majority of them also seem to occur when running one query, in particular. There's another query with a significant number of timeouts, but it's a distant second.

Other than that, there's nothing that immediately jumps out as particularly meaningful. It appears the Retool backend successfully queries your backing database after ~200 milliseconds, waits ~30 seconds without a response, and then terminates the connection. I'll try to get another pair of eyes on this, just in case there's something I'm missing.