PostgreSQL query failing

We see random timeouts on PostgreSQL queries: most of the time the query returns in ~2s, but sometimes it takes upwards of 2 minutes and times out.

We access this database from many other interfaces and parts of our app, and we only see these timeouts when we query from Retool.

  • Goal: I want my postgres queries to consistently return without timing out.

  • Steps: I've re-run the queries and ~80% of the time they return in a normal amount of time (~5s) but 20% of the time they take 2min+.

  • Details: Any query that returns a fairly large amount of data shows this random timeout behavior (e.g., the query below returns 87k rows / 13MB of data).

Successful run (~5.6s)

Another example, even simpler / less data (68k rows, ~3MB) -- with the timeout set to 20s:

example failure (> 20s 3 retries):

example success (2.2s):

Attended office hours on 3/26 and talked with JoeB and others on the call.

Some additional details:

  • This timeout issue seems more likely on queries that return more data. I had one query that was exhibiting this behavior; when I changed it to return less data, the issue went away.

  • Lowering the timeout + adding retries seems to help. Once the query goes into this "timed out" state it seems unrecoverable, so giving up on that attempt and retrying has helped a lot to mitigate the issue. It still happens and sometimes takes a couple of attempts to succeed, but it definitely seems random when it happens.

  • We access this same Postgres instance from our external webapp (not Retool) and we don't see any timeouts similar to this.

  • Discussed possible connection pool issues. My confusion with that is that the behavior is very binary: either the query returns quickly (~2s) or it times out at over 120s. My expectation is that if it were waiting a while for connections it would sometimes return in ~30s, sometimes ~70s -- but I never saw this case occur.

    • My guess is maybe when it runs, if the connection pool is full it becomes unrecoverable and needs to run again to force it to check the connection pool again? And that's why retries help?
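
For anyone who wants to reproduce the "short timeout + retries" mitigation outside of Retool's query settings, here is a minimal Python sketch of the pattern. The names `run_with_retries` and `run_query` are hypothetical and only for illustration; the sketch assumes `run_query(timeout_s)` raises `TimeoutError` when an attempt exceeds its timeout. It is not how Retool implements retries internally.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retries(
    run_query: Callable[[float], T],  # hypothetical: runs one attempt with the given timeout
    timeout_s: float = 20.0,          # short per-attempt timeout instead of waiting 120s
    max_attempts: int = 4,            # initial try plus 3 retries
    backoff_s: float = 0.0,           # optional pause between attempts
) -> T:
    """Abandon a stuck attempt quickly and start a fresh one."""
    last_error: Optional[TimeoutError] = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Each attempt is a brand-new call, so a stuck attempt is never waited on.
            return run_query(timeout_s)
        except TimeoutError as exc:
            last_error = exc
            if attempt < max_attempts and backoff_s > 0:
                time.sleep(backoff_s)
    raise TimeoutError(
        f"query timed out on all {max_attempts} attempts"
    ) from last_error
```

If attempts failed independently about 20% of the time (an assumption; nothing above establishes independence), all four attempts would fail together with probability 0.2^4 = 0.16%. This only hides the symptom and does not explain why an attempt gets stuck in the first place.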