Initial setup fails: api pod crash-loop

I followed the simple setup instructions exactly.

Unfortunately, the api pod never gets past the initial step of checking the database migrations. It logs the following:

```
waiting for postgres:5432 without a timeout
postgres:5432 is available after 0 seconds
not untarring the bundle
[process service types] [ 'MAIN_BACKEND', 'DB_CONNECTOR', 'DB_SSH_CONNECTOR' ]
Failing checking database migrations
(node:17) UnhandledPromiseRejectionWarning: SequelizeConnectionError: password authentication failed for user "retool_internal_user"
    at /snapshot/retool_development/node_modules/sequelize/lib/dialects/postgres/connection-manager.js:182:24
    at Connection.connectingErrorHandler (/snapshot/retool_development/node_modules/pg/lib/client.js:194:14)
    at Connection.emit (events.js:314:20)
    at Socket.<anonymous> (/snapshot/retool_development/node_modules/pg/lib/connection.js:134:12)
    at Socket.emit (events.js:314:20)
    at addChunk (_stream_readable.js:297:12)
    at readableAddChunk (_stream_readable.js:272:9)
    at Socket.Readable.push (_stream_readable.js:213:10)
    at TCP.onStreamRead (internal/stream_base_commons.js:188:23)
(node:17) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see (rejection id: 1)
(node:17) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
```

Every time the api pod restarts, this corresponding error message appears in the postgres pod's log:

```
2022-07-29 14:34:59.020 UTC [110] LOG:  incomplete startup packet
2022-07-29 14:34:59.670 UTC [111] FATAL:  password authentication failed for user "retool_internal_user"
2022-07-29 14:34:59.670 UTC [111] DETAIL:  Password does not match for user "retool_internal_user".
	Connection matched pg_hba.conf line 95: "host all all all md5"
```

The api pod is clearly connecting to the postgres pod (given the error message that appears for each startup attempt).

Naturally it looks like the password is the problem, but I've checked the environment variables in both the postgres pod and the api pod, and the POSTGRES_PASSWORD is identical in both cases.
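In case it helps anyone reproduce the check, the comparison can be done from the command line; a minimal sketch, assuming the pods are managed by deployments named `api` and `postgres` (substitute your actual names from `kubectl get pods`):

```shell
# Print POSTGRES_PASSWORD from each pod so the two values can be compared
# side by side. The deployment names `api` and `postgres` are assumptions.
kubectl exec deploy/postgres -- printenv POSTGRES_PASSWORD
kubectl exec deploy/api -- printenv POSTGRES_PASSWORD
```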

I tried connecting to the postgres database both by port-forward and by exec in the pod. In both cases I am able to connect to the hammerhead_production database using retool_internal_user. The database is empty.
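For reference, the two connection paths I used look roughly like this (pod and service names are assumptions; adjust to your deployment):

```shell
# Option 1: forward the postgres service to a local port, then connect
# from the local machine with psql.
kubectl port-forward svc/postgres 5432:5432 &
psql -h localhost -p 5432 -U retool_internal_user -d hammerhead_production

# Option 2: run psql directly inside the postgres pod.
kubectl exec -it deploy/postgres -- \
  psql -U retool_internal_user -d hammerhead_production
```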

The only strange thing is that I can connect as retool_internal_user using any password or no password, even when I'm port-forwarding from my local terminal -- not sure if that's normal.
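(For what it's worth, being able to log in with any password usually means the connection is matching a `trust` rule in pg_hba.conf; with `trust`, the server never checks the password at all. One way to see which rules are active is the `pg_hba_file_rules` system view, available in Postgres 10+. A sketch, using the pod, user, and database names from this thread:)

```shell
# List the active pg_hba.conf rules; the auth_method column shows whether
# a trust rule could be matching. Names here are assumptions from the thread.
kubectl exec deploy/postgres -- psql -U retool_internal_user -d hammerhead_production \
  -c "SELECT line_number, type, database, user_name, address, auth_method FROM pg_hba_file_rules;"
```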

I tried tryretool/backend:2.92.9 and tryretool/backend:2.94.9 -- same error in both cases. I am running on GKE.

The setup looked so impressively simple!! But I've now spent more than an hour poking around trying to figure out where the problem is.

I hope I'm just missing something stupid...?

Thanks in advance for any help.

C. Hamer

I tried again from scratch with version 2.93.9 and it worked.

Hey @chamer!

Good to hear you were able to get things up and running from scratch with 2.93.9. Are you able to successfully update to 2.94.9 or 2.95.3?

The password issue sometimes occurs when the database is initialized and the password is then changed afterward in the retool-secrets.yaml file; in that case, the database password needs to be reset to whatever it was initialized as. We've also seen cases where a separate error is the real cause, so checking for additional error logs can help here as well. Not sure if either of those applies here though :thinking:
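If anyone lands here with the same symptom, a sketch of that reset (pod, user, and database names are taken from this thread; substitute your own, and use the password currently in retool-secrets.yaml):

```shell
# Re-align the database password with the value in retool-secrets.yaml.
# 'password-from-retool-secrets.yaml' is a placeholder, not a real value.
kubectl exec -it deploy/postgres -- psql -U retool_internal_user -d hammerhead_production \
  -c "ALTER USER retool_internal_user WITH PASSWORD 'password-from-retool-secrets.yaml';"
```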

Yes, I did upgrade it to 2.95.3.

I think my issue was just some error in the secrets.yaml file (the underlying issue being that I tried to do it at the end of the day on a Friday before a long weekend :stuck_out_tongue_winking_eye: ).

After starting from scratch the following Tuesday morning, everything has run fine without any issues.