New EC2 deployment jobs-runner_1 Exit 1

I'm deploying a new self-hosted Retool instance on EC2 and had it working for a day. This morning it went down. Running sudo docker-compose ps shows the following:

Name                                                     Command                        State   Ports
1cdbb12e40eb_retool-onpremise-master_code-executor_1     docker-entrypoint.sh bash ...  Up      3004/tcp
2217c7061ec5_retool-onpremise-master_workflows-worker_1  docker-entrypoint.sh bash ...  Up      3000/tcp, 3001/tcp, 3002/tcp
c0e856ff2d2a_retool-onpremise-master_https-portal_1      /init                          Up      0.0.0.0:443->443/tcp,:::443->443/tcp, 0.0.0.0:80->80/tcp,:::80->80/tcp
f40adf419f4c_retool-onpremise-master_postgres_1          docker-entrypoint.sh postgres  Up      5432/tcp
retool-onpremise-master_api_1                            docker-entrypoint.sh bash ...  Up      0.0.0.0:3000->3000/tcp,:::3000->3000/tcp, 3001/tcp, 3002/tcp
retool-onpremise-master_jobs-runner_1                    docker-entrypoint.sh bash ...  Exit 1
retool-onpremise-master_retooldb-postgres_1              docker-entrypoint.sh postgres  Up      5432/tcp
retool-onpremise-master_workflows-backend_1              docker-entrypoint.sh bash ...  Up      3000/tcp, 3001/tcp, 3002/tcp

When I run sudo docker-compose up -d, it restarts jobs-runner_1. I can see jobs-runner_1 is up when running sudo docker-compose ps, but after around 10 seconds it goes down again.

When I run sudo docker-compose logs --tail 100, the portion for jobs-runner_1 looks like this:
jobs-runner_1 | - EXITING RETOOL -
jobs-runner_1 | ---------------------------
jobs-runner_1 |
jobs-runner_1 | Error running database migrations: SequelizeHostNotFoundError: getaddrinfo ENOTFOUND postgres
jobs-runner_1 | wait-for-it.sh: waiting 15 seconds for postgres:5432
jobs-runner_1 | wait-for-it.sh: timeout occurred after waiting 15 seconds for postgres:5432
jobs-runner_1 | not untarring the bundle
jobs-runner_1 | sed: can't read ./dist/mobile/*.js: No such file or directory
jobs-runner_1 | {"level":"info","message":"[process service types] JOBS_RUNNER","timestamp":"2024-05-03T13:40:36.497Z"}
jobs-runner_1 | Failing checking database migrations

Any idea how I can correct this?

Hey @matthewej, from your logs it looks like the jobs-runner is going down because it is unable to connect to your Postgres DB. Are you using the default Postgres db packaged with Retool or are you externalizing your database? Have you made any changes at all to your instance since it was last working?
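The `getaddrinfo ENOTFOUND postgres` line in the logs indicates that the hostname `postgres` did not resolve at all, which is a different failure from Postgres refusing or dropping a connection. A minimal Python sketch for telling the two apart is shown here; the function name `check_tcp` and its `resolver`/`connector` parameters are hypothetical and exist only for illustration, and it assumes Python is available somewhere attached to the same Docker network as the failing service.

```python
import socket


def check_tcp(host, port, timeout=5.0,
              resolver=socket.getaddrinfo,
              connector=socket.create_connection):
    """Report how far a connection attempt to host:port gets.

    resolver and connector are injectable (hypothetical parameters)
    so the logic can be exercised without a real network.
    """
    # Step 1: name resolution. A failure here corresponds to the
    # "getaddrinfo ENOTFOUND" error reported by the Node.js backend.
    try:
        resolver(host, port)
    except socket.gaierror as exc:
        return f"dns_failure: {host!r} did not resolve ({exc})"

    # Step 2: TCP connect. A failure here means the name resolved but
    # nothing accepted the connection within the timeout.
    try:
        conn = connector((host, port), timeout=timeout)
    except OSError as exc:
        return f"connect_failure: {host}:{port} resolved but did not accept a connection ({exc})"

    conn.close()
    return f"ok: {host}:{port} is reachable"
```

Calling `print(check_tcp("postgres", 5432))` from a container on `backend-network` would then show whether the problem is name resolution or the database port itself.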

I'm using the default Postgres db. I haven't made any changes since the instance was first launched and running smoothly yesterday. I came in this morning and the jobs-runner was down.

Could you share some more details about the specs for your deployment (vCPU, memory)? Would you be okay with sharing your docker-compose.yml here? In your docker.env, do you have all the default values set for Postgres (DB name: hammerhead_production, etc.)?

I'm deployed on a t3.2xlarge EC2 instance (8 vCPU, 32 GB memory, 60 GB storage).

docker.env has all defaults except for DOMAINS and LICENSE_KEY.

These are the contents of the docker-compose.yml file:

version: "2"
services:
  api:
    build:
      context: ./
      dockerfile: Dockerfile
    env_file: ./docker.env
    environment:
      - DEPLOYMENT_TEMPLATE_TYPE=docker-compose
      - SERVICE_TYPE=MAIN_BACKEND,DB_CONNECTOR,DB_SSH_CONNECTOR
      - DBCONNECTOR_POSTGRES_POOL_MAX_SIZE=100
      - DBCONNECTOR_QUERY_TIMEOUT_MS=120000
      - WORKFLOW_BACKEND_HOST=http://workflows-backend:3000
      - CODE_EXECUTOR_INGRESS_DOMAIN=http://code-executor:3004
      # If using Retool-managed Temporal cluster, leave workflow-related ENV vars commented
      # If using self-managed cluster (Temporal Cloud or self-hosted) external to your Retool deployment, uncomment and update workflow-related ENV vars
      # Compare deployment options here: https://docs.retool/self-hosted/concepts/temporal#compare-options
      # - WORKFLOW_TEMPORAL_CLUSTER_FRONTEND_HOST=temporal
      # - WORKFLOW_TEMPORAL_CLUSTER_FRONTEND_PORT=7233
      # set these if using self-managed Temporal Cloud or require TLS for your self-managed Temporal cluster
      # - WORKFLOW_TEMPORAL_TLS_ENABLED
      # - WORKFLOW_TEMPORAL_TLS_CRT
      # - WORKFLOW_TEMPORAL_TLS_KEY
    networks:
      - frontend-network
      - backend-network
      - code-executor-network
    depends_on:
      - postgres
      - retooldb-postgres
      - jobs-runner
      - workflows-worker
      - code-executor
    command: bash -c "./docker_scripts/wait-for-it.sh postgres:5432; ./docker_scripts/start_api.sh"
    links:
      - postgres
    ports:
      - "3000:3000"
    restart: on-failure
    volumes:
      - ./keys:/root/.ssh
      - ./keys:/retool_backend/keys
      - ssh:/retool_backend/autogen_ssh_keys
      - ./retool:/usr/local/retool-git-repo
      - ${BOOTSTRAP_SOURCE:-./retool}:/usr/local/retool-repo

  jobs-runner:
    build:
      context: ./
      dockerfile: Dockerfile
    env_file: ./docker.env
    environment:
      - DEPLOYMENT_TEMPLATE_TYPE=docker-compose
      - SERVICE_TYPE=JOBS_RUNNER
    networks:
      - backend-network
    depends_on:
      - postgres
    command: bash -c "chmod -R +x ./docker_scripts; sync; ./docker_scripts/wait-for-it.sh postgres:5432; ./docker_scripts/start_api.sh"
    links:
      - postgres
    volumes:
      - ./keys:/root/.ssh

  workflows-worker:
    build:
      context: ./
      dockerfile: Dockerfile
    command: bash -c "./docker_scripts/wait-for-it.sh postgres:5432; ./docker_scripts/start_api.sh"
    env_file: ./docker.env
    environment:
      - DEPLOYMENT_TEMPLATE_TYPE=docker-compose
      - SERVICE_TYPE=WORKFLOW_TEMPORAL_WORKER
      - NODE_OPTIONS=--max_old_space_size=1024
      - DISABLE_DATABASE_MIGRATIONS=true
      - WORKFLOW_BACKEND_HOST=http://workflows-backend:3000
      - CODE_EXECUTOR_INGRESS_DOMAIN=http://code-executor:3004
      # If using Retool-managed Temporal cluster, leave workflow-related ENV vars commented
      # If using self-managed cluster (Temporal Cloud or self-hosted) external to your Retool deployment, uncomment and update workflow-related ENV vars
      # Compare deployment options here: https://docs.retool/self-hosted/concepts/temporal#compare-options
      # - WORKFLOW_TEMPORAL_CLUSTER_FRONTEND_HOST=temporal
      # - WORKFLOW_TEMPORAL_CLUSTER_FRONTEND_PORT=7233
      # set these if using self-managed Temporal Cloud or require TLS for your self-managed Temporal cluster
      # - WORKFLOW_TEMPORAL_TLS_ENABLED
      # - WORKFLOW_TEMPORAL_TLS_CRT
      # - WORKFLOW_TEMPORAL_TLS_KEY
    networks:
      - backend-network
      - code-executor-network
    restart: on-failure

  workflows-backend:
    build:
      context: ./
      dockerfile: Dockerfile
    env_file: ./docker.env
    environment:
      - DEPLOYMENT_TEMPLATE_TYPE=docker-compose
      - SERVICE_TYPE=WORKFLOW_BACKEND,DB_CONNECTOR,DB_SSH_CONNECTOR
      - WORKFLOW_BACKEND_HOST=http://workflows-backend:3000
      - CODE_EXECUTOR_INGRESS_DOMAIN=http://code-executor:3004
      - DBCONNECTOR_POSTGRES_POOL_MAX_SIZE=100
      - DBCONNECTOR_QUERY_TIMEOUT_MS=120000
    networks:
      - backend-network
      - code-executor-network
    depends_on:
      - postgres
      - retooldb-postgres
      - workflows-worker
      - code-executor
    command: bash -c "./docker_scripts/wait-for-it.sh postgres:5432; ./docker_scripts/start_api.sh"
    links:
      - postgres
    restart: on-failure
    volumes:
      - ./keys:/root/.ssh
      - ./keys:/retool_backend/keys
      - ssh:/retool_backend/autogen_ssh_keys
      - ./retool:/usr/local/retool-git-repo
      - ${BOOTSTRAP_SOURCE:-./retool}:/usr/local/retool-repo

  code-executor:
    build:
      context: ./
      dockerfile: CodeExecutor.Dockerfile
    command: bash -c "./start.sh"
    env_file: ./docker.env
    environment:
      - DEPLOYMENT_TEMPLATE_TYPE=docker-compose
      - NODE_OPTIONS=--max_old_space_size=1024
    networks:
      - code-executor-network
    # code-executor uses nsjail to sandbox code execution. nsjail requires
    # privileged container access.
    # If your deployment does not support privileged access, you can set this
    # to false to not use nsjail. Without nsjail, all code is run without
    # sandboxing within your deployment.
    privileged: true
    restart: on-failure

  # Retool's storage database. See these docs to migrate to an externally hosted database: https://docs.retool.com/docs/configuring-retools-storage-database
  postgres:
    image: "postgres:11.13"
    env_file: docker.env
    networks:
      - backend-network
    volumes:
      - data:/var/lib/postgresql/data

  retooldb-postgres:
    image: "postgres:14.3"
    env_file: retooldb.env
    networks:
      - backend-network
    volumes:
      - retooldb-data:/var/lib/postgresql/data

  # Not required, but leave this container to use nginx for handling the frontend & SSL certification
  https-portal:
    image: tryretool/https-portal:latest
    ports:
      - "80:80"
      - "443:443"
    links:
      - api
    restart: always
    env_file: ./docker.env
    environment:
      STAGE: "production" # <- Change 'local' to 'production' to use a LetsEncrypt signed SSL cert
      CLIENT_MAX_BODY_SIZE: 40M
      KEEPALIVE_TIMEOUT: 605
      PROXY_CONNECT_TIMEOUT: 600
      PROXY_SEND_TIMEOUT: 600
      PROXY_READ_TIMEOUT: 600
    networks:
      - frontend-network

networks:
  frontend-network:
  backend-network:
  code-executor-network:

volumes:
  ssh:
  data:
  retooldb-data:

I fixed this by doing a fresh install and externalizing the database to an RDS Postgres instance, following this documentation.

It's worth noting that I ran into another issue, where the Retool Docker containers could not connect to the RDS Postgres db:
"PostgreSQL 14.x and above major version use a newer JDBC driver which introduced the 'scram-sha-256' algorithm. This causes connections to fail due to a change made to the password authentication method used in the newer versions of PostgreSQL (scram-sha-256) whereby the client driver you are using only supports connecting via md5 passwords."

I had to add a parameter group to the RDS db instance that sets rds.force_ssl = 0 and password_encryption = md5. I found that solution here.
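For anyone who prefers to script that change rather than click through the console, here is a rough Python sketch using boto3's `modify_db_parameter_group`. The function names, the parameter group name, and the region in the usage note are hypothetical, and it assumes a custom parameter group is already attached to the instance and that AWS credentials are configured. It is only one way this might be done, not the exact steps from the post.

```python
def build_parameter_changes(apply_method="pending-reboot"):
    """Return the two overrides mentioned above in the shape boto3 expects.

    "pending-reboot" is used as a conservative default (an assumption);
    the values take effect after the instance is rebooted.
    """
    return [
        {
            "ParameterName": "rds.force_ssl",
            "ParameterValue": "0",
            "ApplyMethod": apply_method,
        },
        {
            "ParameterName": "password_encryption",
            "ParameterValue": "md5",
            "ApplyMethod": apply_method,
        },
    ]


def apply_parameter_changes(group_name, region_name):
    """Send the overrides to an existing custom DB parameter group."""
    # boto3 is a third-party package; it is imported here so the
    # builder above can be used and inspected without it installed.
    import boto3

    rds = boto3.client("rds", region_name=region_name)
    return rds.modify_db_parameter_group(
        DBParameterGroupName=group_name,
        Parameters=build_parameter_changes(),
    )
```

A call might look like `apply_parameter_changes("retool-pg14-params", "us-east-1")`, where both arguments are placeholders. Since `password_encryption` controls how a password is hashed at the time it is set, the database user's password probably needs to be set again after the change for it to be stored as md5.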

Thanks @matthewej for sharing your solution!