Retool Failing - Docker instance Disk Space and DB crash

Hi, I need some help please. I'm having difficulty keeping my Retool server alive. I have numerous scheduled tasks running to clear out logs just to try to keep disk space under control, but overnight I lost 49GB for no apparent reason and the server died. Let me share:

  1. The big problem area is /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384
    This folder belongs to the active volume retool-onpremise-master_data. The directory contains numerous 1GB files (32 of them), named 16690, 16690.1, 16690.2, etc., each 1GB in size. Last night the 16690.1, 16690.2 files all duplicated and became 16690.1.1, 16690.2.1 and so on, doubling the space used. If I dare delete any of these files, Docker doesn't start up. None of the prune tasks clears out this data. I've tried 100 different mechanisms and every one fails. (A rough sketch of how I'm trying to map these file names back to Postgres tables is just after this list.) Note: I use an external database for all my data; only Retool application data is stored on-prem in Postgres.

  2. Yesterday my problem was /var/lib/docker/overlay2. This directory also suddenly jumped in size from 3GB to 28GB. I ran all the possible prune tasks and no space was regained. Ultimately a docker-compose down followed by docker-compose up -d cleared the data.
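To try to work out which table those 1GB segment files belong to, this is roughly what I've been running inside the postgres container (the -U value is a placeholder, substitute the POSTGRES_USER from docker.env; 16690 is one of the file names above, and 0 means the default tablespace):

# map a file name under base/16384 back to the relation it belongs to
docker-compose exec postgres psql -U postgres -c "SELECT pg_filenode_relation(0, 16690);"

# list the biggest relations on disk together with their file names
docker-compose exec postgres psql -U postgres -c \
  "SELECT relname, relfilenode, pg_size_pretty(pg_total_relation_size(oid)) AS size
     FROM pg_class
    WHERE relkind IN ('r', 'i', 't', 'm')
    ORDER BY pg_total_relation_size(oid) DESC
    LIMIT 10;"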

I now have a daily scheduled docker-compose down and up, plus a Retool log clear and all the prunes hourly, to mitigate the second issue, but this isn't enough: the box still runs out of disk space no matter how much space I allocate. The frustrating part is that last night's cron to stop and start Docker appears to have triggered the issue described in point 1 today.
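For what it's worth, the checks I run between cleanups to see where the overlay2 growth is coming from are just standard Docker commands, nothing Retool-specific:

# per-container size: SIZE is the writable layer, "virtual" includes the image layers
docker ps -s

# break disk usage down by images, containers, local volumes and build cache
docker system df -v

# the hourly prunes are along these lines (neither touches named volumes)
docker system prune -af
docker builder prune -af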

To add: my docker-compose.yml does have this for each service:
logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"

I've also set the following environment variables:
DISABLE_MEMORY_AND_CPU_USAGE_LOGGING=true
LOG_LEVEL=info
DEBUG=0
DISABLE_AUDIT_TRAILS_LOGGING=true

I hope this assists: running the command below shows that all of the files in question are actually in use.
lsof +D /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384

It does appear to be Postgres related. If so, is there some way to optimise, clear, or compress this data?

Below are just some examples:
ubuntu@auto-retail:~/retool-onpremise-master$ sudo lsof +D /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
postgres 12169 lxd 4u REG 259,1 8192 855696 /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384/2601
postgres 12169 lxd 5u REG 259,1 122880 855668 /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384/2663
postgres 12169 lxd 7u REG 259,1 1155072 855760 /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384/1249
postgres 12169 lxd 12u REG 259,1 16384 855800 /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384/2656
postgres 12169 lxd 13u REG 259,1 114688 855851 /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384/2604
postgres 12169 lxd 14u REG 259,1 106496 855695 /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384/2701
postgres 12169 lxd 15u REG 259,1 188416 855795 /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384/2620
postgres 12169 lxd 16u REG 259,1 16384 855619 /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384/2605
postgres 12169 lxd 17u REG 259,1 8192 855929 /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384/16931
postgres 12169 lxd 18u REG 259,1 139264 855863 /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384/2610
postgres 12169 lxd 19u REG 259,1 16384 855932 /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384/16937
postgres 12169 lxd 20u REG 259,1 196608 855685 /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384/2659
postgres 12169 lxd 21u REG 259,1 16384 855933 /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384/16938
postgres 12169 lxd 22u REG 259,1 16384 855934 /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384/16939
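
On the optimise/clear/compress question above, the only standard Postgres options I'm aware of are VACUUM and VACUUM FULL, so this is what I'm planning to try against the container's own database (again, the -U value is a placeholder for the POSTGRES_USER in docker.env):

# reclaim dead-tuple space without rewriting tables (no exclusive locks)
docker-compose exec postgres psql -U postgres -c "VACUUM (VERBOSE, ANALYZE);"

# last resort: rewrites every table, takes exclusive locks and needs roughly as much
# free disk as the table being rewritten, so only in a quiet window
docker-compose exec postgres psql -U postgres -c "VACUUM FULL VERBOSE;"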

Another update, with no changes made in the app: the server was idling and suddenly disk space shot up.

76G /var
72G /var/lib
71G /var/lib/docker
44G /var/lib/docker/overlay2
27G /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384
27G /var/lib/docker/volumes/retool-onpremise-master_data/_data/base
27G /var/lib/docker/volumes/retool-onpremise-master_data/_data
27G /var/lib/docker/volumes/retool-onpremise-master_data
27G /var/lib/docker/volumes
7.6G /var/lib/docker/overlay2/d60226d82459a04e6d945191c988e7702487a09f2683ab1ae31e9240fdf2274e
7.6G /var/lib/docker/overlay2/84e1dbbed38c3582ca3846e01654e84579f8bd0f22f8d790941b54ad47bab99f
7.6G /var/lib/docker/overlay2/82488e4961e5d2cbf5cb6d7ef67c27ed08ff0c8d7851aa683ac721b2a2b6fb17
7.6G /var/lib/docker/overlay2/4fd8f2121bbbd9badad1285357f85c31bc3bb12e57b740221bdee2a0a09092de
7.6G /var/lib/docker/overlay2/3af23bb68731052b1004f04aac91411a58a218a631c9a7dfddeba1b439884a45
3.8G /var/log
3.8G /var/lib/docker/overlay2/d60226d82459a04e6d945191c988e7702487a09f2683ab1ae31e9240fdf2274e/merged
3.8G /var/lib/docker/overlay2/d60226d82459a04e6d945191c988e7702487a09f2683ab1ae31e9240fdf2274e/diff
3.8G /var/lib/docker/overlay2/84e1dbbed38c3582ca3846e01654e84579f8bd0f22f8d790941b54ad47bab99f/merged
3.8G /var/lib/docker/overlay2/84e1dbbed38c3582ca3846e01654e84579f8bd0f22f8d790941b54ad47bab99f/diff
3.8G /var/lib/docker/overlay2/82488e4961e5d2cbf5cb6d7ef67c27ed08ff0c8d7851aa683ac721b2a2b6fb17/merged

I brought Docker down and up again.

57G /var
53G /var/lib
51G /var/lib/docker
27G /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384
27G /var/lib/docker/volumes/retool-onpremise-master_data/_data/base
27G /var/lib/docker/volumes/retool-onpremise-master_data/_data
27G /var/lib/docker/volumes/retool-onpremise-master_data
27G /var/lib/docker/volumes
25G /var/lib/docker/overlay2
3.8G /var/log
3.8G /var/lib/docker/overlay2/f2d777def1cb2b2ac02e04b817800235c4b3e2fe6e6efbfad3e03c438c5a9304
3.8G /var/lib/docker/overlay2/27fd517b20ab153d76184396c2c07bb3c70359d814411d37d70d877dcf749541
3.8G /var/lib/docker/overlay2/19b8aa991857cb86c52fb2517d963139969f644935b17207aff2922efa015848
3.8G /var/lib/docker/overlay2/0ece15f62bd85df3b6552262f05179a689294a701f27af850c66c0ceba543737
3.8G /var/lib/docker/overlay2/0964c69d749e984977075d2102f7dcee1d9a306affd22d2312c36e98278a5513
3.7G /var/lib/docker/overlay2/f2d777def1cb2b2ac02e04b817800235c4b3e2fe6e6efbfad3e03c438c5a9304/merged
3.7G /var/lib/docker/overlay2/27fd517b20ab153d76184396c2c07bb3c70359d814411d37d70d877dcf749541/merged
3.7G /var/lib/docker/overlay2/19b8aa991857cb86c52fb2517d963139969f644935b17207aff2922efa015848/merged
3.7G /var/lib/docker/overlay2/0ece15f62bd85df3b6552262f05179a689294a701f27af850c66c0ceba543737/merged
3.7G /var/lib/docker/overlay2/0964c69d749e984977075d2102f7dcee1d9a306affd22d2312c36e98278a5513/merged

I need help here; I'm sure this isn't normal. Why are these containers suddenly doubling in size out of the blue?

New frustrating problem: after running perfectly fine, Retool just stopped. I'm seeing this in the Docker logs:

workflows-worker_1 | original: error: "base/16384" is not a valid data directory
api_1 | parent: error: "base/16384" is not a valid data directory
postgres_1 | 2024-08-13 22:35:24.561 UTC [80] FATAL: "base/16384" is not a valid data directory
postgres_1 | 2024-08-13 22:35:24.561 UTC [80] DETAIL: File "base/16384/PG_VERSION" does not contain valid data.
postgres_1 | 2024-08-13 22:35:24.561 UTC [80] HINT: You might need to initdb.

I did make an app change, and as a result it seems the Postgres DB crashed. I need some help repairing this mess.
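
Based on the HINT in that log, the first things I'm doing before touching anything are checking the file it complains about and taking a copy of the volume (the backup destination is just my choice):

# stop the stack before touching the volume
docker-compose down

# this file should contain only the Postgres major version, e.g. "11"
sudo cat /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384/PG_VERSION

# full copy of the volume before attempting any repair or restore
sudo cp -a /var/lib/docker/volumes/retool-onpremise-master_data /root/retool-pg-volume-backup-$(date +%F)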

Hey @kowuspelser - thanks for reaching out and for your patience. I can certainly understand any frustration you might be feeling, but hopefully we can figure this out. :+1:

A couple clarifying questions to help me understand your architecture:

  • Which version of Retool are you currently running?
  • Is it running locally or on cloud infrastructure? If the latter, which platform?
  • What does the command docker info return?
  • What does the command docker-compose -v return?
  • Are you able to share your docker-compose.yml here?

I'm fairly sure that this is an environment issue with the docker engine and not necessarily a Retool issue, but let's dig into it.

  • Which version of Retool are you currently running? - Retool version 3.52.16

  • Is it running locally or on cloud infrastructure? If the latter, which platform? - Onprem on Ubuntu

  • What does the command docker info return?

Client: Docker Engine - Community
Version: 27.0.3
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.15.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.28.1
Path: /usr/libexec/docker/cli-plugins/docker-compose

Server:
Containers: 8
Running: 8
Paused: 0
Stopped: 0
Images: 5
Server Version: 27.0.3
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 2bf793ef6dc9a18e00cb12efb64355c2c9d5eb41
runc version: v1.1.13-0-g58aa920
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.5.0-1023-aws
Operating System: Ubuntu 22.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.43GiB
Name: auto-retail
ID: 25ee7eaf-bcf2-4296-8f4c-b35c778ed30e
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false

  • What does the command docker-compose -v return?

docker-compose version 1.29.0, build 07737305

  • Are you able to share your docker-compose.yml here?
version: '2'
services:
  api:
    build:
      context: ./
      dockerfile: Dockerfile
    env_file: ./docker.env
    environment:
      - SERVICE_TYPE=MAIN_BACKEND
      - DB_CONNECTOR_HOST=http://db-connector
      - DB_CONNECTOR_PORT=3002
      - DB_SSH_CONNECTOR_HOST=http://db-ssh-connector
      - DB_SSH_CONNECTOR_PORT=3002
      - WORKFLOW_BACKEND_HOST=http://api:3000

    networks:
      - frontend-network
      - backend-network
      - db-connector-network
      - db-ssh-connector-network
      - workflows-network
    depends_on:
      - postgres
      - retooldb-postgres
      - db-connector
      - db-ssh-connector
      - jobs-runner
      - workflows-worker
    command: bash -c "./docker_scripts/wait-for-it.sh postgres:5432; ./docker_scripts/start_api.sh"
    links:
      - postgres
    ports:
      - '3000:3000'
    restart: on-failure
    volumes:
      - ./keys:/root/.ssh
      - ssh:/retool_backend/autogen_ssh_keys
      - ./retool:/usr/local/retool-git-repo
      - ${BOOTSTRAP_SOURCE:-./retool}:/usr/local/retool-repo

  jobs-runner:
    build:
      context: ./
      dockerfile: Dockerfile
    env_file: ./docker.env
    environment:
      - SERVICE_TYPE=JOBS_RUNNER
    networks:
      - backend-network
    depends_on:
      - postgres
    command: bash -c "chmod -R +x ./docker_scripts; sync; ./docker_scripts/wait-for-it.sh postgres:5432; ./docker_scripts/start_api.sh"
    links:
      - postgres
    volumes:
      - ./keys:/root/.ssh
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

  db-connector:
    build:
      context: ./
      dockerfile: Dockerfile
    env_file: ./docker.env
    environment:
      - SERVICE_TYPE=DB_CONNECTOR_SERVICE
      - DBCONNECTOR_POSTGRES_POOL_MAX_SIZE=100
      - DBCONNECTOR_QUERY_TIMEOUT_MS=120000
    networks:
      - db-connector-network
    restart: on-failure

  db-ssh-connector:
    build:
      context: ./
      dockerfile: Dockerfile
    command: bash -c "./docker_scripts/generate_key_pair.sh; ./docker_scripts/start_api.sh"
    env_file: ./docker.env
    environment:
      - SERVICE_TYPE=DB_SSH_CONNECTOR_SERVICE
      - DBCONNECTOR_POSTGRES_POOL_MAX_SIZE=100
      - DBCONNECTOR_QUERY_TIMEOUT_MS=120000
    networks:
      - db-ssh-connector-network
    volumes:
      - ./docker_scripts:/app/docker_scripts
      - ssh:/retool_backend/autogen_ssh_keys
      - ./keys:/retool_backend/keys
    restart: on-failure

  workflows-worker:
    build:
      context: ./
      dockerfile: Dockerfile
    command: bash -c "./docker_scripts/wait-for-it.sh postgres:5432; ./docker_scripts/start_api.sh"
    env_file: ./docker.env
    environment:
      - SERVICE_TYPE=WORKFLOW_TEMPORAL_WORKER
      - NODE_OPTIONS=--max_old_space_size=1024
      - DISABLE_DATABASE_MIGRATIONS=true

    networks:
      - backend-network
      - db-connector-network
      - workflows-network
    restart: on-failure

  postgres:
    image: 'postgres:11.13'
    env_file: docker.env
    networks:
      - backend-network
      - db-connector-network
    volumes:
      - data:/var/lib/postgresql/data

  retooldb-postgres:
    image: "postgres:14.3"
    env_file: retooldb.env
    networks:
      - backend-network
      - db-connector-network
    volumes:
      - retooldb-data:/var/lib/postgresql/data

  nginx:
    image: nginx:latest
    extra_hosts:
    - "host.docker.internal:host-gateway"
    ports:
      - '80:80'
      - '443:443'
    command: [nginx-debug, "-g", "daemon off;"] 
    volumes:
      - ./nginx:/etc/nginx/conf.d
      - /etc/nginx/certs:/etc/nginx/certs

    links:
      - api
    depends_on:
      - api
    restart: always
    env_file: ./docker.env
    environment:
      STAGE: 'production' 
      CLIENT_MAX_BODY_SIZE: 40M
      KEEPALIVE_TIMEOUT: 605
      PROXY_CONNECT_TIMEOUT: 600
      PROXY_SEND_TIMEOUT: 600
      PROXY_READ_TIMEOUT: 600
    networks:
      - frontend-network

networks:
  frontend-network:
    driver: bridge
  backend-network:
  db-connector-network:
  db-ssh-connector-network:
  workflows-network:

volumes:
  ssh:
  data:
  retooldb-data:


Thanks for sharing this information! One of the things that jumps out as potentially problematic is the version mismatch between the standalone docker-compose binary and the Compose plugin packaged with Docker Engine. The next time you bring your containers up, try the command docker compose up -d (without the hyphen). This should ensure that you're using the more recent version.
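
For example, comparing the two on your host should show the mismatch (the standalone binary reports 1.29.0, while the plugin bundled with your Docker Engine reports v2.28.1):

# standalone Compose v1 binary, installed separately
docker-compose -v

# Compose v2 plugin that ships with Docker Engine - use this one going forward
docker compose version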

To get a little more insight into which tables are growing so large, can you connect to your postgres container and execute the following query:

SELECT
  schemaname AS schema_name,
  tablename AS table_name,
  pg_size_pretty(pg_total_relation_size(quote_ident(schemaname) || '.' || quote_ident(tablename))) AS total_size
FROM
  pg_tables
WHERE 
  schemaname = 'public'
ORDER BY
  pg_total_relation_size(quote_ident(schemaname) || '.' || quote_ident(tablename)) DESC
LIMIT
  10;
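
If it's easier to run from the host, something along these lines should work (substitute the POSTGRES_USER and POSTGRES_DB values from your docker.env; postgres is just a placeholder for both here):

docker compose exec postgres psql -U postgres -d postgres -c "
  SELECT schemaname AS schema_name,
         tablename AS table_name,
         pg_size_pretty(pg_total_relation_size(quote_ident(schemaname) || '.' || quote_ident(tablename))) AS total_size
    FROM pg_tables
   WHERE schemaname = 'public'
   ORDER BY pg_total_relation_size(quote_ident(schemaname) || '.' || quote_ident(tablename)) DESC
   LIMIT 10;"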

Thanks!