Hi, I need some help please. I'm having difficulty keeping my Retool server alive. I have numerous scheduled tasks running to clear out logs and keep disk space under control, but overnight I lost 49GB for no apparent reason and the server died. Let me share the details:
The big problem area is /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384
This folder seems to belong to the active volume retool-onpremise-master_data. The directory holds numerous 1GB files (32 of them), named 16690, 16690.1, 16690.1.1, etc. Last night 16690.1, 16690.2 and so on were all duplicated as 16690.1.1, 16690.2.1, etc., doubling the space used. If I delete any of these files, Docker doesn't start up. None of the prune tasks clears out this data. I've tried countless approaches and every one fails. Note that I use an external database for all my data; only Retool application data is stored on-prem in Postgres.
Yesterday my problem was /var/lib/docker/overlay2. This directory suddenly also jumped in size from 3GB to 28GB. I ran all the possible prune tasks and no space was regained. Ultimately a docker-compose down and docker-compose up -d cleared the data.
I now have a daily scheduled task that runs docker-compose down and up and clears the Retool logs, plus all the prunes running hourly, to mitigate the second issue. This isn't enough, though: the box still runs out of disk space no matter how much I allocate. The frustrating part is that last night's cron job to stop and start Docker appears to have triggered the first issue described above.
To add: my docker-compose.yml has this for each service:
logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"
I've also set the following environment variables:
DISABLE_MEMORY_AND_CPU_USAGE_LOGGING=true
LOG_LEVEL=info
DEBUG=0
DISABLE_AUDIT_TRAILS_LOGGING=true
I hope this helps. Running the command below suggests that all of the files in question are actually in use:
lsof +D /var/lib/docker/volumes/retool-onpremise-master_data/_data/base/16384
It does appear to be postgres related? If so is there some way to optimise, clear, compress this data?
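For context, PostgreSQL stores each table or index under base/<database OID>/ in files named after the relation's filenode, and splits anything over 1GB into segments with numeric suffixes. So the files beginning with 16690 most likely all belong to a single large relation. A minimal Python sketch, assuming it is run on the Docker host with read access to the volume, that totals the space per filenode (the function names here are illustrative, not part of Retool or Postgres):

```python
import os
import re
import sys
from collections import defaultdict

# Relation file names look like <filenode>[_fsm|_vm|_init][.<segment>...].
# Trailing numeric suffixes are all treated as segments of the same filenode.
_RELFILE = re.compile(r"^(\d+)(?:_(?:fsm|vm|init))?(?:\.\d+)*$")


def size_by_filenode(directory):
    """Return {filenode: total_bytes} for relation files found in directory."""
    totals = defaultdict(int)
    for entry in os.scandir(directory):
        match = _RELFILE.match(entry.name)
        if match and entry.is_file():
            totals[int(match.group(1))] += entry.stat().st_size
    return dict(totals)


def largest(totals, n=5):
    """Return the n (filenode, bytes) pairs using the most space, largest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python filenodes.py /path/to/_data/base/16384
    for filenode, size in largest(size_by_filenode(sys.argv[1])):
        print(f"{filenode}\t{size / 1024**3:.2f} GiB")
```

A filenode reported here can be mapped back to a table name from inside that database with SELECT pg_filenode_relation(0, 16690); where 0 stands for the database's default tablespace. This only identifies which relation is growing; it does not reclaim any space.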
New frustrating problem: after running perfectly fine, Retool just stopped. I'm seeing this in the docker logs:
workflows-worker_1 | original: error: "base/16384" is not a valid data directory
api_1 | parent: error: "base/16384" is not a valid data directory
postgres_1 | 2024-08-13 22:35:24.561 UTC [80] FATAL: "base/16384" is not a valid data directory
postgres_1 | 2024-08-13 22:35:24.561 UTC [80] DETAIL: File "base/16384/PG_VERSION" does not contain valid data.
postgres_1 | 2024-08-13 22:35:24.561 UTC [80] HINT: You might need to initdb.
I did make an app change, and as a result it seems the Postgres DB crashed. I need some help repairing this mess.
Hey @kowuspelser - thanks for reaching out and for your patience. I can certainly understand any frustration you might be feeling, but hopefully we can figure this out.
A couple clarifying questions to help me understand your architecture:
Which version of Retool are you currently running?
Is it running locally or on cloud infrastructure? If the latter, which platform?
What does the command docker info return?
What does the command docker-compose -v return?
Are you able to share your docker-compose.yml here?
I'm fairly sure that this is an environment issue with the docker engine and not necessarily a Retool issue, but let's dig into it.
Thanks for sharing this information! One of the things that jumps out as potentially problematic is the version mismatch between the local docker-compose version and the one packaged with Docker Engine. The next time you bring your containers up, try the command docker compose up -d. This should ensure that we're using the more recent version.
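To confirm which Compose versions are present on the host, the two commands can be compared side by side. A small Python sketch, assuming both are looked up on the PATH (the helper name tool_version is made up for illustration):

```python
import subprocess


def tool_version(command):
    """Run a version command and return its output, or None if unavailable."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, check=False, timeout=10
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Binary is not installed, or it did not answer in time.
        return None
    if result.returncode != 0:
        # For example, docker is installed but the compose plugin is not.
        return None
    return result.stdout.strip() or None


if __name__ == "__main__":
    # Standalone docker-compose binary versus the plugin shipped with Docker Engine.
    print("docker-compose:", tool_version(["docker-compose", "-v"]))
    print("docker compose:", tool_version(["docker", "compose", "version"]))
```

If the two lines report different versions, the containers are being managed by whichever command the scheduled tasks call, so it is worth using the same one everywhere.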
To get a little more insight into which tables are growing so large, can you connect to your postgres container and execute the following query:
SELECT
  schemaname AS schema_name,
  tablename AS table_name,
  pg_size_pretty(pg_total_relation_size(quote_ident(schemaname) || '.' || quote_ident(tablename))) AS total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(quote_ident(schemaname) || '.' || quote_ident(tablename)) DESC
LIMIT 10;