Scheduling failure: Workflows failing to run on Temporal cloud

Ben01 · October 15, 2024, 5:32pm

Summary: Every workflow run fails with "Scheduling failure" after 10 seconds.

I have a self-hosted (docker-compose) Retool organisation and a subscription to Temporal Cloud.

My suspicion is that the mTLS settings are not correct.

I have followed Temporal's instructions on generating certs, giving me:

ca.crt
ca.key
client-and-worker.crt
client-and-worker.key

I have set the following env vars in ALL containers (to eliminate uncertainty):

WORKFLOW_TEMPORAL_SERVER_ROOT_CA_CRT → base64 encoded ca.crt
WORKFLOW_TEMPORAL_TLS_CRT → base64 encoded client-and-worker.crt
WORKFLOW_TEMPORAL_TLS_KEY → base64 encoded client-and-worker.key
WORKFLOW_TEMPORAL_CLUSTER_NAMESPACE=xxx.yyy (my namespace)
WORKFLOW_TEMPORAL_CLUSTER_FRONTEND_HOST=xxx.yyy.tmprl.cloud
WORKFLOW_TEMPORAL_CLUSTER_FRONTEND_PORT=7233
WORKFLOW_TEMPORAL_TLS_ENABLED=true

My namespace (in temporal) has the full raw ca.crt in its CA section.

I can successfully connect / telnet from the workflows-worker container to the tmprl.cloud frontend-host on port 7233

Ultimately, I cannot get any combination of vars that will allow me to actually run a workflow. Every time I try, I end up with a timeout - visible in the UI (see attached) and in the logs for the workflows-worker container.

Ran into error when scheduling UpdateTimedOutWorkflows Error: Failed to connect before the deadline

I suspect this is down to mTLS issues (wrong key format / wrong key type / wrong encoding - etc)
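A minimal sketch for ruling out the encoding and key-format possibilities, assuming a `base64` that accepts `-d` (GNU coreutils does). It decodes a base64 value and looks at the first PEM line, which says whether the content is a certificate, a pkcs8 key, a pkcs1 RSA key, or an EC key. The function names `pem_header` and `describe_pem` are made up for illustration.

```shell
# Hypothetical helper: decode a base64 value and print the first line
# of the PEM it contains (carriage returns stripped).
pem_header() {
  printf '%s' "$1" | base64 -d 2>/dev/null | head -n 1 | tr -d '\r'
}

# Hypothetical helper: map the PEM header to a rough description.
describe_pem() {
  case "$(pem_header "$1")" in
    "-----BEGIN CERTIFICATE-----")     echo "x509 certificate" ;;
    "-----BEGIN PRIVATE KEY-----")     echo "pkcs8 private key (unencrypted)" ;;
    "-----BEGIN RSA PRIVATE KEY-----") echo "pkcs1 rsa private key" ;;
    "-----BEGIN EC PRIVATE KEY-----")  echo "sec1 ec private key" ;;
    *)                                 echo "not a recognised PEM (bad base64 or wrong file?)" ;;
  esac
}
```

Inside a container this could be run as `describe_pem "$WORKFLOW_TEMPORAL_TLS_KEY"`. It only inspects the header line, so it cannot tell whether the key actually matches the certificate.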

Finally: I have spun up a docker ubuntu container on the same box and run the temporal hello-world with the same values. This successfully connects a worker.

Could anybody shed any light on what might be going on here?

Ben01 · October 16, 2024, 12:18pm

UPDATE: I've managed to "fix" this. Unfortunately, I didn't do it particularly scientifically. All info is included below.

Ok - after piecing together info from various sources, I had a pretty clear picture that this was mTLS related - in particular by enabling:

GRPC_VERBOSITY=DEBUG
GRPC_TRACE=all

I could see that requests to my temporal cluster were terminating with "connection closed with error unable to get local issuer certificate"
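With `GRPC_TRACE=all` the log output gets very noisy, so a filter helps. A rough sketch, assuming the logs are piped in on stdin; the function name `tls_log_lines` and the keyword list are my own guesses at what is worth matching, not anything from the gRPC docs.

```shell
# Hypothetical helper: keep only log lines that mention certificate,
# handshake, SSL or TLS (case-insensitive). Reads from stdin.
tls_log_lines() {
  grep -i -E 'certificate|handshake|ssl|tls'
}
```

Something like `docker compose logs workflows-worker 2>&1 | tls_log_lines` would then show only the relevant lines, assuming the compose service is named after the workflows-worker container.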

So I went back to the temporal docs, in particular this page: Certificate management - Temporal Cloud feature guide | Temporal Platform Documentation

That page contains two options for generating certs: (1) using tcld and (2) using certstrap.

However, the tcld certs don't seem to work when provided as values for the CRT/KEY env vars. They are also much smaller in byte size than the certstrap certs (tcld gives a 384-bit ECDSA key, certstrap a 2048-bit RSA key).

Also, the tcld certs seem to have the same DN data in the CA and client certs, which is explicitly called out as an issue in the docs. When calling certstrap, I included --ou retool when requesting the cert so that the client DN differs from the CA's.
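To compare the DN data of the two certs side by side, something like the following should do, assuming `openssl` is installed. It prints the subject and expiry date of each file passed in. The function name `cert_summary` is made up for illustration.

```shell
# Hypothetical helper: print the subject (DN) and expiry date of each
# PEM certificate file given as an argument.
cert_summary() {
  for crt in "$@"; do
    echo "== $crt"
    openssl x509 -in "$crt" -noout -subject -enddate
  done
}
```

For example, `cert_summary ca.crt client-and-worker.crt` would show whether the two subjects are identical and which cert expires first.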

Also (again, unscientific - ymmv), the certstrap instructions on that page suggest running the following:
openssl pkcs8 -topk8 -inform PEM -outform PEM -in <out/your-namespace.key> -out <out/your-namespace.pkcs8.key> -nocrypt
to generate a pkcs8 key instead of pkcs1, which I used this time (successfully).

In short, the things that worked:

used certstrap not tcld
added --ou retool to the ./certstrap request-cert call
made the expiry of the client cert strictly less than the expiry of the CA cert
converted the client key to pkcs8
converted both key and crt to base64 (cat | base64 -w0) and set them as the values of the two env vars:
- WORKFLOW_TEMPORAL_TLS_CRT
- WORKFLOW_TEMPORAL_TLS_KEY
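The encoding step above can be sketched roughly as follows. This assumes the cert and the pkcs8 key already exist as PEM files; the function names `b64_oneline` and `retool_tls_env` are made up for illustration. It uses `tr -d '\n'` rather than `-w0` because `-w0` is specific to GNU coreutils.

```shell
# Hypothetical helper: base64-encode a file onto a single line.
# Equivalent in effect to `base64 -w0 < file` on GNU coreutils.
b64_oneline() {
  base64 < "$1" | tr -d '\n'
}

# Hypothetical helper: print the two env var assignments.
# $1 = client certificate (PEM), $2 = client key converted to pkcs8 (PEM)
retool_tls_env() {
  printf 'WORKFLOW_TEMPORAL_TLS_CRT=%s\n' "$(b64_oneline "$1")"
  printf 'WORKFLOW_TEMPORAL_TLS_KEY=%s\n' "$(b64_oneline "$2")"
}
```

Usage would be along the lines of `retool_tls_env client.crt client.pkcs8.key`, where both file names are placeholders, with the output pasted into whatever env file the compose setup reads.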

I deleted the WORKFLOW_TEMPORAL_SERVER_ROOT_CA_CRT env var. I think this was a complete red herring: it is documented, but the documentation is ambiguous as to whether it's required when using Temporal Cloud.

I now have successfully running workflows.

I hope this helps somebody