Scheduling failure: Workflows failing to run on Temporal Cloud

Summary: Every workflow run fails with "Scheduling failure" after 10 seconds.

I have a self-hosted (docker-compose) Retool organisation and a subscription to Temporal Cloud.

My suspicion is that the mTLS settings are not correct.

I have followed temporal's instructions on generating certs, giving me:

  • ca.crt
  • ca.key
  • client-and-worker.crt
  • client-and-worker.key

I have set the following env vars in ALL containers (to eliminate uncertainty):

  • WORKFLOW_TEMPORAL_SERVER_ROOT_CA_CRT → base64 encoded ca.crt
  • WORKFLOW_TEMPORAL_TLS_CRT → base64 encoded client-and-worker.crt
  • WORKFLOW_TEMPORAL_TLS_KEY → base64 encoded client-and-worker.key
  • WORKFLOW_TEMPORAL_CLUSTER_NAMESPACE=xxx.yyy (my namespace)
  • WORKFLOW_TEMPORAL_CLUSTER_FRONTEND_HOST=xxx.yyy.tmprl.cloud
  • WORKFLOW_TEMPORAL_CLUSTER_FRONTEND_PORT=7233
  • WORKFLOW_TEMPORAL_TLS_ENABLED=true
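
For reference, the base64 values in the list above can be produced along these lines. This is only a sketch: env_line is a hypothetical helper name, docker.env is an assumed env-file name, and piping through tr is just one way to get single-line output regardless of how the local base64 wraps lines.

```shell
# env_line NAME FILE
# Prints NAME=<base64 of FILE on a single line>, the form used in an env file.
# env_line is a hypothetical helper, not part of Retool or Temporal.
env_line() {
  # tr removes the line wraps that some base64 implementations insert
  printf '%s=%s\n' "$1" "$(base64 < "$2" | tr -d '\n')"
}

# Usage (docker.env is an assumed file name):
#   env_line WORKFLOW_TEMPORAL_TLS_CRT client-and-worker.crt >> docker.env
#   env_line WORKFLOW_TEMPORAL_TLS_KEY client-and-worker.key >> docker.env
```

Encoding the whole PEM file, header and footer lines included, keeps the value decodable back to exactly the file that was generated.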

My namespace (in temporal) has the full raw ca.crt in its CA section.

I can successfully connect via telnet from the workflows-worker container to the tmprl.cloud frontend host on port 7233.

Ultimately, I cannot find any combination of vars that lets me actually run a workflow. Every attempt ends in a timeout, visible in the UI (see attached) and in the logs for the workflows-worker container:

Ran into error when scheduling UpdateTimedOutWorkflows Error: Failed to connect before the deadline

I suspect this is down to mTLS issues (wrong key format, wrong key type, wrong encoding, etc.).

Finally: I have spun up a docker ubuntu container on the same box and run the temporal hello-world with the same values. This successfully connects a worker.

Could anybody shed any light on what might be going on here?

UPDATE: I've managed to "fix" this. Unfortunately, I didn't do it particularly scientifically. All info is included below.

After piecing together info from various sources, I had a pretty clear picture that this was mTLS related. In particular, by enabling:

  • GRPC_VERBOSITY=DEBUG
  • GRPC_TRACE=all

I could see that requests to my Temporal cluster were terminating with "connection closed with error unable to get local issuer certificate".

So I went back to the temporal docs, in particular this page: Certificate management - Temporal Cloud feature guide | Temporal Platform Documentation

That page contains two options for generating certs: (1) using tcld and (2) using certstrap.

However, the tcld certs don't seem to work when provided as values for the CRT/KEY env vars. They also seem to be much smaller (in bytes) than the certstrap certs: tcld gives an ECDSA 384 key, while certstrap gives a 2048-bit RSA key.

Also, the tcld certs seem to have the same DN data for the CA and the client, which is explicitly called out as an issue in the docs. When calling certstrap, I included --ou retool when requesting the cert, which makes the client DN differ from the CA's.

Also (again, unscientific - ymmv), the certstrap instructions on that page suggest running the following:
openssl pkcs8 -topk8 -inform PEM -outform PEM -in <out/your-namespace.key> -out <out/your-namespace.pkcs8.key> -nocrypt
to generate a PKCS#8 key instead of a PKCS#1 key, which I used this time (successfully).
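
A quick way to confirm which format a key file is in, before and after running that conversion, is to look at its PEM header. The sketch below assumes a PEM-encoded key file; key_format is a hypothetical helper name, and the strings it matches are the standard PEM labels for PKCS#8, PKCS#1 (RSA), and SEC1 (EC) private keys.

```shell
# key_format FILE
# Classifies a PEM private key by its header line.
# key_format is a hypothetical helper, not an openssl or certstrap command.
key_format() {
  # Take the first PEM header that mentions a private key
  # (this skips a leading "EC PARAMETERS" block if one is present)
  header=$(grep -m 1 -e '-----BEGIN .*PRIVATE KEY-----' "$1" || true)
  case "$header" in
    "-----BEGIN PRIVATE KEY-----")           echo "pkcs8" ;;
    "-----BEGIN ENCRYPTED PRIVATE KEY-----") echo "pkcs8-encrypted" ;;
    "-----BEGIN RSA PRIVATE KEY-----")       echo "pkcs1-rsa" ;;
    "-----BEGIN EC PRIVATE KEY-----")        echo "sec1-ec" ;;
    *)                                       echo "unknown" ;;
  esac
}

# Usage (substitute your own path):
#   key_format out/your-namespace.pkcs8.key
```

If the converted file still reports pkcs1-rsa, the conversion output was probably not the file that ended up base64 encoded.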

In short, the things that worked:

  • used certstrap not tcld
  • added --ou retool to the ./certstrap request-cert call
  • made the expiry of the client cert strictly less than the expiry of the CA cert
  • converted the client key to pkcs8
  • converted both key and crt to base64 (cat | base64 -w0) and set them as the two env vars:
    • WORKFLOW_TEMPORAL_TLS_CRT
    • WORKFLOW_TEMPORAL_TLS_KEY
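
To check the DN and expiry points from the list above, the relevant fields can be printed for each cert and compared by eye. This is a rough sketch rather than a definitive check: cert_summary is a hypothetical helper name, and the file names in the usage lines are the ones from my original post, so substitute your own paths.

```shell
# cert_summary FILE
# Prints the subject, issuer and expiry date of a PEM certificate.
# cert_summary is a hypothetical helper wrapping the openssl x509 command.
cert_summary() {
  openssl x509 -in "$1" -noout -subject -issuer -enddate
}

# Usage: the client subject should differ from the CA subject, and the
# client notAfter date should be earlier than the CA notAfter date.
#   cert_summary ca.crt
#   cert_summary client-and-worker.crt
```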

I deleted the WORKFLOW_TEMPORAL_SERVER_ROOT_CA_CRT env var. I think this was a complete red herring: it's documented over here, but the documentation is ambiguous as to whether it's required when using Temporal Cloud.

I now have successfully running workflows.

I hope this helps somebody.