RetoolRPC polling architecture doesn't scale with multiple resources — 503 errors and exponential backoff

We're using retoolrpc@0.1.7 (Node.js) to connect our backend to multiple Retool apps. Each app has its own RPC resource, and we currently run 13 resources in a single Node.js process deployed on Kubernetes.

Each RPC resource creates its own agent that continuously polls popQuery every 1 second. With 13 resources across 2 pods, we're making 26 HTTP requests per second to Retool's server — just for polling, even with zero user activity. That's over 2 million polling calls per day.

As we add more Retool apps (and therefore more RPC resources), this grows linearly. At our previous count of 25 resources (before consolidation), we were hitting 50 requests/sec or 4.3 million calls/day.
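The arithmetic is simple: baseline load is resources × pods ÷ polling interval. A throwaway helper (ours, not part of retoolrpc) makes the numbers above explicit:

```typescript
// Hypothetical helper (not part of retoolrpc) to estimate baseline
// popQuery load: one request per resource, per pod, per interval.
function pollingLoad(resources: number, pods: number, intervalSec: number) {
  const perSecond = (resources * pods) / intervalSec;
  return { perSecond, perDay: perSecond * 86_400 };
}

console.log(pollingLoad(13, 2, 1)); // current: 26 req/s, ~2.25M req/day
console.log(pollingLoad(25, 2, 1)); // before consolidation: 50 req/s, 4.32M req/day
```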

Symptoms we're seeing

  1. 503 (Service Unavailable) responses from the popQuery endpoint — Retool's server appears to be rate-limiting or struggling with the polling volume from our agents.

  2. Exponential backoff spiral — When popQuery returns a 503, the library's loopWithBackoff catches the error and enters exponential backoff (50ms → 100ms → ... up to 10 minutes). During backoff, the agent stops polling and any RPC calls from Retool go undelivered.

  3. pollingTimeoutMs (default 5s) compounds the problem — The popQuery call is a long-poll that intentionally holds the connection open. The default 5s timeout is shorter than the typical hold time, causing false timeouts that trigger additional backoff on top of the 503-induced backoff. We resolved this specific issue by increasing pollingTimeoutMs to 30s, but the 503s remain.

  4. Intermittent 120s RPC timeouts — Users intermittently see RPC calls hang and fail after the 120s timeout because the call never reaches our server. These failures correlate with agents being stuck in backoff from the 503/timeout errors above.
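To illustrate symptom 2, here is an approximation of the doubling schedule described above. This sketches the behavior we observe, not retoolrpc's actual loopWithBackoff code:

```typescript
// Approximation of the backoff schedule in symptom 2: the delay doubles
// per consecutive failure, from 50 ms up to a 10-minute cap.
// (A sketch of observed behavior, not retoolrpc's actual implementation.)
function backoffDelayMs(consecutiveFailures: number): number {
  const baseMs = 50;
  const capMs = 10 * 60 * 1000; // 10 minutes
  return Math.min(baseMs * 2 ** (consecutiveFailures - 1), capMs);
}
```

A burst of 503s escalates quickly: by the 11th consecutive failure the agent is already sleeping ~51 seconds between polls, and it reaches the 10-minute cap a few failures later. Any RPC issued during that window is never picked up.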

What we've done on our side

  • Increased pollingTimeoutMs from 5s to 30s — eliminated false timeout errors

  • Removed unused legacy resources — reduced from 25 to 13 agents

  • Planning to increase pollingIntervalMs from 1s to 3s

These mitigations help, but the fundamental issue remains: every new RPC resource adds another always-on polling loop, and the architecture doesn't scale beyond a certain number of resources.
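For reference, here is roughly how the first and third mitigations look as agent configuration (constructor shape as we use it with retoolrpc@0.1.7; the host and resource ID below are placeholders):

```typescript
import { RetoolRPC } from 'retoolrpc';

// One agent per resource; only the two polling options changed.
const rpc = new RetoolRPC({
  apiToken: process.env.RETOOL_API_TOKEN!,
  host: 'https://our-org.retool.com', // placeholder
  resourceId: 'resource-id-goes-here', // placeholder
  pollingIntervalMs: 3000, // up from the 1000 ms default
  pollingTimeoutMs: 30000, // up from the 5000 ms default
});
```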

Questions for the Retool team

  1. Is there a rate limit on the popQuery endpoint? If so, what is it (per API token, per account, per IP)? Knowing the limit would help us plan capacity.

  2. Is there a push-based delivery model planned (WebSocket, webhooks, or server-sent events), where Retool notifies our server when a query arrives rather than our server polling every second? This would eliminate the scaling issue entirely.

  3. Can a single agent poll for multiple resource IDs in one request? Currently each resource requires its own popQuery call. If the API supported batching (e.g., "give me the next query for any of these 13 resources"), we could reduce 13 polls/sec to 1.

  4. Is there guidance on the recommended maximum number of RPC resources per server/account? We couldn't find scaling documentation for RPC.

If you have any other suggestions on which resource type would suit our requirements, please let us know.
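To make question 3 concrete, here is a purely hypothetical sketch of the batching we have in mind, from the agent's side. `fetchNextQuery` stands in for a batched popQuery endpoint that does not exist today; all names are illustrative:

```typescript
// Purely hypothetical: one poll loop serving many resources.
// `fetchNextQuery` stands in for a batched popQuery endpoint
// ("give me the next pending query for ANY of these resources").
type PendingQuery = { resourceId: string; queryUuid: string };
type BatchPoller = (resourceIds: string[]) => Promise<PendingQuery | null>;

async function pollOnce(
  resourceIds: string[],
  fetchNextQuery: BatchPoller,
  dispatch: (query: PendingQuery) => void,
): Promise<boolean> {
  // One long-poll request covers all N resources instead of N requests.
  const next = await fetchNextQuery(resourceIds);
  if (next === null) return false; // long-poll expired with no pending work
  dispatch(next); // hand off to our handler for next.resourceId
  return true;
}
```

With something like this, our 13 per-pod polling loops would collapse into one, cutting baseline load by 13x per pod.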