OpenAPI Intermittent Operation Not Found - Possible Caching Issue

OpenAPI Specification Caching Issue in Published Retool Apps

Goal

We're trying to reliably access new OpenAPI operations in published Retool apps. The expected behavior is that when we add a new operation to a query and publish our app, the published version should consistently recognize and execute that operation. Currently, we're experiencing a critical issue where published apps intermittently fail to recognize operations that were properly configured before publishing, returning "operation not found" errors for several hours after publishing.

Steps

  1. We've observed this issue in our production environment when deploying new versions of our Golang Huma service running on GCP.
  2. To systematically investigate and reproduce the issue, we've created a test service that dynamically adds and removes endpoints on a fixed schedule.
  3. The test service adds one new endpoint every hour and maintains a rolling 7-day window (168 hours), removing older endpoints.
  4. Reproduction steps:
    • Add https://dynamic-retool.tylerstiene.ca/openapi.json as an OpenAPI resource in Retool
    • Create a query component that uses one of the operations
    • Wait for a new endpoint to be added (happens hourly)
    • Update the query to use the new operation (this may require refreshing the resource first). Possibly a separate bug: refreshing the schema on the query does not always add the new operation even when the schema shows it (additional screenshot attached).
    • Test the query in the editor to confirm it works
    • Publish the app
    • Access the published version of the app
    • Try to execute the query with the new operation
    • Observe inconsistent behavior: sometimes it works, sometimes it fails with "operation not found"
    • This inconsistent behavior can persist for several hours after publishing
    • By the next day, the issue typically resolves itself
  5. We've captured video evidence of this issue occurring in real-time with a published app.
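For anyone following the reproduction steps, one way to confirm that a freshly fetched spec really contains the new operation is to list its operationId values. This is a minimal sketch that assumes the spec has already been parsed into a dict. The sample spec and the operation names in it are made up for illustration and are not taken from the demo service.

```python
from typing import Set

# Operation keys allowed on an OpenAPI 3.x path item.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def operation_ids(spec: dict) -> Set[str]:
    """Collect every operationId declared under the spec's `paths` object."""
    found = set()
    for path_item in spec.get("paths", {}).values():
        for method, operation in path_item.items():
            # Path items can also hold non-operation keys such as `parameters`.
            if method.lower() in HTTP_METHODS and isinstance(operation, dict):
                op_id = operation.get("operationId")
                if op_id:
                    found.add(op_id)
    return found


# Hypothetical spec fragment; real paths and operation names will differ.
sample_spec = {
    "openapi": "3.1.0",
    "paths": {
        "/hour-a": {"get": {"operationId": "get-hour-a"}},
        "/hour-b": {
            "parameters": [],
            "get": {"operationId": "get-hour-b"},
            "post": {"operationId": "post-hour-b"},
        },
    },
}

print(sorted(operation_ids(sample_spec)))
```

If the new operation appears in this set for the spec the service is currently serving, but the published app still reports "operation not found", the stale copy is somewhere on the client side rather than in the service.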

Details

Our investigation focuses specifically on the published app experience with dynamically changing OpenAPI specs:

  1. Network monitoring and production logs show some critical differences in behavior:
    • Retool editor: Fetches the OpenAPI spec often when editing.
    • Published app: Does not refetch the spec on operation calls, leading us to believe it is using a cached version.
  2. Log analysis and testing reveal a troubling pattern:
    • Operations that work perfectly in the editor may fail in the published app
    • The published app can alternate between successful calls and "operation not found" errors
    • This behavior can persist for hours after publishing
  3. Key characteristics of the issue:
    • Operations configured and verified in the editor fail in published apps
    • Inconsistent behavior: sometimes works, sometimes fails with the exact same configuration
    • The issue is time-dependent, typically resolving itself by the next day
    • This causes production outages when we deploy new API operations and update our Retool apps
  4. Our demo service (https://dynamic-retool.tylerstiene.ca) purposely:
    • Sets proper cache control headers (Cache-Control: no-cache, no-store, must-revalidate)
    • Logs all spec requests for monitoring purposes
    • Generates predictable, time-based endpoints to help reproduce the issue
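To spell out why the Cache-Control header matters here: with no-store and no-cache set, a well-behaved HTTP cache should not reuse the spec response without going back to the server, so any reuse we see would have to come from application-level caching. The sketch below is a simplified check that only looks at those two directives. The function names are hypothetical, and it is not a full implementation of HTTP caching rules.

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control header value into {directive: argument or None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def allows_reuse_without_revalidation(value: str) -> bool:
    """Simplified: only `no-store` and `no-cache` are considered."""
    directives = parse_cache_control(value)
    return not ("no-store" in directives or "no-cache" in directives)


demo_header = "no-cache, no-store, must-revalidate"
print(parse_cache_control(demo_header))
print(allows_reuse_without_revalidation(demo_header))
```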

Screenshots

I am unable to attach files as a new user. Screen captures are here: https://drive.google.com/drive/folders/1kSI4c6GmZ94eAfDrqcPphQ3aFIM8amBY?usp=drive_link

  1. Demo app showing the query working, followed by a 400 error.
  2. Video showing the query working and then failing in the published app.
  3. Screenshot showing the schema list failing to update in the query editor even though the schema shows the new operation. Probably a separate bug?

I've recorded a video demonstration that clearly shows this issue occurring in real-time with a published app, which I can share upon request. The video shows how the same operation works in the editor but fails intermittently in the published app after publishing.

App JSON

All you need is a one-button app hooked up to a query that uses the new operation. Here’s a minimal example of the app JSON export.

Impact

This issue is causing significant problems for our development workflow and user experience:

  1. Production Outages: When we deploy new API operations and update our Retool apps to use them, published apps fail unpredictably for hours.

  2. Developer Frustration: The operations work perfectly in the editor, pass all tests, but then fail in production with no clear pattern.

  3. User Confusion: End users receive "operation not found" errors for features that should be working, with no apparent solution other than waiting.

  4. Deployment Delays: We're forced to deploy backend API changes at least 24 hours before updating Retool apps to ensure stability.

Hello @Tyler_Stiene,

Apologies for this issue. We have built a reproduction app and are waiting for a new operation to be added at the top of the hour, to confirm that the new operation works in the app editor but produces an error in a live published app.

Thank you so much for the detailed write up, providing the steps to reproduce and setting everything up. The Retool team and I greatly appreciate the attention to detail.

I should be able to escalate this to our core-data engineering team manager as soon as the repro app confirms the behavior, so that they and their team can investigate further and hopefully resolve it shortly.

I am super confused as to how this error would pop up inconsistently :thinking: I would think that it would fail repeatedly and then, once a new cache is created, it would have the new operation added to the list.

My team and I were going through our code base for the OpenAPI DB connector and did find a cache for the options spec list with a 10-minute TTL, so I have a hunch that may be the culprit.
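To make that expectation concrete, here is a minimal sketch of how a single TTL cache would behave. It is not Retool's actual code: the `TTLCache` class, the operation names, and the fake clock are all hypothetical, and only the 10-minute TTL comes from what we found. With one cache, the new operation should fail consistently until the TTL elapses and then succeed from that point on.

```python
from typing import Callable, Optional, Set


class TTLCache:
    """Caches the result of `fetch` and refetches once `ttl_seconds` have passed."""

    def __init__(self, ttl_seconds: float, fetch: Callable[[], Set[str]],
                 clock: Callable[[], float]):
        self.ttl_seconds = ttl_seconds
        self.fetch = fetch
        self.clock = clock
        self._value: Optional[Set[str]] = None
        self._fetched_at = 0.0

    def get(self) -> Set[str]:
        now = self.clock()
        expired = self._value is None or now - self._fetched_at >= self.ttl_seconds
        if expired:
            self._value = self.fetch()
            self._fetched_at = now
        return self._value


# Simulated upstream spec (as a set of operation names) and a controllable clock.
upstream = {"op-1"}
now = [0.0]
cache = TTLCache(600, lambda: set(upstream), lambda: now[0])

observations = []
observations.append("op-2" in cache.get())  # t=0s: cache filled, op-2 not published yet
upstream.add("op-2")                         # the spec changes upstream
now[0] = 300.0
observations.append("op-2" in cache.get())  # t=300s: still served from the stale cache
now[0] = 600.0
observations.append("op-2" in cache.get())  # t=600s: TTL elapsed, spec is refetched
print(observations)  # [False, False, True]
```

A single cache like this gives a clean "fails, then works" transition, which does not match the alternating behavior in the report.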

Will report back with more details shortly!

Welcome to the community, @Tyler_Stiene!

The root cause of this wasn't immediately obvious, but it's likely tied to the way swagger clients are being cached on individual backend servers. There are a couple of things we can do to improve this, but we don't yet have a concrete implementation timeline. As soon as that changes, I can provide an update here!
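As a rough illustration of that hypothesis, the sketch below shows how per-server caching could produce alternating results. It is a simplified model and not our actual infrastructure: the `BackendServer` class, the server names, the operation names, and the round-robin routing are all assumptions made for the example.

```python
import itertools


class BackendServer:
    """One backend holding its own in-memory set of known operations."""

    def __init__(self, name, cached_operations):
        self.name = name
        self.cached_operations = set(cached_operations)

    def call(self, operation_id):
        if operation_id not in self.cached_operations:
            return f"{self.name}: operation not found"
        return f"{self.name}: ok"

    def refresh(self, current_operations):
        self.cached_operations = set(current_operations)


def simulate(servers, operation_id, request_count):
    """Route requests round-robin across servers, as a load balancer might."""
    rotation = itertools.cycle(servers)
    return [next(rotation).call(operation_id) for _ in range(request_count)]


old_spec = {"op-1", "op-2"}
new_spec = {"op-1", "op-2", "op-3"}

servers = [
    BackendServer("server-a", new_spec),  # built its client after the spec changed
    BackendServer("server-b", old_spec),  # still holds a client built from the old spec
]

results = simulate(servers, "op-3", 4)
for line in results:
    print(line)
```

In this model the same request succeeds or fails depending on which server handles it, and the errors stop only once every server has refreshed its copy.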

Hey everyone,

We ran into the exact same intermittent error in Retool after upgrading to @nestjs/swagger v8.0.0 (on NestJS v11). It works properly from any HTTP client, but in Retool our endpoints are randomly throwing "Operation XXX not found."

Appreciate any tips!

Thanks for reaching out, Omar! Welcome to the community. :slightly_smiling_face:

Do you primarily notice this after publishing changes to your API spec? As mentioned previously, the issue is likely tied to inconsistent caching across different pieces of our infrastructure. We don't yet have a timeline for the planned improvements, but I'll keep you in the loop.

If you have particularly critical endpoints, it might make sense to refactor them as REST API requests, instead.

Thanks Darren :pray:

Actually we haven't changed our specs; we're still on 3.0.0. The issue only started after upgrading NestJS/Swagger from v10 to v11.

Migrating the apps to a REST API resolved it, but it was quite a heavy lift given the number of apps involved.

Hopefully this gets resolved quickly.