Azure CosmosDB "pipeline not supported", MongoDB Atlas $lookup exceeds 16793600 bytes and Retool API request timeout is 2 minutes... What are my options?

A bit of background:

I have the following data structure in my MongoDB database:

sites
pages
mediablocks

Each site document (in the sites collection) has a pages property: an array of ObjectIds referencing documents in the pages collection.
Example:

sites: [{
  _id: ObjectId('1234567890'),
  ...
  pages: [ 
    ObjectId('5678901234'),
    ObjectId('6789012345')
    ...
  ]
}]

The same applies to pages (in the pages collection), which reference mediaBlocks (in the mediablocks collection):

pages: [{
  _id: ObjectId('1234567890'),
  ...
  mediaBlocks: [ 
    ObjectId('5678901234'),
    ObjectId('6789012345')
    ...
  ]
}]

(this is described in my mongoplayground link below)

I wish to perform an aggregation to return all of the documents with nested information.

My aggregation looks like this:

db.sites.aggregate([
  {
    $lookup: {
      from: "pages",
      localField: "pages",
      foreignField: "_id",
      as: "pages",
      pipeline: [
        {
          $lookup: {
            from: "mediablocks",
            localField: "mediaBlocks",
            foreignField: "_id",
            as: "mediaBlocks"
          }
        }
      ]
    }
  }
])

The output looks like this:

[
  {
    "_id": ObjectId("5e607b4743e2640056467509"),
    "createdAt": ISODate("2020-03-05T04:08:39.365Z"),
    "pages": [
      {
        "_id": ObjectId("5de543144a90390055a71df7"),
        "createdAt": ISODate("2022-05-11T21:03:06.085Z"),
        "mediaBlocks": [
          {
            "_id": ObjectId("60a5e0e89030f50037cb6671"),
            "createdAt": ISODate("2019-11-11T13:32:35.826Z"),
            "normalizedText": "some text from example.com",
            "rawText": "Some text from example.com",
            "status": "translated",
            "updatedAt": ISODate("2019-11-20T12:27:36.156Z")
          }
        ],
        "title": "Some page on Example.com",
        "translated": true,
        "updatedAt": ISODate("2022-07-28T04:16:52.449Z"),
        "uri": "http://example.com/somepage"
      }
    ],
    "title": "Example.com",
    "updatedAt": ISODate("2022-07-08T07:02:57.867Z"),
    "url": "http://example.com"
  }
]

Eventually, I want to return a high-level overview of the data rather than all of the fields.

An example of the data I'd like to see is:

[
    {
        "totalMediablocks": 208762,
        "translatedMediablocks": 89550,
        "untranslatedMediablocks": 8,
        "irrelevantMediablocks": 1106,
        "inReviewMediablocks": 4,
        "processingMediablocks": 0,
        "translatedMediablockWords": 574748,
        "untranslatedMediablockWords": 143,
        "uniqueTotalMediablocks": 16712,
        "uniqueTranslatedMediablocks": 16563,
        "uniqueUntranslatedMediablocks": 6,
        "uniqueIrrelevantMediablocks": 139,
        "uniqueInReviewMediablocks": 4,
        "uniqueProcessingMediablocks": 0,
        "uniqueTranslatedMediablockWords": 243958,
        "uniqueUntranslatedMediablockWords": 101,
        "url": "https://www.example.com",
        "numberOfPages": 221
    },
    ...

Here, totalMediablocks is the number of mediaBlocks belonging to every page of a site, and so on...

I was hoping to build the above after fetching all of the data. In any case, my aggregation works perfectly fine in Mongo Playground.

When I try to run the same aggregation in Retool, I get the following output:

pipeline not supported

I believe this is because my MongoDB Data Source is Azure's CosmosDB (MongoDB API).
The version stated on my Azure Portal is 4.2.
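For what it's worth, the concise $lookup syntax that combines localField/foreignField with pipeline was only added in MongoDB 5.0, which may be why a server reporting 4.2 rejects the stage. Below is a sketch of a pre-5.0 equivalent using let and $expr (untested against CosmosDB, which may not support pipeline-style $lookup at all):

```javascript
// 4.2-compatible rewrite of the nested join: instead of combining
// localField/foreignField with pipeline (a 5.0+ feature), pass the
// referencing arrays in via `let` and match on them with $expr.
const pipeline = [
  {
    $lookup: {
      from: "pages",
      let: { pageIds: "$pages" },
      pipeline: [
        // Keep only the pages referenced by this site
        { $match: { $expr: { $in: ["$_id", "$$pageIds"] } } },
        {
          $lookup: {
            from: "mediablocks",
            let: { blockIds: "$mediaBlocks" },
            pipeline: [
              { $match: { $expr: { $in: ["$_id", "$$blockIds"] } } }
            ],
            as: "mediaBlocks"
          }
        }
      ],
      as: "pages"
    }
  }
];
// Usage: db.sites.aggregate(pipeline)
```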

When I try to use a copy of my data from MongoDB Atlas, I get a different error:

PlanExecutor error during aggregation :: caused by :: Total size of documents in pages matching pipeline's $lookup stage exceeds 16793600 bytes

The data is just too big!
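One way to stay under that size limit is to never materialize the joined documents at all: compute the summary server-side with $unwind/$group so only one small row per site leaves the database. A rough sketch (untested against this schema; field names like status and url are taken from the sample documents above, and more $cond counters would be needed for the remaining metrics):

```javascript
// Per-site summary computed entirely in the aggregation pipeline.
// Note: $unwind drops sites/pages with empty reference arrays; add
// preserveNullAndEmptyArrays if those should still appear in the output.
const statsPipeline = [
  { $unwind: "$pages" },
  { $lookup: { from: "pages", localField: "pages", foreignField: "_id", as: "page" } },
  { $unwind: "$page" },
  { $unwind: "$page.mediaBlocks" },
  { $lookup: { from: "mediablocks", localField: "page.mediaBlocks", foreignField: "_id", as: "block" } },
  { $unwind: "$block" },
  {
    $group: {
      _id: "$_id",
      url: { $first: "$url" },
      totalMediablocks: { $sum: 1 },
      translatedMediablocks: {
        $sum: { $cond: [{ $eq: ["$block.status", "translated"] }, 1, 0] }
      },
      pageIds: { $addToSet: "$page._id" }
    }
  },
  { $addFields: { numberOfPages: { $size: "$pageIds" } } },
  { $project: { pageIds: 0 } }
];
// Usage: db.sites.aggregate(statsPipeline)
```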

I attempted to work around this by creating an API endpoint which processes the same information programmatically (it fetches all the sites, then for each site sends off a request for all of its pages, and so on), but that takes just under 3 minutes to return the output I desire... and the default timeout for queries in Retool is 2 minutes (120 seconds).

What are my options here? How should I proceed?

Hey @krynv!

Great questions! Two quick ones before we dive deeper.

  1. Would pagination work for you?
    https://docs.retool.com/docs/data-in-tables#pagination
    http://community.retool.com/t/mongodb-server-side-pagination/5144

  2. To clarify: if you just ran a simple query, would you be able to write some JS in Retool to restructure the data once we've grabbed it all, instead of using Mongo methods?

Hi @victoria, thanks for your reply.

I'm not sure pagination is the best approach for the statistics we're after, as the questions we'd like to answer with our data are:

"How many words belong to a specific site?"
In other words, counting individual words in the mediablock.normalizedText property.

"How many mediablocks have the status of 'translated' for site X?"
In other words, counting the number of mediablock objects which have a status of 'translated', that belong to every page on that site.
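Both of those metrics are straightforward to compute client-side once a site's mediablocks are in hand. A minimal sketch (assuming an array of mediablock objects shaped like the sample output above):

```javascript
// Word and status counts over an array of mediablock documents,
// e.g. in a Retool transformer after the raw data has been fetched.
function mediablockStats(blocks) {
  // Count whitespace-separated words, tolerating missing/blank text
  const wordCount = (text) =>
    (text || "").trim().split(/\s+/).filter(Boolean).length;
  return {
    totalWords: blocks.reduce((sum, b) => sum + wordCount(b.normalizedText), 0),
    translated: blocks.filter((b) => b.status === "translated").length
  };
}

// Example with the sample mediablock from above:
mediablockStats([
  { normalizedText: "some text from example.com", status: "translated" }
]);
// → { totalWords: 4, translated: 1 }
```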

With pagination, we wouldn't be able to get all of the data we need to answer those questions; we'd only be able to look at the first X results and show statistics on those, as opposed to the full picture. I hope this makes sense?

To answer your second question, we're more than capable of doing the processing client-side (via Retool). That's not an issue at all.

I suppose one way to solve this would be to have a query for every metric we're looking for, e.g.:

  • Count all mediablocks for all sites (return the result)
  • Count all translated mediablocks for all sites (return the result)
  • Count all untranslated mediablocks for all sites (return the result)
    ...
  • Count the number of pages for all sites (return the result)

Then combine those into one big object for table output, using JS on Retool.
Is that what you're suggesting?
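For reference, the combining step could look something like this in a Retool JS transformer (the shapes here are hypothetical, assuming each metric query returns one { url, value } row per site):

```javascript
// Merge the per-metric query results into one object per site,
// keyed by the site's url, ready for a Retool table.
function combineMetrics(results) {
  const bySite = {};
  for (const { metric, rows } of results) {
    for (const { url, value } of rows) {
      bySite[url] = bySite[url] || { url };
      bySite[url][metric] = value;
    }
  }
  return Object.values(bySite); // one row per site
}

// combineMetrics([
//   { metric: "totalMediablocks", rows: [{ url: "https://www.example.com", value: 208762 }] },
//   { metric: "numberOfPages", rows: [{ url: "https://www.example.com", value: 221 }] }
// ])
// → [{ url: "https://www.example.com", totalMediablocks: 208762, numberOfPages: 221 }]
```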

That would probably help to reduce latency! Anything that involves pulling a bunch of data slows apps down, generally. If something like that would work for you, let me know and I'd be happy to help with the JS if needed! :slight_smile: