A bit of background:
I have the following data structure in my MongoDB database:
sites
pages
mediablocks
Each site
document (in the site
collection) has a pages
property which is an array that contains ObjectId
s.
Example:
sites: [{
_id: ObjectId('1234567890'),
...
pages: [
ObjectId('5678901234'),
ObjectId('6789012345')
...
]
}]
The same applies for pages
(pages
collection) and mediaBlocks
(in reference to the mediablocks
collection):
pages: [{
_id: ObjectId('1234567890'),
...
mediaBlocks: [
ObjectId('5678901234'),
ObjectId('6789012345')
...
]
}]
(this is described in my mongoplayground link below)
I wish to perform an aggregation to return all of the documents with nested information.
My aggregation looks like this:
db.sites.aggregate([
{
$lookup: {
from: "pages",
localField: "pages",
foreignField: "_id",
as: "pages",
pipeline: [
{
$lookup: {
from: "mediablocks",
localField: "mediaBlocks",
foreignField: "_id",
as: "mediaBlocks"
}
}
]
}
}
])
The output looks like this:
[
{
"_id": ObjectId("5e607b4743e2640056467509"),
"createdAt": ISODate("2020-03-05T04:08:39.365Z"),
"pages": [
{
"_id": ObjectId("5de543144a90390055a71df7"),
"createdAt": ISODate("2022-05-11T21:03:06.085Z"),
"mediaBlocks": [
{
"_id": ObjectId("60a5e0e89030f50037cb6671"),
"createdAt": ISODate("2019-11-11T13:32:35.826Z"),
"normalizedText": "some text from example.com",
"rawText": "Some text from example.com",
"status": "translated",
"updatedAt": ISODate("2019-11-20T12:27:36.156Z")
}
],
"title": "Some page on Example.com",
"translated": true,
"updatedAt": ISODate("2022-07-28T04:16:52.449Z"),
"uri": "http://example.com/somepage"
}
],
"title": "Example.com",
"updatedAt": ISODate("2022-07-08T07:02:57.867Z"),
"url": "http://example.com"
}
]
I eventually want to return a high-level outlook of the data, not see all of the fields etc.
An example of the data I'd like to see is:
[
{
"totalMediablocks": 208762,
"translatedMediablocks": 89550,
"untranslatedMediablocks": 8,
"irrelevantMediablocks": 1106,
"inReviewMediablocks": 4,
"processingMediablocks": 0,
"translatedMediablockWords": 574748,
"untranslatedMediablockWords": 143,
"uniqueTotalMediablocks": 16712,
"uniqueTranslatedMediablocks": 16563,
"uniqueUntranslatedMediablocks": 6,
"uniqueIrrelevantMediablocks": 139,
"uniqueInReviewMediablocks": 4,
"uniqueProcessingMediablocks": 0,
"uniqueTranslatedMediablockWords": 243958,
"uniqueUntranslatedMediablockWords": 101,
"url": "https://www.example.com",
"numberOfPages": 221
},
...
Where: totalMediablocks
is the count of the number of mediaBlocks
that belong to every page
of a site
and so on...
I was hoping to build the above after fetching all of the data but nonetheless, my aggregation works perfectly fine (in Mongo Playground).
When I try to run the same aggregation in Retool, I get the following output:
pipeline not supported
I believe this is because my MongoDB Data Source is Azure's CosmosDB (MongoDB API).
The version stated on my Azure Portal is 4.2
.
When I try to use a copy of my data from MongoDB Atlas, I get a different error:
PlanExecutor error during aggregation :: caused by :: Total size of documents in pages matching pipeline's $lookup stage exceeds 16793600 bytes
The data is just too big!
I attempted to rectify this by creating an API endpoint which processes the same information programatically (fetches all the sites
then for each site
, send off a request for all of the pages
and so on), but that takes just under 3 minutes to return the output I desire... and the default timeout on Retool for queries is 2 minutes (120 seconds).
What are my options here? How should I proceed?