Workflow memory leak: increased runtimes and memory usage

I'm running into an issue where my workflow's memory usage and runtime increase with each scheduled run. Only the last JS code block and loop block, where I'm sending emails and updating records, are increasing in runtime; all the other blocks run in milliseconds.
image

I've been tasked with sending approximately 1.7 million emails to our users. It's a general policy update that everyone legally needs to receive.

The first pattern I saw was a JS code block whose runtime increased with each run, adding a couple of seconds every time. This JS code block ran a loop that called the Resend batch email API endpoint, sending an email object with HTML markup, 50 email addresses, a subject, and a from field, and then called a workflow function to batch-update records that were inserted into a Postgres table. Here's that current code block:

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const timestamp = moment().format();

const results = [];
for (const value of create100EmailObjects.data) {
  // Trigger the email sending function and wait for it to complete
  const result = await sendEmailLoopBlock_lambda.trigger();

  // create objects to batch update the current chunk of emails going out
  const emailsForQuery = value.map((email) => ({
    email: email,
    status: "sent",
    sent_at: timestamp,
    error_message: "none",
  }));

  await updateUserRecords(emailsForQuery);

  results.push(result);

  await delay(150);
}

// Return all the results once the loop is done
return results;

This is all in a loop block currently, with the lambda being a REST API request to Resend's batch email endpoint. This all originally happened in a single JS code block, and I then broke the process out into separate blocks, not knowing whether that would remove the memory issue. That did not work either.
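
For context, here's a minimal sketch of what one of those batch requests could look like for a single chunk of addresses. The variable names (currentChunk, policyUpdateHtml, RESEND_API_KEY) and the from/to/subject values are placeholders, not my exact block config; the endpoint and field names follow Resend's batch send API.

// Rough sketch of the request the loop's lambda makes for one chunk of addresses.
// currentChunk, policyUpdateHtml, and RESEND_API_KEY are placeholder names.
const emailObject = {
  from: "Policy Updates <updates@example.com>", // placeholder sender
  to: ["no-reply@example.com"],                 // placeholder visible recipient
  bcc: currentChunk,                            // the ~49 user addresses for this iteration
  subject: "Policy update",
  html: policyUpdateHtml,                       // placeholder for the email markup
};

const response = await fetch("https://api.resend.com/emails/batch", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${RESEND_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify([emailObject]), // the batch endpoint takes an array of email objects
});

return response.json();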

The first blocks in the workflow query the 1.7 million emails from a Redshift table, while another block runs simultaneously to query the PostgreSQL table that records the emails that have already been sent. I filter out the users who have already received the email, and then, in another JS code block, I build a large array of email arrays from 4,900 addresses. Each array holds 49 addresses that go into the BCC field of an email object sent to the batch Resend request. I loop through those 100 objects so I can handle any errors that occur during sending.
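
The chunking block itself is simple; a minimal sketch of what it does, assuming the filtered result is a flat array of address strings:

// Minimal sketch of the chunking block (create100EmailObjects), assuming
// filterUsersReceivedEmail.data is a flat array of email address strings.
const CHUNK_SIZE = 49;
const addresses = filterUsersReceivedEmail.data;

const chunks = [];
for (let i = 0; i < addresses.length; i += CHUNK_SIZE) {
  chunks.push(addresses.slice(i, i + CHUNK_SIZE));
}

// 4,900 addresses -> 100 arrays of 49, which the loop block iterates over
return chunks;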

I've tested different numbers of emails per run: 49, 490, 980, 1,960, and 4,900. No matter how I refactor the workflow itself, there is a slow increase in the memory used on each run, per the screenshot.

Has anyone run into something similar? Am I just handling large amounts of data incorrectly?

Any input is appreciated! Thank you!


Hey @micriver - that's a lot of emails!

I spent some time digging into this and have some insights to share. First, the ~86MB usage you see here is actually a rough measure of data transferred to and from your workflow, not of memory used. It's a holdover from when our pricing model for workflows was based on ingress/egress rather than on the number of runs. Some degree of variation between runs is natural, so I don't believe there's any underlying issue there.

It's much harder to say what might be causing an observed increase in runtime for that particular block. In your testing, how many times have you let the workflow execute, and how much has the runtime changed? Are you able to share that data?

Okay, that definitely answers the memory question! However, we're still left with the increasing runtime for the JS block. I'm not sure what's happening there, since the number of objects being looped through doesn't change from run to run.

I've let the workflow run several times since Thursday night (9/14). Since the first run, the block's execution time has gone from 618 seconds to 732 seconds as of this morning, an increase of roughly 114 seconds. With 516 runs so far, that's an average increase of about 221 milliseconds per run. I'm having to extend the intervals between my scheduled runs because they're overlapping, and I'm now sending fewer emails per hour. What could be causing the increase in execution time?

Here is the first run:
image

Here is the latest run and all the blocks' runtimes (the sendEmailLoopBlock is the last block; it's not even showing milliseconds anymore, which could be due to my browser):
image


How many emails have you been sending per workflow run? It must be less than the 4900 you described originally because otherwise you'd be done!

And which block is seeing the biggest increase in runtime? Just based on the code you've shared, filterUsersReceivedEmail would conceivably take more and more time to run as the length of getUsersFromPostgres.data increases, but 200ms seems disproportionately large.
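
If that filter does turn out to grow with the sent-records table, one option is to build a Set of already-sent addresses once and check membership against it. A rough sketch, assuming flat result arrays with an email column (the Redshift block and column names here are guesses, adjust to your workflow):

// Rough sketch: O(1) membership checks instead of rescanning the sent list per user.
// getUsersFromRedshift is a hypothetical block name.
const sentEmails = new Set(
  getUsersFromPostgres.data.map((row) => row.email) // assumed column name
);

return getUsersFromRedshift.data.filter(
  (row) => !sentEmails.has(row.email) // assumed column name
);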


Ah yes! Since your reply last week, I've refactored the workflow to send 2,000 emails per run, because the runtime was exceeding the workflow's time limit. It doesn't send exactly 2,000 each time, though, because some addresses are removed after validation, so it's roughly 2,000 on average.

The block seeing the increased runtime is the sendEmailLoopBlock, but for some reason it's showing 0ms now and I'm not sure why:
image

Also, as a side note, this is a self-hosted Temporal instance, so I'm not sure if that could be part of what's happening here. At first glance, though, the logs we see on the backend match what we see on the frontend.


Hmm very interesting! Do all runs in your history show 0ms for that particular block? The one in your screenshot seems to indicate that it's still in progress.

As far as next steps are concerned, nothing obvious jumps out to me. Given that you're self-hosted, I'd probably take a look at the logs for your code-executor container. That's where this code block would run and, in theory, where we might be able to glean additional information.

The other suggestion is to break up this single block into a sequence of smaller steps. This could allow us to see more specifically where the slowdown is happening.
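
A lighter-weight alternative is to timestamp each stage inside the existing loop and log the timings, so the run output shows which stage is growing. A rough sketch based on the code you shared earlier (same variable names, batch-update objects unchanged):

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const timestamp = moment().format();

const results = [];
const timings = [];

for (const value of create100EmailObjects.data) {
  const t0 = Date.now();
  const result = await sendEmailLoopBlock_lambda.trigger(); // send one batch
  const t1 = Date.now();

  const emailsForQuery = value.map((email) => ({
    email,
    status: "sent",
    sent_at: timestamp,
    error_message: "none",
  }));
  await updateUserRecords(emailsForQuery); // batch-update the sent records
  const t2 = Date.now();

  timings.push({ send_ms: t1 - t0, update_ms: t2 - t1 });
  results.push(result);
  await delay(150);
}

// Log the per-iteration timings so they show up in the run / code-executor logs
console.log(JSON.stringify(timings));
return results;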
