Trouble Updating Rows in BigQuery Table via Retool

Hi, I have table emailDuplicatesTable in Retool that shows lead duplicates in the database.
The source of this table is BigQuery.
I try to create GUI Query that add specific values in the source BigQuery table when user clicks button "Delete duplicates".
So, I need that columns manually_deleted, reason_for_deletion and deleted_by will be updated and the following values will be added:

manually_deleted: moment().format('YYYY-MM-DD HH:mm:ss'),
reason_for_deletion: 'email_duplicate',
deleted_by: current_user.firstName,

I only need this to be done for rows that do not have 1 in rn column in table emailDuplicatesTable.

I'm trying following query, but this return error -

  • message:"Query timed out after 120000ms"
{{
  emailDuplicatesTable.data.map(row => {
    if (row.rn !== '1') {
      return {
        ...row,
        manually_daleted: moment().format('YYYY-MM-DD HH:mm:ss'),
        reason_for_deletion: 'email_duplicate',
        deleted_by: current_user.firstName,
      };
    }
    return row;
  }).map(row => {
    return {
      manually_daleted: row.manually_daleted,
      reason_for_deletion: row.reason_for_deletion,
      deleted_by: row.deleted_by,
      id: row.id,
    };
  })
}}

My setup:

What I'm missing here? How can I resolve this?

Are the duplicates shown to a user in a table component or something else?

Yes, duplicates are shown in the table component.
As mentioned I only want to update records for those rows which value in column "rn" is not equal to 1.

You should try to run what you have written in a js query and then pass the {{jsquery.data}} into the array of records to update...

I've tried to add my code to js query and then use data of it as an array of records to update, but still get timeout error.
Do I need to update my code somehow?

I updated my JS code a bit to split data in batches.

const batchSize = 100;
const data = emailDuplicatesTable.data
  .filter(row => row.rn !== 1)
  .map(row => ({
    ...row,
    manually_daleted: moment().format('YYYY-MM-DD HH:mm:ss'),
    reason_for_deletion: 'email_duplicate',
    deleted_by: current_user.firstName,
  }))
  .map(row => ({
    manually_daleted: row.manually_daleted,
    reason_for_deletion: row.reason_for_deletion,
    deleted_by: row.deleted_by,
    id: row.id,
  }));

const queries = data.map(row => () => {
  return deleteDuplicateEmailLeads.trigger({
    additionalScope: { data: [row] }, // Wrap 'data' in an array
  });
});

async function runAllQueries(queries, batchSize) {
  const batches = _.chunk(queries, batchSize);
  const results = [];
  while (batches.length) {
    const batch = batches.shift();
    const result = await Promise.all(batch.map(fn => fn()));
    results.push(...result);
  }
  return results;
}

return runAllQueries(queries, batchSize);

My GUI query deleteDuplicateEmailLeads looks as follow:

It works, but for some batches my query failed:
Screenshot 2023-09-05 at 15.16.05

What might be the issue? How to fix this?

I think the Retool team will be able to help you here as I cannot determine the issue based on the information posted. I can only assume there is an issue with the amount of data you are updating or a concurrency issue but I cannot say why.
@Tess @Kabirdas may be able to assist here.

Hi @Tess and @Kabirdas, would you be able to help me with this?

@mondob there are a few issues here:

  1. if you look closer at your JS query, you will notice that you update all rows, including where rn = 1, to itself. Also JS mapping twice is not efficient (map=table scan) and not necessary
  2. GUI query by PK is (more) meant to update rows which a user changed. A few rows here and there can be updated using the PK as predicate to target the updates (usually you would have the PK field matching a clustered by field in BQ)
  3. The BigQuery logs would have told you what exact SQL is being generated by Retool. Here one update statement per row.
  4. the last error message is probably from BigQuery as you have hit the concurrency quota (updates in parallel)

Long story short, the generated SQLs makes BigQuery processing the updates atomically. Not only inefficient but also costly if you read how BQ is priced (each update is a table copy).

Use standard SQL instead to push the heavy lifting to the DBMS which is way more efficient, cost-effective, ... and straightforward:

UPDATE marts.leads_for_retool
SET manually_daleted (deleted?) = CURRENT_TIMESTAMP(),
    reason_for_deletion = 'email_duplicate',
    deleted_by = {{ current_user.firstName }}
WHERE rn != 1

Lastly, instead of updating duplicates and filtering them out (I assume), you can return only one record per lead based on your criteria e.g. if only the first created_on:

SELECT * FROM dataset.table
QUALIFY ROW_NUMBER() OVER (PARTITION BY main_email (whatever identifies a lead uniquely) ORDER BY created_by) = 1

Hope this helps.

Thanks @yiga2! Using standard sql query really sounds like the most suitable solution here. However, how to make the following query run on BigQuery so that it affects marts.leads_for_retool table?

> UPDATE marts.leads_for_retool
> SET manually_daleted (deleted?) = CURRENT_TIMESTAMP(),
>     reason_for_deletion = 'email_duplicate',
>     deleted_by = {{ current_user.firstName }}
> WHERE rn != 1

To make changes available in Bigquery I need to run GUI mode query, isn't it?

Hi @Kabirdas, @Tess could you help me with this, please?