Efficiently upserting CSV data without creating duplicates

I have daily reports in CSV format that I want to upload via a file dropzone in an efficient way, and also protect myself against accidentally uploading duplicate data twice. I can get the former working, but don't have a solid strategy for the latter.

I've successfully hooked up a dropzone to a JavaScript Query that currently runs a bulk update query in GUI mode:

(Side Note 1: Since I'm passing data to this query via additionalScope, the 'array of records to update' box is red, since it doesn't know about the rows until I trigger this query from elsewhere. Kinda annoying, but it still works.)

An upload button triggers this JS query and loops through the files in the dropzone to call the query above:

(async () => {
  const reports = reportDropzone.parsedValue;
  let successCount = 0;
  let reportCount = 0;

  for (const report of Object.values(reports)) {
    const rows = transformReportFields(report);
    reportCount++;

    uploadProgressText.setValue(`Uploading report #${reportCount} with ${rows.length} rows...`);
    await uploadReportRows.trigger({
      additionalScope: { rows },
      onSuccess: (data) => { successCount += rows.length; },
      onFailure: (error) => console.error(error),
    });
  }

  uploadProgressText.value.setValue(`Done uploading. Added ${successCount} rows from ${reportCount} reports`);
})();

(Side note 2: For some reason, when referenced inside a JS query, reportDropzone.parsedValue was not an array of arrays, but a nested object of objects, hence Object.values here. Weird.)

The problem at this stage is that this will insert duplicates. It says 'upsert', but because my CSV files don't have a primary key field, if I run this on the same CSV file, it duplicates every row in the database.

I tried the 'Update a record, or create a new record if it doesn't exist' option, but this didn't work, and the individual queries it created meant it only inserted about a row per second (each CSV file has about 500-2000 rows). And I still got duplicates. Maybe I misunderstood the way the 'filter by' section works, but I identified a combination of 4 fields that are basically a unique constraint and thought this would prevent duplicates:

Unfortunately, since I'm using the beta Retool Database, I can't add a unique constraint on these 4 fields and then manually write SQL statements that use ON CONFLICT, for example.

My next idea was to just generate a UUID based on these 4 fields in Retool. The same 4 strings should always generate the same UUID.

I thought I could just grab the CDN version of this npm module, run its getUuid function, and pass it a composite of those 4 fields, but I couldn't find a compatible CDN link for it (the CDN didn't include a bundled version with all the dependencies).

I also couldn't call the existing uuid library in Retool, since the relevant v4 method, for example, wasn't working for me (even then, I'm not sure it could take in a string to produce a consistent UUID from the same string):

CleanShot 2022-11-29 at 20.38.05

So, at this point, I'm still looking for some guidance on how I could avoid duplicate rows for my use case, or generate a UUID from a string.

My fallback is to just query the database before the bulk insert, read in the current day's worth of data based on the date of the CSV data, and identify existing rows that way before I even trigger uploadReportRows. That ought to work, it just sounds a little clunky.
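That fallback could look something like this (a sketch only: the field names and the shape of the pre-fetched `existingRows` are hypothetical, and the 4-field combination is assumed to uniquely identify a row):

```javascript
// Sketch: drop any incoming row whose 4-field combination already exists
// in rows previously fetched for the same date. Field names are hypothetical.
const keyOf = (r) => [r.date, r.account, r.campaign, r.ad_id].join("|");

function filterNewRows(incomingRows, existingRows) {
  const seen = new Set(existingRows.map(keyOf));
  return incomingRows.filter((r) => !seen.has(keyOf(r)));
}
```

The filtered result could then be passed to uploadReportRows instead of the raw parsed CSV rows.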

(Final side note: for some reason, that final part of my uploadReports JS query from above doesn't seem to run, since the progress text doesn't update and the dropzone doesn't reset:

  uploadProgressText.value.setValue(`Done uploading. Added ${successCount} rows from ${reportCount} reports`);

Not sure if this is some async await thing or what.

What if, instead of creating a UUID, you just concatenate all the unique values into one string and store that as the id?
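A quick sketch of that, with made-up field names:

```javascript
// Sketch: use the concatenated unique fields directly as the primary key.
// Field names are hypothetical; "|" is assumed not to appear in the data.
const row = { date: "2022-11-29", account: "acme", campaign: "fall", ad_id: "123" };
const compositeId = [row.date, row.account, row.campaign, row.ad_id].join("|");
```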

I did just find the option to edit my primary key in the Retool Database settings and you can create a composite key. This seems to be a solution for my main problem.

That dropdown menu under the table name tab was a bit too easy to miss though :slight_smile:

Good idea, and I thought about that too. It seemed a little inefficient to create such a large string for that many rows of data, though, since it could easily end up being several hundred bytes per row.

Absolutely, it’s not pretty…
You might also want to try this npm package; maybe it'll work.

If you’re using Retool Database, which as far as I know doesn't have any data limits as of now, you can have another table just for duplicate checks (using the method I mentioned) and then save the data to the other table for faster future querying.

I saw that one, but I couldn't find a CDN version of it that seemed to work with Retool. JSDelivr and Unpkg both seemed to have versions that didn't bundle the dependencies, so I'm not sure how those can work.

Using a composite key for the primary key works so far, although my JS query still doesn't seem to run the last bit of code to clear the dropzone or do a final text setValue progress update (as listed at the end of my original post).

Hey folks! Just want to hop in to add that you can also run SQL queries from within an app to manage RetoolDB e.g. you can run a regular CREATE TABLE query and define the constraints there as well :slightly_smiling_face:
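For example (the table and column names here are just placeholders), a sketch of what that could look like, with the unique constraint enabling ON CONFLICT upserts:

```sql
-- Hypothetical schema; the UNIQUE constraint is what makes ON CONFLICT work.
CREATE TABLE daily_report_rows (
  id SERIAL PRIMARY KEY,
  report_date DATE,
  account TEXT,
  campaign TEXT,
  ad_id TEXT,
  clicks INT,
  UNIQUE (report_date, account, campaign, ad_id)
);

INSERT INTO daily_report_rows (report_date, account, campaign, ad_id, clicks)
VALUES ('2022-11-29', 'acme', 'fall', '123', 42)
ON CONFLICT (report_date, account, campaign, ad_id)
DO UPDATE SET clicks = EXCLUDED.clicks;
```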

Additionally, the uuid.v4 function should be working:

It looks as though there may be a bug after calling the function in the debug console, definitely something to look into on our end. Thanks for surfacing it! Can you try reloading your app and running it in a JS query and/or transformer to see if that works?

So curiously enough, the first time I just tried this in an existing Retool browser tab that had been open for a while, I successfully called uuid.v4() both in a transformer and the console.

Then a few minutes later, both didn't work: uuid.v4 is not a function in both. Even reloading the app by refreshing the tab didn't make it work again. Only opening the same app in a new tab made it work again. But then I couldn't get it to not work again.

I've worked around the original need to generate UUIDs anyway, but in the workaround I had in mind, I would have needed to generate an identical one from the same string, like https://github.com/Danakt/uuid-by-string/ offers.

@Kabirdas, am I correct in assuming that I can only load libraries from a CDN when they contain all their dependencies? I don't really understand what would consume a CDN URL like this one, since it contains this require("./lib").

If I ever do need to use an npm library that I can't find a valid CDN link for, I imagine I could always create a fully bundled & minified file and copy that code into a JS Query. Not sure what the size limit there is or if that works, but worth a try down the line :slight_smile:

I also solved the 'final side note' problem mentioned earlier. Obviously uploadProgressText.value.setValue( should have been uploadProgressText.setValue(. I've actually run into this a few times: when a JS query has a syntax error and you use 'Run', it'll say Query ran successfully, and unless you look at the console, you won't see the error. 'Preview', on the other hand, will show the error outside of the console.


Ah bizarre :thinking: thanks for that additional context! I've filed a bug report with our dev team and can report back here when there's fix for future reference. Good to hear you have a different solution :slightly_smiling_face:

And yes, your assumption is correct. We recommend folks look for libraries that have UMD builds they can import. I don't know that there's a strict size limit on JS queries. If you're creating a fully bundled & minified copy of your library, it may be worth hosting it on your own CDN, though you can also write your own preloaded JavaScript at both the app and org level.

Also nice catch on the final side note! To your point about JS queries failing silently - that's another issue I can report back on :sweat: this one is known and we can let you know when there's a fix! Thanks for bumping it.


Hey folks! Just want to report here that we're rolling out an experimental feature flag to Cloud orgs that should cause JavaScript queries to fail properly, either with a syntax error or by setting an error handler. If you're interested in trying it out, please feel free to let me know!