Querying from Parquet

aac · November 18, 2022, 4:47am

Is anyone using Retool to interact with data in Parquet format? Is there a workflow you'd recommend?

Is it possible to have Retool interface directly with the Parquet files without importing them into something else? (I may have been reading DuckDB docs lately).

Kabirdas · November 30, 2022, 1:57am

Hey @aac!

I've done some digging on this but haven't yet found a UMD build of a library that handles Parquet files. Admittedly, I don't have experience with the Parquet format - is there a particular use case that you're interested in?

aac · November 30, 2022, 3:55am

Hey @Kabirdas!

We have an S3 bucket with a big batch of Parquet files in it, which gets updated regularly by an external source. I'm new to the format as well, but the goal is to be able to use SQL to query the underlying data, ideally without having to set up a pipeline to import the data from the Parquet files in the bucket into a separate DB every time there's an update.

It was just by chance that I read some DuckDB docs around the same time and saw that DuckDB can work directly with Parquet files. I haven't been able to get it to work locally reading directly from the S3 bucket yet.

Restating: we'd like some of our end-users to be able to write queries against that data, and it'd be rad if we could do that from the source files instead of needing to import them into a separate DB. That's pretty much the use case.
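For reference, here is a minimal sketch of the DuckDB route described above, assuming the `duckdb` Python package and its `httpfs` extension. The bucket name, prefix, region, and column names are placeholders, not values from our setup, and this has not been run against a real bucket. The helper only builds the SQL string; `run_query` shows how it might be executed.

```python
from typing import List, Optional

# Statements DuckDB needs before it can read s3:// paths.
# The region is an assumed placeholder; credentials could be supplied the
# same way with SET s3_access_key_id / SET s3_secret_access_key.
SETUP_SQL: List[str] = [
    "INSTALL httpfs;",
    "LOAD httpfs;",
    "SET s3_region = 'us-east-1';",
]


def s3_parquet_query(
    bucket: str,
    prefix: str = "",
    columns: str = "*",
    where: Optional[str] = None,
    limit: Optional[int] = None,
) -> str:
    """Build a SELECT over every Parquet file directly under s3://bucket/prefix/."""
    parts = [bucket]
    cleaned = prefix.strip("/")
    if cleaned:
        parts.append(cleaned)
    path = "s3://" + "/".join(parts) + "/*.parquet"

    sql = f"SELECT {columns} FROM read_parquet('{path}')"
    if where:
        sql += f" WHERE {where}"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


def run_query(sql: str):
    """Execute the query in an in-memory DuckDB database (needs `pip install duckdb`)."""
    import duckdb  # third-party; imported lazily so the builder works without it

    con = duckdb.connect()
    for stmt in SETUP_SQL:
        con.execute(stmt)
    return con.execute(sql).fetchall()


if __name__ == "__main__":
    # Hypothetical bucket and filter, for illustration only.
    print(s3_parquet_query("my-bucket", "exports/", where="year = 2022", limit=10))
```

Since the files are read in place on every query, nothing has to be re-imported when the external source updates the bucket. Note that the `where` and `columns` arguments are interpolated as-is, so this is not safe for untrusted end-user input without further validation.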

Kabirdas · November 30, 2022, 7:10pm

Ahh, interesting!

It looks as though Athena has some methods for querying Parquet data stored in S3 (see this blog). If that fits, Retool has an Athena integration you might be able to use. Is that something you've checked out by any chance?
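As a rough sketch of what the Athena side might look like: Athena queries Parquet in S3 through an external table whose `LOCATION` points at the folder holding the files. The helper below just builds that DDL string. The database, table, column names, types, and bucket path are all hypothetical, so treat it as a starting point under those assumptions.

```python
from typing import Dict


def athena_parquet_ddl(
    database: str,
    table: str,
    columns: Dict[str, str],
    location: str,
) -> str:
    """Build a CREATE EXTERNAL TABLE statement for Parquet files under `location`."""
    # Athena expects LOCATION to be a folder, so make sure it ends with a slash.
    if not location.endswith("/"):
        location += "/"
    cols = ",\n".join(f"  `{name}` {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{cols}\n"
        ")\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical schema and path, for illustration only.
    print(
        athena_parquet_ddl(
            "analytics",
            "events",
            {"id": "bigint", "name": "string"},
            "s3://my-bucket/exports",
        )
    )
```

Once a table like this exists, it can be queried with ordinary SQL, and new files dropped into the same folder are picked up without a separate import step.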

aac · December 1, 2022, 5:27am

I'd seen the Athena piece, but didn't realize there was a Retool integration. I'll look into it and circle back. Thanks.