Best way to scrape website without API

JYGY · October 31, 2023, 6:12pm

Hello, new member alert.

I am looking for the best way to bring in data from a website that does not have an API.
I have been able to find URLs that return the data I need in JSON format, and I have them saved in a table in my database.

I am struggling to figure out the best way to parse the data and save it into my database.

Any help would be appreciated.

abusedmedia · October 31, 2023, 8:18pm

Hi @JYGY

There are tons of ways to scrape a website.
More requirements would help narrow down the possibilities.

Do you need a one-off scrape, with the data imported manually?
In that case, there are Chrome extensions (Bardeen is good) that let you configure a scraper visually and save a CSV from it.

Or do you need a harvester that performs more automated scraping?
Scrapingbee or BrowseAI offer remote browsers you can program.

Hope this helps.

JYGY · November 6, 2023, 5:01pm

Hey abusedmedia,

The website I am working with stores a large amount of data in JSON and has accessible .json URLs. I have never needed a plugin to scrape this sort of data in the past, but I am not experienced with any of this by any means.

JYGY · November 6, 2023, 5:05pm

My question was more focused on best practices and methods within Retool for scraping web data.

Should I be using a workflow or JS?

What are the other ways to scrape?

abusedmedia · November 6, 2023, 5:18pm

Hi @JYGY

If the data is available through a public JSON file, you don't need to scrape anything; the data is ready to be consumed.

In Retool you just need a RESTQuery resource with that URL in it.

Depending on your use case, the size of the JSON might be an issue if you need to query it in real time.

Question: what do you need to do with this data? Do you need to put it in a table? What's your use case?

Hope this helps.

JYGY · November 7, 2023, 1:04am

THANK YOU!

I do not need the data in real time. My plan is to use a workflow to call the script 5-10 times per day.

Currently I run the script manually, based on the selected row in table1.

I am getting the following response:

{
  "kits": [
    {
      "component": "Component 1",
      "modelId": 101,
      "id": 10000,
      "modelYear": "2000",
      "platform": "AA"
    },
    {
      "component": "Component 2",
      "modelId": 102,
      "id": 20000,
      "modelYear": "2000",
      "platform": "AB"
    }
  ]
}

This has been shortened from a list that ranges from ~200 to 1000 "kits."

Currently I am looking to store an array of the "kits" ids, i.e.:
[10000, 20000]

I would also like to save a copy of the response as JSON.

abusedmedia · November 7, 2023, 8:09pm

Glad you solved it!

Tess · November 17, 2023, 11:38pm

Nice! You can get the array of ids using JavaScript:
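A minimal sketch of one way to do it, assuming the response has the shape JYGY posted. The `response` object below is sample data; in a Retool JS query it would more likely come from the REST query's data (for example `query1.data`, where `query1` is a hypothetical query name).

```javascript
// Sample response with the same shape as the one posted in this thread.
// In Retool this would typically be the REST query's data instead
// (e.g. query1.data, where `query1` is a hypothetical query name).
const response = {
  kits: [
    { component: "Component 1", modelId: 101, id: 10000, modelYear: "2000", platform: "AA" },
    { component: "Component 2", modelId: 102, id: 20000, modelYear: "2000", platform: "AB" },
  ],
};

// Fall back to an empty list if "kits" is missing, then pull out each id.
const kitIds = (response.kits ?? []).map((kit) => kit.id);

console.log(kitIds); // [ 10000, 20000 ]
```

The `?? []` fallback is only there so the code does not throw if a response ever comes back without a "kits" key.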

For saving the response, you'll need to connect a resource to save the data. If you're familiar with SQL & don't already have an API or database in mind for this use case, you might consider using Retool's Database feature.
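As a rough sketch of what the saved record could look like, the snippet below shapes one row containing both the ids and a copy of the raw response. The helper name `buildSnapshotRow` and the column names `kit_ids`, `raw_response`, and `fetched_at` are assumptions for illustration, not anything Retool requires; the resulting object would then be passed to whatever insert query you set up against your resource.

```javascript
// Hypothetical helper: builds one row for a table with the assumed columns
// kit_ids, raw_response, and fetched_at.
function buildSnapshotRow(response, fetchedAt = new Date()) {
  // Treat a missing or malformed "kits" field as an empty list.
  const kits = Array.isArray(response?.kits) ? response.kits : [];
  return {
    kit_ids: kits.map((kit) => kit.id),       // e.g. [10000, 20000]
    raw_response: JSON.stringify(response),   // copy of the full response as JSON text
    fetched_at: fetchedAt.toISOString(),      // when this snapshot was taken
  };
}

// Usage with a trimmed-down version of the response from this thread.
const row = buildSnapshotRow(
  { kits: [{ id: 10000 }, { id: 20000 }] },
  new Date("2023-11-17T00:00:00Z")
);
console.log(row);
```

Storing the raw response as text alongside the ids keeps the original data around in case you later need fields other than `id`.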