Best way to scrape website without API

Hello, new member alert.

I am looking for the best way to bring in data from a website that does not have an API.
I have been able to find URLs that return the data I need in JSON format, and I have them saved in a table in my database.

I am struggling to figure out the best way to parse the data and save it into my database.

Any help would be appreciated.

Hi @JYGY

There are tons of ways to scrape a website.
More requirements would help narrow down the possibilities.

Do you need a one-off scrape followed by a manual import of the data?
In that case, there are Chrome extensions (Bardeen is good) that let you configure a scraper visually and save the result as a CSV.

Or do you need a harvester that scrapes automatically on a recurring basis?
ScrapingBee or BrowseAI offer remote browsers you can program.

Hope this helps.

Hey abusedmedia,

The website I am working with stores a large amount of data in JSON and has accessible .json URLs. I have never needed a plugin to scrape this sort of data in the past, but I am not experienced with any of this by any means.

My question was more focused on best practices and methods within Retool for scraping web data.

Should I be using a workflow in JS?

What are the other ways to scrape?

Hi @JYGY

If the data is available through a public JSON file, you don't need to scrape anything; the data is ready to be consumed.

In Retool, you just need a RESTQuery resource with that URL in it.

Depending on your use case, the size of the JSON might be an issue if you need to query it in real time.

Question: what do you need to do with this data? Do you need to put it in a table? What's your use case?

Hope this helps.

THANK YOU!

I do not need the data in real-time. My plan is to use a workflow to call the script 5-10 times per day.

Currently, I run the script manually based on the selected row in table1.

I am getting the following response:

{
  "kits": [
    {
      "component": "Component 1",
      "modelId": 101,
      "id": 10000,
      "modelYear": "2000",
      "platform": "AA"
    },
    {
      "component": "Component 2",
      "modelId": 102,
      "id": 20000,
      "modelYear": "2000",
      "platform": "AB"
    }
  ]
}

This has been shortened from a list that ranges from ~200-1000 "kits."

Currently, I am looking to store an array of the "kits" ids, i.e.:
[10000, 20000]

I would also like to save a copy of the response as JSON.

Glad you solved it!

Nice! You can get the ids array using JavaScript, for example by mapping over the "kits" array in the response and keeping each kit's id.

For saving the response, you'll need to connect a resource to store the data in. If you're familiar with SQL and don't already have an API or database in mind for this use case, you might consider using Retool's Database feature.

Hi @JYGY,

Since you've already found the URLs with the JSON data you need, you're off to a great start. For parsing and saving the data into your database, you could use a combination of Python and libraries like Requests to fetch the data and Pandas to parse and structure it. Here’s a quick example:

Fetch the data using requests:

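import requests
response = requests.get('your_json_url', timeout=30)  # placeholder URL
response.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
data = response.json()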

Parse and structure the data using Pandas:

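import pandas as pd
# The rows sit under the top-level "kits" key, so pass that list rather than the whole response
df = pd.DataFrame(data['kits'])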

Save the data into your database (assuming you're using SQL): You can use a library like SQLAlchemy to connect to your database and save the data:

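from sqlalchemy import create_engine
engine = create_engine('your_database_connection_string')
# if_exists='replace' drops and recreates the table on every run; use 'append' to keep earlier rows
df.to_sql('your_table_name', engine, if_exists='replace', index=False)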
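
If you only need the array of kit ids plus a copy of the raw response, as described earlier in the thread, something along these lines might be enough. This is only a sketch under assumptions: the table name kit_snapshots, its columns, and the helper names are made up for illustration, and an in-memory SQLite database stands in for whatever database you actually use.

```python
import json
import sqlite3

# Sample payload copied from the response shape posted earlier in the thread.
SAMPLE_RESPONSE = {
    "kits": [
        {"component": "Component 1", "modelId": 101, "id": 10000,
         "modelYear": "2000", "platform": "AA"},
        {"component": "Component 2", "modelId": 102, "id": 20000,
         "modelYear": "2000", "platform": "AB"},
    ]
}


def extract_kit_ids(payload):
    """Return the "id" of every entry under the top-level "kits" key."""
    return [kit["id"] for kit in payload.get("kits", [])]


def save_snapshot(conn, source_url, payload):
    """Store the kit ids and the untouched response as JSON text in one row."""
    # Hypothetical table layout, chosen only for this example.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS kit_snapshots (
            source_url   TEXT NOT NULL,
            kit_ids      TEXT NOT NULL,
            raw_response TEXT NOT NULL,
            fetched_at   TEXT DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    conn.execute(
        "INSERT INTO kit_snapshots (source_url, kit_ids, raw_response) "
        "VALUES (?, ?, ?)",
        (source_url, json.dumps(extract_kit_ids(payload)), json.dumps(payload)),
    )
    conn.commit()


# In-memory SQLite keeps the sketch self-contained; swap in your own connection.
conn = sqlite3.connect(":memory:")
save_snapshot(conn, "your_json_url", SAMPLE_RESPONSE)
kit_ids_json, raw_json = conn.execute(
    "SELECT kit_ids, raw_response FROM kit_snapshots"
).fetchone()
print(kit_ids_json)  # prints [10000, 20000]
```

Keeping the raw response next to the ids means you can re-derive other fields later without fetching the URL again.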

If you're doing a lot of scraping without an API, you may also run into rate limits or get blocked. In that case, using a tool like Multilogin could help you stay undetected while gathering data. It’s great for rotating IPs and creating unique browser profiles.
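
For a job that only runs 5-10 times per day, simply spacing requests out and retrying with a growing delay may already keep you under most rate limits. Here is a minimal sketch of that idea. The fetch_all helper and its parameters are hypothetical, and the fetch argument is assumed to be any function you supply that takes a URL and returns parsed data (for example a small wrapper around requests.get).

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, max_retries=3, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests and backing off on errors.

    `fetch` is a caller-supplied function: it takes a URL and returns parsed data.
    """
    results = {}
    for url in urls:
        for attempt in range(max_retries):
            try:
                results[url] = fetch(url)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # give up after the last allowed attempt
                # Double the wait after each failed attempt.
                sleep(delay_seconds * 2 ** attempt)
        sleep(delay_seconds)  # fixed pause after each URL
    return results


# Demo with a fake fetcher that fails once, so no network access is needed.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) == 1:
        raise ConnectionError("simulated rate limit")
    return {"kits": []}


pauses = []  # record the requested pauses instead of actually sleeping
results = fetch_all(["url_a", "url_b"], flaky_fetch, sleep=pauses.append)
```

In real use you would leave sleep at its default so the pauses actually happen, and tune delay_seconds to whatever the site tolerates.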

Hope this helps!