Trying to scrape and download zipfiles

millerdrax · October 4, 2023, 9:22pm

I have tried a million code snippets and none of them work for scraping this website, downloading the CSV zip files, and unzipping them. Any help or direction would be appreciated!

https://www.ercot.com/mp/data-products/data-product-details?id=NP4-188-CD

kknechtel · October 5, 2023, 5:06pm

How did you get the code? Did you understand it?

How did you try to use it?

Exactly what happened when you tried using it? “It doesn’t work” does not describe a problem.

That is many separate things to do, that have nothing to do with each other. So, which part failed?

millerdrax · October 5, 2023, 6:15pm

I switched and used this code

# importing necessary modules
import requests, zipfile
from io import BytesIO

print('Downloading started')

# Defining the zip file URL 2022
url = 'https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=886625668'

# Split URL to get the file name
filename = url.split('/')[-1]

# Downloading the file by sending the request to the URL
req = requests.get(url)
print('Downloading Completed')

# extracting the zip file contents
# (named "archive" so the zipfile module is not overwritten)
archive = zipfile.ZipFile(BytesIO(req.content))
archive.extractall('J:/Taylor/ERCOT-LNG-ABHI/Capacity Clearing Prices/2022')

It worked. I read the tutorial, so I understand what each part of the code is doing.

millerdrax · October 5, 2023, 6:21pm

I’m now trying to run the code so I can download multiple zipfiles rather than running it for every file address
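One way to do that is to wrap the working single-file code in a function and loop over a list of URLs. This is only a sketch under a few assumptions: the helper names `extract_zip_bytes` and `download_and_extract` are made up for illustration, the list of URLs still has to come from somewhere, and `requests` is imported inside the function only so that the extraction helper can be tried on its own.

```python
import zipfile
from io import BytesIO
from pathlib import Path


def extract_zip_bytes(data, dest_dir):
    """Extract an in-memory zip archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(BytesIO(data)) as archive:
        archive.extractall(dest)
        return archive.namelist()


def download_and_extract(urls, dest_dir):
    """Download every URL in urls and unzip each response into dest_dir."""
    # Third-party package; imported here so extract_zip_bytes works without it.
    import requests

    extracted = []
    for url in urls:
        resp = requests.get(url, timeout=60)
        # Stop on HTTP errors instead of trying to unzip an error page.
        resp.raise_for_status()
        extracted.extend(extract_zip_bytes(resp.content, dest_dir))
    return extracted
```

Calling `download_and_extract([url_2022, url_2023], 'some/folder')` with the `mirDownload` addresses would then replace running the script once per file (`url_2022`, `url_2023` and the folder are placeholders).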

millerdrax · October 5, 2023, 6:26pm

I tried to run this script to scrape the file links but the file links didn’t populate, just the other links on the page

import requests
from bs4 import BeautifulSoup

url = 'https://www.geeksforgeeks.org/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')

urls = []
for link in soup.find_all('a'):
    print(link.get('href'))

hansgeunsmeyer · October 5, 2023, 7:55pm

I assume that you are now referring to the original url with the zips.

requests.get only downloads the static HTML page, so it does not include anything that is generated client-side in a browser by running the page's JavaScript. The returned reqs.text is therefore not the same as what you see in a browser or web inspector: none of the table content is there, for instance. If you want to do this in Python, you need more than BeautifulSoup. You basically need something that either acts as a full-fledged browser or can drive your current browser. Apparently Selenium can do this. I cannot really help you further with that, however, since I have never worked with that package.
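A quick way to check this is to look for the zip links in the HTML that requests actually receives. The sketch below uses the standard library's `HTMLParser` rather than BeautifulSoup so that it runs without extra packages. `find_links` is an illustrative helper, the sample HTML is hand-written, and using `mirDownload` as the marker is an assumption based on the download URL posted earlier in the thread.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_links(html, marker):
    """Return the hrefs in html that contain marker."""
    collector = _LinkCollector()
    collector.feed(html)
    return [h for h in collector.hrefs if marker in h]


# Hand-written stand-in for a static page whose table is filled in later by JavaScript.
static_html = '<a href="/about">About</a><div id="table"></div>'
print(find_links(static_html, "mirDownload"))  # prints []
```

Passing `reqs.text` from the real page in place of `static_html` would show whether any download links are present before JavaScript runs. An empty list would be consistent with the explanation above.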

millerdrax · October 9, 2023, 3:37pm

Thank you! I will look into Selenium
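For anyone following along, a minimal sketch of what the Selenium route might look like is below. It is untested against the ERCOT page and makes several assumptions: Chrome is installed locally, the zip links contain `mirDownload` (as in the URL posted earlier), and the helper names `is_zip_link` and `collect_zip_links` are made up for illustration.

```python
def is_zip_link(href, marker="mirDownload"):
    """True if href looks like a zip download link (the marker is an assumption)."""
    return bool(href) and marker in href


def collect_zip_links(page_url, wait_seconds=30):
    """Open page_url in a real browser and return the download links found after rendering."""
    # Imported here so is_zip_link can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes a local Chrome installation
    try:
        driver.get(page_url)
        # Wait until the page's JavaScript has rendered at least one matching link.
        WebDriverWait(driver, wait_seconds).until(
            lambda d: any(
                is_zip_link(a.get_attribute("href"))
                for a in d.find_elements(By.TAG_NAME, "a")
            )
        )
        hrefs = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
        return [h for h in hrefs if is_zip_link(h)]
    finally:
        driver.quit()  # always close the browser, even if the wait times out
```

The returned list could then be fed to the same requests-and-zipfile code that already works for a single URL.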

millerdrax · October 9, 2023, 3:37pm

I will try R as well

jhanarato · October 10, 2023, 12:48am

That would be “or” not “r”.

Can be done though: