I am relatively new to Python and am trying to automate some analysis. I’m looking to write code that will go to this website, Data Product Details, and download the zip files, then save a specific Excel file from each zip file to a specific file path on my computer. Does anyone have experience with this process in general?
I suggest you load the page with requests, parse the site and extract the download URLs with beautifulsoup, and download it with requests. To extract the Excel file, take a look at the Python built-in module zipfile, and move it with os.rename or os.replace.
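A minimal sketch of the last two steps (zipfile, then os.replace), assuming the zip has already been downloaded into memory as bytes. The function name `extract_excel`, the `.xlsx` suffix and the destination folder are placeholders for illustration, not anything specific to the site:

```python
import os
import tempfile
from io import BytesIO
from pathlib import Path
from zipfile import ZipFile


def extract_excel(zip_bytes: bytes, dest_dir: str, suffix: str = ".xlsx") -> list[str]:
    """Extract every member ending in `suffix` from an in-memory zip into dest_dir.

    Returns the paths of the files that were written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # os.replace fails if the folder is missing
    saved = []
    with ZipFile(BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(suffix):
                continue
            # Extract into a temporary folder next to the destination, then move the file
            with tempfile.TemporaryDirectory(dir=dest) as tmp:
                extracted = archive.extract(info, path=tmp)
                target = dest / Path(info.filename).name  # drop folders stored inside the zip
                os.replace(extracted, target)  # overwrites an existing file of the same name
            saved.append(str(target))
    return saved
```

With requests this would be called as something like `extract_excel(response.content, "C:/some/folder")`. os.replace is used here rather than os.rename because it overwrites an existing target on every platform.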
Also, I can’t take a look at the site; I only get an “Access Denied” error.
Thank you, Paul. I have the following code based on your guidance, but I don’t see the zip file or the Excel file in the folder that I specified. Any ideas?
import os
import requests
from bs4 import BeautifulSoup
from zipfile import ZipFile
from io import BytesIO

# Step 1: Load the webpage
url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP3-965-ER'
response = requests.get(url)
response.raise_for_status()

# Step 2: Parse the page to extract download URLs
soup = BeautifulSoup(response.content, 'html.parser')
download_links = soup.find_all('a', href=True)
excel_urls = [link['href'] for link in download_links if link['href'].endswith('.zip')]

# Step 3: Download the Excel files
for excel_url in excel_urls:
    excel_response = requests.get(excel_url)
    excel_response.raise_for_status()

    # Step 4: Extract the Excel file from the zip archive
    with ZipFile(BytesIO(excel_response.content)) as zip_file:
        for zip_info in zip_file.infolist():
            if zip_info.filename.endswith('.xlsx'):
                zip_file.extract(zip_info, path='.')
                extracted_file_path = zip_info.filename

                # Step 5: Move the file to the desired location
                new_location = os.path.join('desired_directory', zip_info.filename)
                os.rename(extracted_file_path, new_location)
                print(f"Moved {extracted_file_path} to {new_location}")

# Replace 'desired_directory' with the actual directory you want to move the files to.
Also, the link is a public website but you can try this link since it has the same structure.
https://www.ercot.com/mp/data-products/data-product-details?id=NP4-188-CD
This page is blocked by a security rule.
If the page uses JavaScript to build itself, you will need to use the selenium package to get the page contents.
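A rough sketch of what that could look like, assuming selenium 4 and a local Chrome install. The function name `fetch_rendered_html` and the choice to wait for any `<a>` tag are my own assumptions, not something confirmed for this site:

```python
def fetch_rendered_html(url: str, wait_seconds: int = 15) -> str:
    """Load `url` in headless Chrome and return the HTML after scripts have run."""
    # Imported here so the rest of a script still runs without selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: the download links are <a> tags that appear once the page is built
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.TAG_NAME, "a"))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the wait times out
```

The returned string can be passed to `BeautifulSoup(html, 'html.parser')` in place of `response.content`. If the site blocks automated browsers as well, this will not help.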
Thanks, @barry-scott. I will try adjusting the code to use selenium.
Also make sure that the site’s robots.txt file will let you download certain things via a web scraper. Their site may check the program’s user-agent string to see if it’s an actual browser or something else. Maybe you can download a file if you know the exact path, but cannot list the directory contents. This is for security purposes.
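For the robots.txt part, the standard library has a parser. A small sketch, using a made-up robots.txt and a made-up user-agent name (`my-script`) rather than the real site's rules, which I have not seen:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; the real file lives at <site>/robots.txt
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-script", "https://example.com/public/data.zip"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-script", "https://example.com/private/data.zip"))
```

To check a live site instead, `RobotFileParser` also has `set_url()` and `read()`, which download the file for you. Note that robots.txt only states what the site asks of scrapers; the “Access Denied” response may come from a separate rule on the server.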
I see that one of the links to a zip file appears to point to a program that requires a parameter, rather than being a direct link to the file: https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=1006530342. You may have to adjust your program to account for this lack of a direct link, since a check like endswith('.zip') will not match it.
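One way to account for that, as a sketch: pick links by their query parameter instead of by file extension, and resolve relative links against the page URL. The function name `find_download_urls` is made up, and I am assuming every download link carries `doclookupId` like the one above:

```python
from urllib.parse import parse_qs, urljoin, urlparse


def find_download_urls(page_url: str, hrefs: list[str], param: str = "doclookupId") -> list[str]:
    """Return absolute URLs for the hrefs that carry the `param` query parameter."""
    urls = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolves relative links such as "/misdownload/..."
        query = parse_qs(urlparse(absolute).query)
        if param in query:
            urls.append(absolute)
    return urls
```

The `hrefs` list would come from the page, for example `[link['href'] for link in soup.find_all('a', href=True)]`. Because such a URL does not contain a file name, the name to save under has to come from somewhere else, possibly the response's Content-Disposition header if the server sends one.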
Since I haven’t done this before that’s all I can suggest. But getting this to work could be useful as there are a lot of files on that page.
Thank you, @c-rob. I didn’t think of the security aspect. Appreciate the comment!