Web scraping and downloading zip files

energy_py · May 22, 2024, 5:21pm

I am relatively new to Python and am trying to automate some analysis. I’m looking to write code that will go to this website, Data Product Details, and download the zip files, then save a specific Excel file from each zip file to a specific file path on my computer. Does anyone have experience with this process in general?

DerSchinken · May 22, 2024, 7:44pm

I suggest you load the page with requests, parse it and extract the download URLs with BeautifulSoup, and then download the files with requests. To extract the Excel file, take a look at Python’s built-in zipfile module, and move the file with os.rename or os.replace.

Also, I can’t take a look at the site; I only get an “Access Denied” error.
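The extract-and-move step described above can be sketched on its own, independent of how the zip bytes were obtained. This is a minimal illustration, not a definitive implementation: the function name `save_excel_from_zip` is hypothetical, and it uses `shutil.move` rather than `os.rename` because `os.rename` can fail when the source and destination are on different filesystems.

```python
import shutil
import tempfile
from io import BytesIO
from pathlib import Path
from zipfile import ZipFile


def save_excel_from_zip(zip_bytes: bytes, target_dir: str) -> list[str]:
    """Extract every .xlsx member of an in-memory zip into target_dir.

    Returns the paths of the files that were written.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    saved = []
    with ZipFile(BytesIO(zip_bytes)) as archive, tempfile.TemporaryDirectory() as tmp:
        for member in archive.infolist():
            if member.is_dir() or not member.filename.lower().endswith(".xlsx"):
                continue
            # Extract into a scratch directory, then move the file into place.
            extracted = archive.extract(member, path=tmp)
            # Drop any folder structure stored inside the archive.
            destination = target / Path(member.filename).name
            shutil.move(extracted, str(destination))
            saved.append(str(destination))
    return saved
```

With requests, the bytes would typically come from `requests.get(zip_url).content`, where `zip_url` is whatever download URL was found on the page.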

energy_py · May 22, 2024, 8:38pm

Thank you, Paul. I have the following code based on your guidance, but I don’t see the zip file or Excel file in the folder that I specified. Any ideas?

import os
import requests
from bs4 import BeautifulSoup
from io import BytesIO
from urllib.parse import urljoin
from zipfile import ZipFile

# Step 1: Load the webpage
url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP3-965-ER'
response = requests.get(url)
response.raise_for_status()

# Step 2: Parse the page to extract download URLs
soup = BeautifulSoup(response.content, 'html.parser')
download_links = soup.find_all('a', href=True)
# Resolve relative hrefs against the page URL so requests can fetch them
zip_urls = [
    urljoin(url, link['href'])
    for link in download_links
    if link['href'].endswith('.zip')
]
print(f"Found {len(zip_urls)} zip link(s)")

# Replace 'desired_directory' with the actual directory you want to move the files to.
target_dir = 'desired_directory'
os.makedirs(target_dir, exist_ok=True)

# Step 3: Download the zip files
for zip_url in zip_urls:
    zip_response = requests.get(zip_url)
    zip_response.raise_for_status()

    # Step 4: Extract the Excel file from the zip archive
    with ZipFile(BytesIO(zip_response.content)) as zip_file:
        for zip_info in zip_file.infolist():
            if zip_info.filename.endswith('.xlsx'):
                # extract() returns the path of the file it created
                extracted_file_path = zip_file.extract(zip_info, path='.')

                # Step 5: Move the file to the desired location
                new_location = os.path.join(target_dir, os.path.basename(zip_info.filename))
                os.replace(extracted_file_path, new_location)
                print(f"Moved {extracted_file_path} to {new_location}")

Also, the link is to a public website, but you can try this link instead since it has the same structure:
https://www.ercot.com/mp/data-products/data-product-details?id=NP4-188-CD

barry-scott · May 22, 2024, 9:10pm

This page is blocked by a security rule.

If the page uses JavaScript to build itself, you will need to use the Selenium package to get the page contents.
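A rough sketch of the Selenium approach, assuming Chrome and a matching driver are available and that the download links appear as ordinary `<a>` elements once the scripts have run (neither is confirmed for this site). The helper names `fetch_rendered_html` and `extract_links` are hypothetical. Link extraction uses the standard-library `html.parser` here so that it works without extra packages; BeautifulSoup would do the same job. A headless browser may still hit the same security rule as a plain request.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html: str, base_url: str) -> list[str]:
    """Return absolute URLs for all links found in html."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]


def fetch_rendered_html(url: str, timeout: float = 15.0) -> str:
    """Load url in headless Chrome and return the HTML after scripts have run."""
    # Imported here so extract_links works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: at least one link is present once the page has finished
        # building itself; adjust the selector to match the real page.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "a[href]"))
        )
        return driver.page_source
    finally:
        driver.quit()
```

Usage would look like `links = extract_links(fetch_rendered_html(url), url)`, after which the links can be filtered and downloaded as before.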

energy_py · May 23, 2024, 2:10pm

Thanks, @barry-scott. I will try adjusting the code to use Selenium.

c-rob · May 25, 2024, 6:58pm

Also make sure that the site’s robots.txt file allows you to download these files with a web scraper. The site may also check the program’s User-Agent string to see whether it’s an actual browser or something else. You may be able to download a file if you know its exact path but be unable to list the directory contents. This is for security purposes.
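One way to check robots.txt from Python is the standard-library `urllib.robotparser` module. The sketch below is illustrative only: the `USER_AGENT` value and the helper names are hypothetical, and what this particular site's robots.txt contains is not known here.

```python
from urllib.parse import urljoin
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

# Hypothetical identifier for the script; choose one that describes your tool.
USER_AGENT = "energy-data-downloader/0.1"


def parse_robots(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


def fetch_robots(page_url: str, user_agent: str = USER_AGENT) -> RobotFileParser:
    """Download /robots.txt from the same host as page_url and parse it."""
    request = Request(
        urljoin(page_url, "/robots.txt"),
        headers={"User-Agent": user_agent},
    )
    with urlopen(request, timeout=30) as response:
        return parse_robots(response.read().decode("utf-8", errors="replace"))
```

The same User-Agent string can be passed to requests with `requests.get(url, headers={"User-Agent": USER_AGENT})`, so the site sees a consistent identifier for both the robots.txt check and the downloads.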

I see one of the links to a zip file appears to be a program that requires a parameter, and is not a direct link: https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId=1006530342. You may have to adjust your program to account for this lack of direct link.

Since I haven’t done this before, that’s all I can suggest. But getting this to work could be useful, as there are a lot of files on that page.
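Because a link like the one above does not end in `.zip`, a filter based on the file extension will skip it. A minimal sketch of matching on the servlet path and its `doclookupId` parameter instead is shown below. The helper names are hypothetical, and it is only an assumption that the server reports the real file name in a `Content-Disposition` header, so a fallback name is used when it does not.

```python
from email.message import Message
from urllib.parse import parse_qs, urlparse


def doc_lookup_id(href: str):
    """Return the doclookupId value if href looks like a mirDownload link, else None."""
    parsed = urlparse(href)
    if not parsed.path.endswith("/mirDownload"):
        return None
    values = parse_qs(parsed.query).get("doclookupId")
    return values[0] if values else None


def filename_from_header(content_disposition, fallback: str) -> str:
    """Read the file name from a Content-Disposition header value, if there is one."""
    if not content_disposition:
        return fallback
    message = Message()
    message["Content-Disposition"] = content_disposition
    return message.get_filename() or fallback
```

With requests, the header value would come from `response.headers.get("Content-Disposition")`, and a fallback such as `f"{doc_id}.zip"` keeps the download usable if the header is absent.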

energy_py · June 11, 2024, 4:18pm

Thank you, @c-rob. I didn’t think of the security aspect. Appreciate the comment!