Python Script Not Parsing Website Data Correctly

joeroot · February 13, 2024, 7:38pm

I’m encountering an issue with a Python script I’ve developed to parse data from a website. The script is designed to scrape specific information from the target website’s HTML, but it isn’t retrieving the data I expect.

Here’s a simplified version of my Python script:

import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extracting specific data from the website
        # Example:
        data = soup.find('div', class_='example-class').text
        return data
    else:
        print("Failed to retrieve data from the website.")
        return None

# Example usage
website_url = "https://www.example.com"
scraped_data = scrape_website(website_url)
print(scraped_data)

In this script, I’m using the requests library to send an HTTP GET request to the specified URL and BeautifulSoup to parse the HTML content of the response. Then, I’m attempting to extract specific data from the website, such as text inside a div with a particular class.

However, when I run the script, it either returns None or doesn’t retrieve the expected data. I’ve checked the HTML structure of the target website, and it seems like the class or element I’m targeting exists.

Could someone please help me identify what might be causing the script to fail in parsing the website data correctly and how I can modify it to retrieve the desired information?

facelessuser · February 13, 2024, 7:49pm

Your example doesn’t work because example.com does not have a div with that class.
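For reference, when `find` has nothing to match it returns `None`, and calling `.text` on that raises `AttributeError` instead of giving back a value. A minimal sketch, assuming a made-up HTML string in place of the real page and a hypothetical helper named `extract_text`:

```python
from bs4 import BeautifulSoup

# Stand-in HTML (an assumption, not the real page): a div without the target class.
HTML = "<html><body><div><h1>Example Domain</h1></div></body></html>"


def extract_text(html, class_name):
    """Return the text of the first matching div, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find("div", class_=class_name)
    if element is None:
        # find() returns None on no match; .text on None would raise AttributeError.
        return None
    return element.get_text(strip=True)


result = extract_text(HTML, "example-class")
print(result)
```

Checking for `None` before reading the text at least separates "the element was not found" from "the request failed".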

facelessuser · February 13, 2024, 7:51pm

To be honest, these types of questions are better suited to the BeautifulSoup group: https://groups.google.com/g/beautifulsoup.

c-rob · February 14, 2024, 12:14pm

Welcome to the forums!

To help you, we actually need the real website URL, because the problem may depend on the website behaving oddly or returning an error.

Also, do you get any error back from the website? Sometimes a site may be down for maintenance, Cloudflare is borked, or the website is just plain busy. Can you add the error number and description to this error message? That would be helpful to show if there’s a specific error.

    else:  # What's the error here, if any?
        print("Failed to retrieve data from the website.")
        return None

When you do get some content back, what is it? Is it an error page? A Cloudflare error page?

kknechtel · February 15, 2024, 5:30am

What exactly is “the expected data”, and according to your understanding of the code, why should it be “expected”? In particular: it seems like you expect that soup.find('div', class_='example-class').text should produce the desired result.

When you retrieve the webpage with JavaScript disabled (or, for example, by using the Requests library, or a command-line tool such as curl, to download it into a file) and look at the page source, do you see the expected data? That is: can you find an element like <div class="example-class">...</div>, where the plain-text part of the ... corresponds to what you want?
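One way to run that check from Python is to save exactly what Requests receives (no JavaScript runs) and count the matching elements in it. A rough sketch; the function names, the output file name, and the timeout are placeholders I made up:

```python
import requests
from bs4 import BeautifulSoup


def count_matches(html, tag, class_name):
    """Count elements with the given tag and class in the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(tag, class_=class_name))


def dump_and_check(url, path="page_source.html"):
    """Save the HTML as Requests sees it, then count the target divs."""
    response = requests.get(url, timeout=10)
    with open(path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return count_matches(response.text, "div", "example-class")
```

If `dump_and_check` reports zero while the browser's inspector shows the element, the element is probably added after the initial page load and is not in the HTML that Requests downloads.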

joeroot · February 17, 2024, 4:45pm

Thank you for your response. The “expected data” refers to the text content inside a <div> element with the class name “example-class” on the target website. Upon inspecting the HTML source of the webpage without JavaScript enabled or using tools like curl, I can confirm that the <div class="example-class">...</div> structure exists, and the text content within this <div> corresponds to the data I’m attempting to scrape.

However, despite the presence of this structure in the HTML source, the script is either returning None or not retrieving the expected data when executed. Therefore, I’m seeking assistance in understanding why the script fails to parse the website data correctly, especially considering that the targeted element does exist in the HTML source.

kknechtel · February 17, 2024, 10:57pm

In order to diagnose further, we would need a proper example. For a problem like this, that would include a static excerpt of HTML data (a hard-coded HTML string in the program, instead of a Requests call to retrieve it) that demonstrates the problem (something short but still representative of what appears to go wrong).
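The shape of such an example would be roughly the following. The HTML here is invented purely for illustration; the real excerpt should be copied from the page source that the script actually receives.

```python
from bs4 import BeautifulSoup

# Hard-coded excerpt standing in for the real page (contents are invented).
HTML = """
<html>
  <body>
    <div class="example-class">Expected text</div>
  </body>
</html>
"""

soup = BeautifulSoup(HTML, "html.parser")
element = soup.find("div", class_="example-class")
# Guard against find() returning None when nothing matches.
data = element.text if element is not None else None
print(data)
```

If a version of this built from the real excerpt prints the right text, the parsing code is fine and the difference lies in what the live request returns.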