Automated scraping script

I’m trying to create a Python script that scans a website every morning for all of its pages and reports only the pages published since the previous day’s scan. I’ve used a script generated by Bard as a starting point and imported the necessary modules, but it’s not working.

Here’s the script:

import schedule
import time
import logging
import requests
from bs4 import BeautifulSoup
import smtplib
from email.mime.text import MIMEText


# Define website URL and email details
website_url = "https://example.com/"
email_sender = "your_email@example.com"
email_receiver = "your_receiver@example.com"
smtp_server = "smtp.example.com"
smtp_port = 587
email_username = "your_email_username"
email_password = "your_email_password"

# Initialize logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
handler = logging.FileHandler("scraper.log")
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

# Initialize variables
seen_urls = set()
new_pages = []


def scrape_website():
    global new_pages
    global seen_urls

    # Log attempt
    logger.info("Scrape attempt started.")

    try:
        response = requests.get(website_url)
    except requests.exceptions.RequestException as e:
        logger.error(f"Error during request: {e}")
        return

    # Parse HTML content
    try:
        soup = BeautifulSoup(response.content, "html.parser")
    except Exception as e:
        logger.error(f"Error parsing HTML: {e}")
        return

    # Extract and filter new pages with meta titles
    for link in soup.find_all("a", href=True):
        url = link.get("href")
        # <a> tags have no <title> child; use the title attribute or the link text
        title = link.get("title") or link.get_text(strip=True)

        if url not in seen_urls:
            seen_urls.add(url)
            new_pages.append({"url": url, "title": title})

    # Log successful scrape
    logger.info(f"{len(new_pages)} new pages found.")


def send_email():
    global new_pages

    if not new_pages:
        return

    # Log email attempt
    logger.info("Email sending attempt started.")

    try:
        # Construct email content
        email_body = "New pages found:\n\n"
        for page in new_pages:
            email_body += f"- **{page['title']}** ({page['url']})\n"

        # Create email message
        message = MIMEText(email_body)
        message["Subject"] = f"New pages found on {website_url}"
        message["From"] = email_sender
        message["To"] = email_receiver

        # Send email
        with smtplib.SMTP(smtp_server, smtp_port) as server:
            server.starttls()
            server.login(email_username, email_password)
            server.sendmail(email_sender, email_receiver, message.as_string())

        # Log email success
        logger.info("Email successfully sent.")
    except Exception as e:
        logger.error(f"Error sending email: {e}")

    # Clear new pages list
    new_pages = []


# Schedule scraping and email tasks
schedule.every().day.at("05:00").do(scrape_website)
schedule.every().day.at("05:10").do(send_email)

# Start processing tasks
while True:
    schedule.run_pending()
    time.sleep(60)

Who is that?

Exactly what happened when you tried it, and how is that different from what should happen?

It means the OP asked an AI to generate code, didn’t understand the code, and now expects us to fix it. Not worth bothering with. Eventually people will figure out that AI-generated code sucks and they’ll go back to actually writing code themselves, at which point it’ll be worth helping them.

Hi - “it’s not working” is not very specific. Do you have a specific request or question?
What did you try to do to make it work and what did you run into?

I’ve no idea how good or bad the current Bard is, but ChatGPT (3.5) usually comes up with correct code for this kind of very run-of-the-mill task (I’ve been testing its limitations for a while now). When the code doesn’t work, you have two options: either learn more Python and more debugging yourself, or ask ChatGPT (or Bard or Copilot) follow-up questions, pointing out the failures. It usually does come up with something sensible then, though it may also start repeating itself, that is, repeating the same incorrect code, in which case you know it still sucks when you need it most :slight_smile:

Just for kicks, I asked ChatGPT 3.5 the following:

Can you inspect the following Python code and comment on it? Somebody reported that “it is not working”, though it’s unclear what they meant by that.
(then I quoted all the code)

It came up with very pertinent, correct comments, analyzing both the intent and the structure of the code. The comments would, imo, definitely be helpful to anyone who is a Python novice. It gave a correct list of potential general issues. And it concluded, also very much to the point, with:

The user reporting the issue should provide more specific details about the error or undesired behavior encountered for a more accurate diagnosis. Additionally, checking the log file (“scraper.log”) might reveal more information about any errors or issues during the script’s execution.

I followed up with

But do you see any specific issues with the code?

It then started making a (trivial but very weird) mistake, but it also (almost inadvertently) pointed out the real flaws/bugs, and gave a good recommendation for improving the code (basically: get rid of the globals and use a dedicated class to keep track of the overall state). So, yes, novice programmers should be wary when using this tool, but at the same time it can still help and speed up learning.
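
For what it’s worth, a minimal sketch of that kind of refactor might look roughly like this (the SiteWatcher class and its method names are purely illustrative, and I’ve only kept the scraping part):

import logging

import requests
from bs4 import BeautifulSoup


class SiteWatcher:
    """Holds the scraping state that the original script kept in globals."""

    def __init__(self, url):
        self.url = url
        self.seen_urls = set()
        self.logger = logging.getLogger(__name__)

    def scrape(self):
        """Return the pages that have not been seen on earlier runs."""
        try:
            response = requests.get(self.url, timeout=30)
            response.raise_for_status()
        except requests.exceptions.RequestException as e:
            self.logger.error(f"Error during request: {e}")
            return []

        soup = BeautifulSoup(response.content, "html.parser")
        new_pages = []
        for link in soup.find_all("a", href=True):
            url = link["href"]
            # <a> tags have no <title> child; use the title attribute or the link text
            title = link.get("title") or link.get_text(strip=True)
            if url not in self.seen_urls:
                self.seen_urls.add(url)
                new_pages.append({"url": url, "title": title})
        return new_pages

The scheduled jobs can then share a single instance of that class instead of module-level globals, e.g. create watcher = SiteWatcher(website_url) once and build the email body from whatever watcher.scrape() returns.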

Eventually, I think those AIs will either make hard-core, manual programming just a hobby for retirees, or the AI tools will be used all the time by almost all programmers. According to some GitHub polls, a whopping 92% of US-based developers are already using them. See: Survey reveals AI’s impact on the developer experience - The GitHub Blog.
