I want to write a program that extracts all the words from a web page and puts them into a file so that the program can scan through for particular words. I can do the latter but extracting the words seems to only be possible in small chunks (each class on the Inspect page). How can I extract all the words without having to write every class?
I’ll give a couple of hints: I’d use `requests` to load the web page and then `bs4` (BeautifulSoup) to parse it.

Here’s some example code:
```python
import collections
import requests  # pip install requests
import string
from bs4 import BeautifulSoup  # pip install beautifulsoup4
from pprint import pprint


def remove_punctuation(s: str) -> str:
    """
    Attempts to trivially remove punctuation from a string
    """
    return s.translate(str.maketrans("", "", string.punctuation))


def get_word_counter(url: str) -> collections.Counter:
    """
    Returns a counter for text words on a given url
    Note: All words are converted to lowercase
    """
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    all_text = soup.text
    all_text_no_punctuation = remove_punctuation(all_text)
    counter = collections.Counter()
    for word in all_text_no_punctuation.lower().split():
        counter[word] += 1
    return counter


def print_word_count(url: str):
    pprint(get_word_counter(url))


if __name__ == "__main__":
    print_word_count("https://example.com/")
```
Running that on Python 3.11 gives:

```
C:\Users\csm10495\Desktop>python scrape_words.py
Counter({'domain': 4,
         'in': 3,
         'example': 2,
         'this': 2,
         'for': 2,
         'use': 2,
         'is': 1,
         'illustrative': 1,
         'examples': 1,
         'documents': 1,
         'you': 1,
         'may': 1,
         'literature': 1,
         'without': 1,
         'prior': 1,
         'coordination': 1,
         'or': 1,
         'asking': 1,
         'permission': 1,
         'more': 1,
         'information': 1})
```
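Since the original goal was to scan for particular words, you can look them up directly in the returned counter: a `Counter` returns 0 for missing keys instead of raising `KeyError`. A minimal sketch (using a hardcoded counter in place of a live fetch, so the numbers here are just for illustration):

```python
import collections

# Stand-in for get_word_counter(url); a real run would fetch the page.
counter = collections.Counter({"domain": 4, "in": 3, "example": 2})

# Lookups are safe without membership checks first.
print(counter["domain"])   # 4
print(counter["missing"])  # 0

# Check several words of interest at once, keeping only those present.
words_of_interest = ["example", "missing", "domain"]
found = {w: counter[w] for w in words_of_interest if counter[w] > 0}
print(found)  # {'example': 2, 'domain': 4}
```

This avoids writing the words to a file at all; the counter is already scannable in memory.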
`collections.Counter` is good for counting things like words; it’s a slightly fancy dict, so I’ve used it here. For more info on that, see: collections — Container datatypes — Python 3.11.4 documentation.
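As a quick illustration of that "fancy dict" behavior, here are two conveniences a plain dict doesn’t give you:

```python
import collections

# Counting any iterable works directly, not just words.
c = collections.Counter("abracadabra")

# most_common(n) returns the n highest counts, highest first.
print(c.most_common(2))  # [('a', 5), ('b', 2)]

# Counters support arithmetic: adding two counters sums their counts.
more = collections.Counter("banana")
combined = c + more
print(combined["a"])  # 5 + 3 = 8
```

`most_common()` is handy once you have the page’s counter and want the top words rather than scanning for specific ones.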