I want to write a program that extracts all the words from a web page and puts them into a file so that the program can scan through for particular words. I can do the latter but extracting the words seems to only be possible in small chunks (each class on the Inspect page). How can I extract all the words without having to write every class?
I’ll give a couple of hints: I’d use `requests` to load the web page and then `bs4` (BeautifulSoup) to parse it.

Here’s some example code:
```python
import collections
import requests  # pip install requests
import string
from bs4 import BeautifulSoup  # pip install beautifulsoup4
from pprint import pprint


def remove_punctuation(s: str) -> str:
    """
    Attempts to trivially remove punctuation from a string
    """
    return s.translate(str.maketrans("", "", string.punctuation))


def get_word_counter(url: str) -> collections.Counter:
    """
    Returns a counter for text words on a given url
    Note: All words are converted to lowercase
    """
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    all_text = soup.text
    all_text_no_punctuation = remove_punctuation(all_text)
    counter = collections.Counter()
    for word in all_text_no_punctuation.lower().split():
        counter[word] += 1
    return counter


def print_word_count(url: str):
    pprint(get_word_counter(url))


if __name__ == "__main__":
    print_word_count("https://example.com/")
```
Running that on Python 3.11 gives:

```
C:\Users\csm10495\Desktop>python scrape_words.py
Counter({'domain': 4,
         'in': 3,
         'example': 2,
         'this': 2,
         'for': 2,
         'use': 2,
         'is': 1,
         'illustrative': 1,
         'examples': 1,
         'documents': 1,
         'you': 1,
         'may': 1,
         'literature': 1,
         'without': 1,
         'prior': 1,
         'coordination': 1,
         'or': 1,
         'asking': 1,
         'permission': 1,
         'more': 1,
         'information': 1})
```
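Since the original goal was to scan for particular words, you can look them up directly in the returned counter: a `Counter` returns 0 for missing keys instead of raising `KeyError`. A minimal sketch (using a hardcoded counter in place of a live fetch, so the numbers here are just for illustration):

```python
import collections

# Stand-in for get_word_counter(url); a real run would fetch the page.
counter = collections.Counter({"domain": 4, "in": 3, "example": 2})

# Lookups are safe without membership checks first.
print(counter["domain"])   # 4
print(counter["missing"])  # 0

# Check several words of interest at once, keeping only those present.
words_of_interest = ["example", "missing", "domain"]
found = {w: counter[w] for w in words_of_interest if counter[w] > 0}
print(found)  # {'example': 2, 'domain': 4}
```

This avoids writing the words to a file at all; the counter is already scannable in memory.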
`collections.Counter` is good for counting things like words; it’s a slightly fancy dict, so I’ve used it here. For more info on that, see: collections — Container datatypes — Python 3.11.4 documentation.
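As a quick illustration of that "fancy dict" behavior, here are two conveniences a plain dict doesn’t give you:

```python
import collections

# Counting any iterable works directly, not just words.
c = collections.Counter("abracadabra")

# most_common(n) returns the n highest counts, highest first.
print(c.most_common(2))  # [('a', 5), ('b', 2)]

# Counters support arithmetic: adding two counters sums their counts.
more = collections.Counter("banana")
combined = c + more
print(combined["a"])  # 5 + 3 = 8
```

`most_common()` is handy once you have the page’s counter and want the top words rather than scanning for specific ones.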