Script find_multi.py: see below
File test.html (6337 bytes; 6 data rows): available from Dropbox
I created a short test.html file (based on a real-life scenario) to test BeautifulSoup. The expected output should consist of 6 rows. Each row contains 4 lines, specifically 2 div elements and 2 span elements, each with specific classes.
My questions are:
- I only get 3 rows from the test file, even though print(all_books) shows that all 6 rows are present. Why is this happening?
  The output is as follows: 200 400 200
- The last line (item['C'] in my script) should output 600, not 200. This happens because the class of the 4th line in the row is identical to the class of the "200" line.
My thought for solving this issue was to combine the 1st and 4th lines to get “600”. I tried using find().find() and also regular expressions, but both returned None.
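A minimal sketch of the behaviour described above, on made-up markup. The real test.html is not reproduced here, so the class names `row`, `cell-a`, `cell-b`, `cell-c` and the cell layout are assumptions; the only property carried over from the real file is that the 1st and 4th cells share a class. It illustrates that `find()` always returns the first match, while `find_all()` returns every match, so the later cell can be picked by position.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, NOT the real test.html: class names are invented.
# The 1st and 4th cells deliberately share the same class.
HTML = """
<div class="row">
  <div class="cell-a"><span class="num">200</span></div>
  <div class="cell-b"><span class="num">400</span></div>
  <div class="cell-c"><span class="txt">x</span></div>
  <div class="cell-a"><span class="num">600</span></div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
row = soup.find("div", class_="row")

# find() stops at the first matching descendant, so calling it twice
# with the same class gives the same cell both times.
first = row.find("div", class_="cell-a")
repeat = row.find("div", class_="cell-a")

# find_all() returns every matching descendant in document order.
cells = row.find_all("div", class_="cell-a")

print(first.text.strip())      # 200
print(repeat.text.strip())     # 200
print(cells[-1].text.strip())  # 600
```

Whether indexing into `find_all()` is the right fix for the real file depends on its actual structure, which this sketch does not know.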
Where did I go wrong? Please help, thanks.
NB: the output should be 200 400 600 100 300 500
NB: all three of my attempts are included in the code.
****script BEGIN, find_multi.py
import sys
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

#f = open('output.txt','w')
#sys.stdout = f
with open("test.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# One match per data row
all_books = soup.find_all("div", class_="Flexbox__StyledFlexbox-sc-1ob4g1e-0 LSuXj Row__StyledRow-sc-1iamenj-0 dKWNAz Aligned-sc-20a62a68-0 dstbXj")
#print(all_books)
#exit()
for book in all_books:
    item = {}
    item['A'] = book.find("div", class_="Flexbox__StyledFlexbox-sc-1ob4g1e-0 ecTQmb Cell__StyledFlexbox-sc-icfddc-0 kIbrsL Number-styles__StyledFlexTableCell-sc-9545e837-0 euFDfP").text.strip()
    print(item['A'])
    item['B'] = book.find("div", class_="Flexbox__StyledFlexbox-sc-1ob4g1e-0 goWdRL Cell__StyledFlexbox-sc-icfddc-0 kIbrsL Number-styles__StyledFlexTableCell-sc-9545e837-0 euFDfP").text.strip()
    print(item['B'])
    # Attempt 1: same class string as item['A'], so it prints 200 instead of 600
    item['C'] = book.find("div", class_="Flexbox__StyledFlexbox-sc-1ob4g1e-0 ecTQmb Cell__StyledFlexbox-sc-icfddc-0 kIbrsL Number-styles__StyledFlexTableCell-sc-9545e837-0 euFDfP").text.strip()
    # Attempt 2: find().find() (returned None)
    #item['C'] = book.find(class_="Flexbox__StyledFlexbox-sc-1ob4g1e-0 ecTQmb Cell__StyledFlexbox-sc-icfddc-0 kIbrsL Number-styles__StyledFlexTableCell-sc-9545e837-0 euFDfP").find(class_="Typography__Span-sc-10mju41-0 iwRMOt Typography__StyledTypography-sc-10mju41-1 cTYWqN Number-styles__StyledTypography-sc-9545e837-1 fYwcsm")
    # Attempt 3: regular expression (returned None)
    #item['C'] = book.find(attrs={"class": re.compile(r".*euFDfP.*fYwcsm")})
    print(item['C'])
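For reference, a minimal sketch of how the regular-expression attempt behaves, again on made-up markup. It assumes that `euFDfP` sits on the outer div and `fYwcsm` on a nested span, as the find().find() attempt suggests; the extra class names `box` and `label` are invented, and this cannot be confirmed without test.html. BeautifulSoup tests the pattern against the class attribute of one element at a time, so a pattern that needs both tokens cannot match when they live on different elements.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the two class tokens are on DIFFERENT elements.
HTML = '<div class="box euFDfP"><span class="label fYwcsm">600</span></div>'
soup = BeautifulSoup(HTML, "html.parser")

# The pattern requires both tokens in the class attribute of one element.
# No single element carries both, so find() returns None.
both = soup.find(attrs={"class": re.compile(r"euFDfP.*fYwcsm")})

# Matching one token per element and chaining the calls does work here.
outer = soup.find("div", class_=re.compile(r"euFDfP"))
inner = outer.find("span", class_=re.compile(r"fYwcsm"))

print(both)                # None
print(inner.text.strip())  # 600
```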