Issue with BeautifulSoup multiple find() and regular expressions

tng99 · January 18, 2025, 7:34am

Script find_multi.py: see below

File test.html (6337 bytes; 6 data rows): available from Dropbox

I created a short test.html file (based on a real-life scenario) to test BeautifulSoup. The expected output should consist of 6 rows. Each row contains 4 lines, specifically 2 div elements and 2 span elements, each with specific classes.

My questions are:

1. I only get 3 rows from the test file, even though print(all_books) shows that all 6 rows are present. Why is this happening?
2. The output is as follows: 200 400 200. However, the last line (item['C'] in my script) should output 600. This issue occurs because the class “row” in the 4th line is identical to the one in the “200” line.

My thought for solving this issue was to combine the 1st and 4th lines to get “600”. I tried using find().find() and also regular expressions, but both returned None.

Where did I go wrong? Please help, thanks.

NB: the output should be 200 400 600 100 300 500
NB: all three of my attempts are included in the code.

****script BEGIN, find_multi.py

import sys
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

#f = open('output.txt','w')
#sys.stdout = f

with open("test.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

all_books = soup.find_all("div",class_="Flexbox__StyledFlexbox-sc-1ob4g1e-0 LSuXj Row__StyledRow-sc-1iamenj-0 dKWNAz Aligned-sc-20a62a68-0 dstbXj")   
#print(all_books)
#exit()
for book in all_books:
    item = {}

    item['A'] = book.find("div", class_="Flexbox__StyledFlexbox-sc-1ob4g1e-0 ecTQmb Cell__StyledFlexbox-sc-icfddc-0 kIbrsL Number-styles__StyledFlexTableCell-sc-9545e837-0 euFDfP").text.strip()
    print(item['A'])

    item['B'] = book.find("div", class_="Flexbox__StyledFlexbox-sc-1ob4g1e-0 goWdRL Cell__StyledFlexbox-sc-icfddc-0 kIbrsL Number-styles__StyledFlexTableCell-sc-9545e837-0 euFDfP").text.strip()
    print(item['B'])

    # Attempt 1: same class string as item['A'] (prints 200 again, not 600)
    item['C'] = book.find("div", class_="Flexbox__StyledFlexbox-sc-1ob4g1e-0 ecTQmb Cell__StyledFlexbox-sc-icfddc-0 kIbrsL Number-styles__StyledFlexTableCell-sc-9545e837-0 euFDfP").text.strip()
    # Attempt 2: chained find().find() (returned None)
    #item['C'] = book.find(class_="Flexbox__StyledFlexbox-sc-1ob4g1e-0 ecTQmb Cell__StyledFlexbox-sc-icfddc-0 kIbrsL Number-styles__StyledFlexTableCell-sc-9545e837-0 euFDfP").find(class_="Typography__Span-sc-10mju41-0 iwRMOt Typography__StyledTypography-sc-10mju41-1 cTYWqN Number-styles__StyledTypography-sc-9545e837-1 fYwcsm")
    # Attempt 3: regular expression on the class attribute (returned None)
    #item['C'] = book.find(attrs={"class": re.compile(r".*euFDfP.*fYwcsm")})
    print(item['C'])

harrison · January 18, 2025, 1:38pm

You can think of BeautifulSoup as navigating a tree. Maybe it would help if I described what I’m understanding your code to do.

soup = BeautifulSoup(fp, 'html.parser')

Parses the HTML file, resulting in a BeautifulSoup object that represents the whole document, with the html element at the top of the tree.

all_books = soup.find_all("div",class_="Flexbox__StyledFlexbox-sc-1ob4g1e-0 LSuXj Row__StyledRow-sc-1iamenj-0 dKWNAz Aligned-sc-20a62a68-0 dstbXj")

Returns a list of all divs that have that class – only one in the example file, but I assume the real HTML has multiple of these. Each div in the list is a Tag object that points to that div element instead of the html element, so the tree has been navigated down to that div.

for book in all_books:

Loop over every div Soup object found earlier. For the given example HTML, this will only be once.

item['A'] = book.find("div", class_="Flexbox__StyledFlexbox-sc-1ob4g1e-0 ecTQmb Cell__StyledFlexbox-sc-icfddc-0 kIbrsL Number-styles__StyledFlexTableCell-sc-9545e837-0 euFDfP").text.strip()

You have already navigated the tree to the book div; now you’re finding the first div inside book that matches this class, getting its text, and storing that as item['A'].

item['B'] = book.find("div", class_="Flexbox__StyledFlexbox-sc-1ob4g1e-0 goWdRL Cell__StyledFlexbox-sc-icfddc-0 kIbrsL Number-styles__StyledFlexTableCell-sc-9545e837-0 euFDfP").text.strip()

Going back to the book div, find the first div that matches this class and store its text in item['B'].

item['C'] = book.find("div", class_="Flexbox__StyledFlexbox-sc-1ob4g1e-0 ecTQmb Cell__StyledFlexbox-sc-icfddc-0 kIbrsL Number-styles__StyledFlexTableCell-sc-9545e837-0 euFDfP").text.strip()

Going back to the book div, find the first div that matches this class and store its text in item['C']. Because the class is the same one used for item['A'], this finds the same div again.

To answer your questions specifically,

1. You’re only printing three rows, for item['A'], item['B'], and item['C']. I assume that you will need to add D, E, and F.
2. Think of your book variable as saving your place in the tree. You might be thinking that, since you already got the 200 in A, the next time you try to find this class, it will skip to the second instance of the class. Instead, you’re going back to the beginning and finding the same class again.
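
As a rough illustration of that second point, here is a small sketch. The HTML and the class names (row, cell-a, cell-b) are made up for this example and are much simpler than the real test.html; the only assumption carried over is that two cells in the same row share a class.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for one row of test.html.
html = """
<div class="row">
  <div class="cell-a">200</div>
  <div class="cell-b">400</div>
  <div class="cell-a">600</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
book = soup.find("div", class_="row")

# find() always starts again from `book` and returns the first match,
# so asking twice for the same class gives the same div twice.
first = book.find("div", class_="cell-a").text.strip()
again = book.find("div", class_="cell-a").text.strip()

# find_all() returns every match in document order, so the second
# "cell-a" div can be reached by index.
values = [div.text.strip() for div in book.find_all("div", class_="cell-a")]

print(first, again)  # 200 200
print(values)        # ['200', '600']
```

With find_all() you could take values[1] for item['C'], although that depends on the cells always appearing in the same order.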

The reason find().find() didn’t work is that the first find() navigated the tree to the div for item['A'], and the second find() only searches inside that div. The real item['C'] that you want isn’t part of that subtree.
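
A minimal sketch of that subtree behaviour, again using invented markup and class names (row, cell-a, cell-b, num) rather than the real file. It also shows find_next_sibling(), which is one way to move sideways in the tree instead of downwards; whether it fits the real HTML is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; not the real test.html.
html = """
<div class="row">
  <div class="cell-a"><span class="num">200</span></div>
  <div class="cell-b"><span class="num">400</span></div>
  <div class="cell-a"><span class="num">600</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
book = soup.find("div", class_="row")
first_a = book.find("div", class_="cell-a")

# A chained find() only looks at descendants of first_a.
inner = first_a.find("span", class_="num")

# The other cells are siblings, not descendants, so this is None.
missing = first_a.find("div", class_="cell-b")

# find_next_sibling() searches the following siblings instead.
next_a = first_a.find_next_sibling("div", class_="cell-a")

print(inner.text)            # 200
print(missing)               # None
print(next_a.text.strip())   # 600
```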

Here are some hints that might be helpful.

Your goal is to navigate the HTML tree to find the data you want. There are a lot of ways to do this, other than using classes. Think about the structure of the data and how to get from where your Soup currently is in the tree to where you want it to be.
One thing that stands out to me as potentially useful is the role attribute. You have a row and several cells that could be looped over.
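
To make the role idea concrete, here is a sketch that loops over rows and then over the cells in each row. The markup is a guess at the general shape (elements carrying role="row" and role="cell"), trimmed to two rows; the real file may nest things differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assuming rows and cells carry role attributes.
html = """
<div role="table">
  <div role="row">
    <div role="cell"><span>200</span></div>
    <div role="cell"><span>400</span></div>
    <div role="cell"><span>600</span></div>
  </div>
  <div role="row">
    <div role="cell"><span>100</span></div>
    <div role="cell"><span>300</span></div>
    <div role="cell"><span>500</span></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# Match on the role attribute instead of the long generated class names.
for row in soup.find_all(attrs={"role": "row"}):
    cells = row.find_all(attrs={"role": "cell"})
    rows.append([cell.get_text(strip=True) for cell in cells])

for values in rows:
    print(*values)
```

Because each cell is reached by its position within its own row, it no longer matters that two cells share the same class.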

Trying to parse data out of complex HTML is definitely tricky.

tng99 · January 18, 2025, 9:27pm

Thank you very much for your kind help, Harrison. You indeed gave me some ideas :-).
a) My understanding of soup has improved now, as well as the meaning of using find() or find_all().
b) Yes, using the role attribute is much better and simpler.
c) My sample test.html was wrong; I missed the “table”. Something like, there is no meat in my soup :-).
Now I get all the rows, and the output data is correct.

Thanks to ChatGPT for the demo below; it’s exactly what I needed for my code (except for the ‘td’ cells; mine is just rows.append([row.text]) ).

Thanks very much for your time again :-)

from bs4 import BeautifulSoup

html_content = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Accessible Table</title>
    <style>
        table {
            border-collapse: collapse;
            width: 50%;
            margin: 20px auto;
        }
        th, td {
            border: 1px solid #000;
            text-align: center;
            padding: 8px;
        }
        th {
            background-color: #f4f4f4;
        }
    </style>
</head>
<body>
    <table role="table">
        <thead>
            <tr>
                <th role="cell">A</th>
                <th role="cell">B</th>
                <th role="cell">C</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td role="cell">11</td>
                <td role="cell">22</td>
                <td role="cell">33</td>
            </tr>
            <tr>
                <td role="cell">55</td>
                <td role="cell">66</td>
                <td role="cell">77</td>
            </tr>
            <tr>
                <td role="cell">97</td>
                <td role="cell">98</td>
                <td role="cell">99</td>
            </tr>
        </tbody>
    </table>
</body>
</html>
"""

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Locate the table using the role attribute
table = soup.find('table', role='table')

# Extract headers (from <th> elements with role="cell")
headers = [header.text for header in table.find_all('th', role='cell')]

# Extract rows (from <td> elements with role="cell")
rows = []
for row in table.find_all('tr')[1:]:  # Skip the header row
    cells = row.find_all('td', role='cell')
    rows.append([cell.text for cell in cells])

# Display the extracted data
print("Headers:", headers)
print("Rows:")
for row in rows:
    print(row)