Scraping Data from a Website

Hi, I am trying to learn scraping data from websites with Python, and I tried to extract this list (List of largest companies by revenue - Wikipedia), but it shows 60 columns instead of 8. I added a picture of where I got confused (‘USD millions’ should be the last column, but it continues like 1, 2, 3…), and I added the code. How should I fix it?

That’s the code:

from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')

print(soup)

soup.find_all('table')

soup.find('table', class_='wikitable sortable')

table = soup.find_all('table')[1]

print(table)

world_titles = table.find_all('th')

world_titles

world_table_titles = [title.text.strip() for title in world_titles]
print(world_table_titles)

import pandas as pd

df = pd.DataFrame(columns=world_table_titles)
df

column_data = table.find_all('tr')

for row in column_data[2:]:
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]
    length = len(df)
    df.loc[length] = individual_row_data   # append this row to the DataFrame

df

Don’t webscrape Wikipedia. Parse the wikitext instead:
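For example, a minimal sketch of that idea, assuming the standard MediaWiki action=raw endpoint (which serves the page source as plain text):

import requests

# Fetch the raw wikitext of the article instead of the rendered HTML
url = 'https://en.wikipedia.org/w/index.php'
params = {'title': 'List_of_largest_companies_by_revenue', 'action': 'raw'}
wikitext = requests.get(url, params=params).text

print(wikitext[:500])

In wikitext, a table typically starts with a line beginning {| and each row is delimited by |-, which is far more regular to parse than the rendered HTML.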

table.find_all('th') means every th element in the table, not just the ones in the header row. If you look at the HTML source for the page, you can see that the “rank” column of the table also uses th elements to label the rows, so those row labels get picked up as extra column titles. HTML is more complex than most people realize, even when they try to take that into account 🙂
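A minimal fix along those lines, assuming the first tr of this table really is the header row, is to collect the titles only from that row:

header_row = table.find('tr')   # first row of the table holds the column headers
world_table_titles = [th.text.strip() for th in header_row.find_all('th')]
print(world_table_titles)       # 8 titles instead of 60

For what it’s worth, pandas can also grab the table in one call with pd.read_html(url), which returns a list of DataFrames, one per table on the page.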

Scraping HTML is a last resort. For sites like Wikipedia that display user-generated content, there’s often a way to get the raw source of what the “users” edited, and that’s generally much easier to work with. As Chris showed, Wikipedia allows access to the wikitext simply by adjusting the URL. It can actually be even simpler than that: MediaWiki sites like Wikipedia offer an API, and there is a wrapper for that API on PyPI.
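As a sketch of the API route (using the MediaWiki action=parse endpoint; the response shape below is the default JSON format, worth double-checking against the API docs):

import requests

# Ask the MediaWiki API for the article's wikitext as JSON
api_url = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'parse',
    'page': 'List_of_largest_companies_by_revenue',
    'prop': 'wikitext',
    'format': 'json',
}
response = requests.get(api_url, params=params).json()
wikitext = response['parse']['wikitext']['*']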