Scraping Data from a Website

Hi, I am trying to learn scraping data from websites with Python, and I tried to extract this list (List of largest companies by revenue - Wikipedia), but it shows 60 columns instead of 8. I added a picture of where I got confused (‘USD millions’ should be the last column, but it continues like 1, 2, 3…), and I added the code. How should I fix it?

That’s the code:

from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')

print(soup)

soup.find_all('table')

soup.find('table', class_='wikitable sortable')

table = soup.find_all('table')[1]

print(table)

world_titles = table.find_all('th')

world_titles

world_table_titles = [title.text.strip() for title in world_titles]
print(world_table_titles)

import pandas as pd

df = pd.DataFrame(columns=world_table_titles)
df

column_data = table.find_all('tr')

for row in column_data[2:]:
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]
    length = len(df)
    df.loc[length] = individual_row_data   # append this row to the DataFrame

df

Don’t webscrape Wikipedia. Parse the wikitext instead:
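For example, a minimal sketch of that idea, assuming the standard MediaWiki action=raw endpoint (which serves the page source as plain text):

import requests

# Fetch the raw wikitext of the article instead of the rendered HTML
url = 'https://en.wikipedia.org/w/index.php'
params = {'title': 'List_of_largest_companies_by_revenue', 'action': 'raw'}
wikitext = requests.get(url, params=params).text

print(wikitext[:500])

In wikitext, a table typically starts with a line beginning {| and each row is delimited by |-, which is far more regular to parse than the rendered HTML.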

table.find_all('th') means every th element in the table, not just the ones in the header row. If you look at the HTML source for the page, you can see that the “rank” column of the table also uses th elements to label the rows, so those row labels get picked up as extra column titles. HTML is more complex than most people realize, even when they try to take that into account 🙂
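A minimal fix along those lines, assuming the first tr of this table really is the header row, is to collect the titles only from that row:

header_row = table.find('tr')   # first row of the table holds the column headers
world_table_titles = [th.text.strip() for th in header_row.find_all('th')]
print(world_table_titles)       # 8 titles instead of 60

For what it’s worth, pandas can also grab the table in one call with pd.read_html(url), which returns a list of DataFrames, one per table on the page.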

Scraping HTML is a last resort. For sites like Wikipedia that display user-generated content, there’s often a way to get the raw source of what the “users” edited, and that’s generally much easier to work with. As Chris showed, Wikipedia allows access to the wikitext simply by adjusting the URL. It can actually be even simpler than that: MediaWiki sites like Wikipedia offer an API, and there is a wrapper for that API on PyPI.
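As a sketch of the API route (using the MediaWiki action=parse endpoint; the response shape below is the default JSON format, worth double-checking against the API docs):

import requests

# Ask the MediaWiki API for the article's wikitext as JSON
api_url = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'parse',
    'page': 'List_of_largest_companies_by_revenue',
    'prop': 'wikitext',
    'format': 'json',
}
response = requests.get(api_url, params=params).json()
wikitext = response['parse']['wikitext']['*']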