I need your help. I have to extract some information provided in a table on this website.
Basically, I want to extract the data in the middle of the page (you have to scroll down a bit). The table spans 31 pages, but I have never done web scraping before, let alone in Python.
Could anyone help me with that, please?
Thank you very much in advance for your help!
Best,
Well, I tried another kind of code, since Selenium seems too complicated for me as a novice in Python.
Here is what I tried:
import pandas as pd
import time
from selenium import webdriver
import bs4 as bs
import urllib.request
source = urllib.request.urlopen('https://elpais.com/clima-y-medio-ambiente/2023-02-20/el-mapa-de-los-macroproyectos-de-energia-renovable-viaje-al-proximo-boom-solar-y-eolico-en-espana.html').read()
soup = bs.BeautifulSoup(source, 'lxml')
tables = soup.find_all('table')
rows = tables.find_all('tr')
cols = rows.find_all('td')
cols = [item.text.strip() for item in cols]
output.append([item for item in cols if item])
df = pd.DataFrame(output, columns = ['1','2', '3', '4', '5', '6', '7'])
df = df.iloc[1:]
print(df)
But I obtained the following error and I don't understand what the problem is. Could anyone help me, please?
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[106], line 5
2 soup = bs.BeautifulSoup(source, 'lxml')
4 tables = soup.find_all('table')
----> 5 rows = tables.find_all('tr')
6 cols = rows.find_all('td')
7 cols = [item.text.strip() for item in cols]
File ~\AppData\Local\anaconda3\Lib\site-packages\bs4\element.py:2428, in ResultSet.__getattr__(self, key)
2426 def __getattr__(self, key):
2427 """Raise a helpful exception to explain a common code fix."""
-> 2428 raise AttributeError(
2429 "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
2430 )
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
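If it helps, the error message itself points at the fix: find_all() returns a ResultSet (a list), so you have to loop over it (or pick one table) before calling find_all() again. Below is a minimal sketch of that pattern; it assumes the rows are present in the static HTML, which may not be the case if the table is rendered by JavaScript, so it is untested against this page:
import bs4 as bs
import urllib.request
import pandas as pd

source = urllib.request.urlopen('https://elpais.com/clima-y-medio-ambiente/2023-02-20/el-mapa-de-los-macroproyectos-de-energia-renovable-viaje-al-proximo-boom-solar-y-eolico-en-espana.html').read()
soup = bs.BeautifulSoup(source, 'lxml')

output = []
for table in soup.find_all('table'):       # loop over the ResultSet instead of calling find_all() on it
    for row in table.find_all('tr'):       # each individual table element does support find_all()
        cols = [td.text.strip() for td in row.find_all('td')]
        if cols:                           # skip header rows and empty rows
            output.append(cols)

df = pd.DataFrame(output)
print(df)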
I've tried enclosing my XPath in double quotes, and that doesn't seem to work either. When you wrote string, did you mean str()? I'm still not used to Python notation, sorry.
from selenium.webdriver.common.by import By
l = driver.find_element(By.XPATH, "//*[@id="chart"]/div/div/div[2]/table/tbody/tr"))
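For reference, the immediate syntax problem in that line is the double quotes nested inside a double-quoted string, plus a stray closing parenthesis. Wrapping the XPath in single quotes sidesteps that, although the locator itself is an untested guess copied from the attempt above:
from selenium.webdriver.common.by import By

# Same XPath, but the outer quotes are single so the inner double quotes are valid
l = driver.find_element(By.XPATH, '//*[@id="chart"]/div/div/div[2]/table/tbody/tr')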
Do you have any other suggestions apart from using xpath(), please?
I would be more than happy to hear your opinion.
My idea, as a beginner, is to take each row, loop over the rows, append them to a list, and do this for all 31 pages, roughly the pattern sketched below.
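To make that plan concrete, here is a rough sketch of the accumulate-then-paginate idea. The locators ("table tr" and "a.pagination") are placeholders and have not been verified against this page:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://elpais.com/clima-y-medio-ambiente/2023-02-20/el-mapa-de-los-macroproyectos-de-energia-renovable-viaje-al-proximo-boom-solar-y-eolico-en-espana.html")

all_rows = []
for page in range(31):                                    # the table is split across 31 pages
    # collect every row currently shown on this page
    for tr in driver.find_elements(By.CSS_SELECTOR, "table tr"):
        cells = [td.text.strip() for td in tr.find_elements(By.TAG_NAME, "td")]
        if cells:
            all_rows.append(cells)
    # move to the next page, except after the last one (placeholder selector)
    if page < 30:
        driver.find_element(By.CSS_SELECTOR, "a.pagination").click()

driver.quit()
df = pd.DataFrame(all_rows)
print(df)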
The Stack Overflow article shows an XPath API being used.
I suggest you read the Selenium docs and do a web search on using XPath.
I do not have experience in using XPath myself.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

url = "https://elpais.com/clima-y-medio-ambiente/2023-02-20/el-mapa-de-los-macroproyectos-de-energia-renovable-viaje-al-proximo-boom-solar-y-eolico-en-espana.html"

# Start the web driver
driver = webdriver.Chrome()

# Navigate to the URL
driver.get(url)

# Wait for the table to load (adjust the timeout as needed)
wait = WebDriverWait(driver, 10)
table = wait.until(EC.presence_of_element_located((By.XPATH, "//table[@class='medium datawrapper-o9C5O-41clw5 svelte-mcdlwv striped compact resortable']")))

data_list = []

while True:
    # Extract data from the current page
    rows = table.find_elements(By.XPATH, "//table[@class='medium datawrapper-o9C5O-41clw5 svelte-mcdlwv striped compact resortable']//tr")
    for row in rows:
        columns = row.find_elements(By.TAG_NAME, "td")
        data_list.append([col.text.strip() for col in columns])

    # Find the next page button
    next_button = driver.find_element(By.XPATH, "//a[@class='pagination export-hide datawrapper-o9C5O-xyhjtd svelte-1ya2siw']")

    # Check if there is a next page
    if next_button:
        # Click the next page button
        next_button.click()
    else:
        break  # Exit the loop if there is no next page

# Close the web driver
driver.quit()

# Convert the data_list to a Pandas DataFrame
df = pd.DataFrame(data_list, columns=['Column1', 'Column2', 'Column3', ...])

# Save the DataFrame to an Excel file
df.to_excel('output_data.xlsx', index=False)
But again, I got an error back:
---------------------------------------------------------------------------
TimeoutException Traceback (most recent call last)
Cell In[1], line 18
16 # Wait for the table to load (adjust the timeout as needed)
17 wait = WebDriverWait(driver, 10)
---> 18 table = wait.until(EC.presence_of_element_located((By.XPATH, "//table[@class='medium datawrapper-o9C5O-41clw5 svelte-mcdlwv striped compact resortable']")))
20 data_list = []
22 while True:
23 # Extract data from the current page
File ~\AppData\Local\anaconda3\Lib\site-packages\selenium\webdriver\support\wait.py:105, in WebDriverWait.until(self, method, message)
103 if time.monotonic() > end_time:
104 break
--> 105 raise TimeoutException(message, screen, stacktrace)
TimeoutException: Message:
Stacktrace:
GetHandleVerifier [0x00007FF7A56E2142+3514994]
(No symbol) [0x00007FF7A5300CE2]
(No symbol) [0x00007FF7A51A76AA]
(No symbol) [0x00007FF7A51F1860]
(No symbol) [0x00007FF7A51F197C]
(No symbol) [0x00007FF7A5234EE7]
(No symbol) [0x00007FF7A521602F]
(No symbol) [0x00007FF7A52328F6]
(No symbol) [0x00007FF7A5215D93]
(No symbol) [0x00007FF7A51E4BDC]
(No symbol) [0x00007FF7A51E5C64]
GetHandleVerifier [0x00007FF7A570E16B+3695259]
GetHandleVerifier [0x00007FF7A5766737+4057191]
GetHandleVerifier [0x00007FF7A575E4E3+4023827]
GetHandleVerifier [0x00007FF7A54304F9+689705]
(No symbol) [0x00007FF7A530C048]
(No symbol) [0x00007FF7A5308044]
(No symbol) [0x00007FF7A53081C9]
(No symbol) [0x00007FF7A52F88C4]
BaseThreadInitThunk [0x00007FFBF39F7344+20]
RtlUserThreadStart [0x00007FFBF3C026B1+33]
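In case it is useful to anyone reading this later: the class names in that XPath (datawrapper-..., svelte-...) suggest the table is a Datawrapper embed, and those normally sit inside an iframe, which would explain why a locator run against the top-level document times out. This is an untested guess; the iframe selector and the generic table locator below are assumptions, not verified against the page. Switching into the frame first would look roughly like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://elpais.com/clima-y-medio-ambiente/2023-02-20/el-mapa-de-los-macroproyectos-de-energia-renovable-viaje-al-proximo-boom-solar-y-eolico-en-espana.html")

wait = WebDriverWait(driver, 20)

# Switch into the (assumed) Datawrapper iframe before looking for the table.
# The CSS selector is a guess; check the page source for the real iframe.
wait.until(EC.frame_to_be_available_and_switch_to_it(
    (By.CSS_SELECTOR, "iframe[src*='datawrapper']")))

# Inside the frame, a plain tag-name locator is less brittle than the long class string.
table = wait.until(EC.presence_of_element_located((By.TAG_NAME, "table")))

rows = [[td.text.strip() for td in tr.find_elements(By.TAG_NAME, "td")]
        for tr in table.find_elements(By.TAG_NAME, "tr")]
print(rows)

# driver.switch_to.default_content() returns to the main page afterwards.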