Python web scraping: table at the same URL, but in multiple "pages"

Hi everyone,

I need your help. I have to extract some information provided in table format on this website: https://elpais.com/clima-y-medio-ambiente/2023-02-20/el-mapa-de-los-macroproyectos-de-energia-renovable-viaje-al-proximo-boom-solar-y-eolico-en-espana.html

Basically, I want to extract the data in the middle of the page (you have to scroll down a bit). The table spans 31 pages, but I have never done web scraping before, even less so in Python.

Could anyone help me with that, please?

Thank you very much in advance for your help!
Best,

Michael

Look into using Selenium to load the page and extract information from it.

I’m assuming that you will have to use Selenium because the data you want will only appear after JavaScript code has run.
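Something along these lines is the usual starting point (a minimal sketch; it assumes Chrome and a matching driver are available, and it only fetches the rendered HTML):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://elpais.com/clima-y-medio-ambiente/2023-02-20/el-mapa-de-los-macroproyectos-de-energia-renovable-viaje-al-proximo-boom-solar-y-eolico-en-espana.html")

# page_source is the document *after* JavaScript has run in the browser
html = driver.page_source

driver.quit()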


Hi @barry-scott,

Thank you so much for your help.
I will try to use selenium and get back to this post if I am stuck.

Lovely day.
Michael

Well, I tried another type of code, as Selenium seems too complicated for me, since I am a novice in Python.

Here is what I tried:

import pandas as pd
import time
from selenium import webdriver
import bs4 as bs
import urllib.request


source = urllib.request.urlopen('https://elpais.com/clima-y-medio-ambiente/2023-02-20/el-mapa-de-los-macroproyectos-de-energia-renovable-viaje-al-proximo-boom-solar-y-eolico-en-espana.html').read()
soup = bs.BeautifulSoup(source, 'lxml')

tables = soup.find_all('table')
rows = tables.find_all('tr')
cols = rows.find_all('td') 
cols = [item.text.strip() for item in cols] 
output.append([item for item in cols if item])
df = pd.DataFrame(output, columns = ['1','2', '3', '4', '5', '6', '7'])
df = df.iloc[1:]

print(df) 

But I obtained the following error, and I don’t understand what the problem is. Could anyone help me, please?

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[106], line 5
      2 soup = bs.BeautifulSoup(source, 'lxml')
      4 tables = soup.find_all('table')
----> 5 rows = tables.find_all('tr')
      6 cols = rows.find_all('td') 
      7 cols = [item.text.strip() for item in cols] 

File ~\AppData\Local\anaconda3\Lib\site-packages\bs4\element.py:2428, in ResultSet.__getattr__(self, key)
   2426 def __getattr__(self, key):
   2427     """Raise a helpful exception to explain a common code fix."""
-> 2428     raise AttributeError(
   2429         "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
   2430     )

AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

Thank you!
Michael

To use BeautifulSoup, the data you want from the web page must be static.

In other words, it must not require JavaScript to be run to generate the page in the browser.

In your case I believe the page does not contain the data.
If I am correct, then you cannot use BeautifulSoup and will need to learn Selenium.

If you print out the page data in source you can confirm this.
(The AttributeError itself is a separate problem: find_all() returns a list-like ResultSet, and your code calls find_all() on that list instead of on each element inside it.)
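For example (a sketch; printing the page is how you confirm whether the table is in the static HTML, and the loop shows how find_all() results must be iterated, which also fixes the AttributeError):

import urllib.request
import bs4 as bs

url = 'https://elpais.com/clima-y-medio-ambiente/2023-02-20/el-mapa-de-los-macroproyectos-de-energia-renovable-viaje-al-proximo-boom-solar-y-eolico-en-espana.html'
source = urllib.request.urlopen(url).read()

# if this prints False, the table is generated by JavaScript
# and BeautifulSoup will never see it
print(b'<table' in source)

soup = bs.BeautifulSoup(source, 'lxml')
rows = []
for table in soup.find_all('table'):    # find_all() returns a list-like ResultSet,
    for tr in table.find_all('tr'):     # so loop over it element by element
        cells = [td.text.strip() for td in tr.find_all('td')]
        if cells:
            rows.append(cells)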

Thank you for your feedback and information.

Could you please explain what you mean by:

If you print out the page data in source you can confirm this.
And that is reason you see an error.

In particular, what is source? Sorry if this seems basic to you, but for me it is not.
Thank you in advance for your time.

Michael

Well, in reality I did try Selenium, but I confess that I gave up quite easily.

The reason is that I don’t understand my mistake in the code below:

driver = webdriver.Chrome()
driver.get("https://elpais.com/clima-y-medio-ambiente/2023-02-20/el-mapa-de-los-macroproyectos-de-energia-renovable-viaje-al-proximo-boom-solar-y-eolico-en-espana.html")
l = driver.find_element("xpath", //*[@id="chart"]/div/div/div[2]/table/tbody/tr))

I obtained the following error:

 Cell In[94], line 2
    l = driver.find_element("xpath", //*[@id="chart"]/div/div/div[2]/table/tbody/tr))
                                                                                    ^
SyntaxError: unmatched ')'

Thank you again for your help!

It is this variable in your code: the one you created with source = urllib.request.urlopen(...).read().


OK, now I understand what source is. Thank you so much!

It is the variable that I created. I thought it was something more complicated… Sorry!

You get the error because, I guess, you need to pass the XPath as a string.

I have not used XPath in Selenium, but a web search found this page that shows the syntax: How to select element using XPATH syntax on Selenium for Python? - Stack Overflow
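In other words, wrap the whole XPath in quotes so it becomes a Python str. Using single quotes on the outside avoids clashing with the double quotes inside the XPath (a sketch reusing the XPath from your post):

from selenium.webdriver.common.by import By

l = driver.find_element(By.XPATH, '//*[@id="chart"]/div/div/div[2]/table/tbody/tr')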

Does Selenium deal with “dynamic” tables?

I am sorry if “dynamic” is not the right word; I mean tables that spread over multiple pages while the URL stays the same.

Thank you.

With Selenium you can operate the web page from code, performing clicks and typing through Selenium API calls.

That is how you can get a page to update.

Beware that page updates are not instant.
Often you need to wait for the page to finish updating.
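The usual tool for that is an explicit wait (a sketch; the “Next” locator is a placeholder, not the real one from the page):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)

# block until the button is present and clickable, then click it
next_button = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "Next")))
next_button.click()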

OK.
Let’s see what I can make of all that…
Thank you!

Michael

Oh, and beware that once you get this all working, it will take only one small change by the website owner to break your code’s assumptions.

I’ve tried enclosing my XPath in " ", and that doesn’t seem to work either. When you wrote “string”, did you mean str()? I’m still not used to Python notation, sorry.

from selenium.webdriver.common.by import By
l = driver.find_element(By.XPATH, "//*[@id="chart"]/div/div/div[2]/table/tbody/tr"))

Do you have any other suggestions apart from using XPath, please?
I would be more than happy to hear your opinion.

My idea as a beginner is to take each row, loop over them, append them to a list, and do this for all 31 pages.

But I’m not quite sure how to go about it.

Thank you.

Michael

The Stack Overflow article shows an XPath API being used.
I suggest you read the Selenium docs and do a web search on using XPath.
I do not have experience using XPath.
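Two small things stand out in your last snippet, though: the double quotes around @id="chart" end the Python string early, and there is one ')' too many. As for an alternative to XPath: since you ultimately want every row, you could ask Selenium for all the tr elements directly and loop over them, which matches your append-in-a-loop idea (a sketch; it assumes the driver has already loaded the page and the table is visible to it):

from selenium.webdriver.common.by import By

data = []
for tr in driver.find_elements(By.CSS_SELECTOR, "table tr"):
    cells = [td.text.strip() for td in tr.find_elements(By.TAG_NAME, "td")]
    if cells:    # header rows have no <td>, so skip them
        data.append(cells)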


I tried this with Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

url = "https://elpais.com/clima-y-medio-ambiente/2023-02-20/el-mapa-de-los-macroproyectos-de-energia-renovable-viaje-al-proximo-boom-solar-y-eolico-en-espana.html"

# Start the web driver
driver = webdriver.Chrome()

# Navigate to the URL
driver.get(url)

# Wait for the table to load (adjust the timeout as needed)
wait = WebDriverWait(driver, 10)
table = wait.until(EC.presence_of_element_located((By.XPATH, "//table[@class='medium datawrapper-o9C5O-41clw5 svelte-mcdlwv striped compact resortable']")))

data_list = []

while True:
    # Extract data from the current page
    rows = table.find_elements(By.XPATH, "//table[@class='medium datawrapper-o9C5O-41clw5 svelte-mcdlwv striped compact resortable']//tr")
    
    for row in rows:
        columns = row.find_elements(By.TAG_NAME, "td")
        data_list.append([col.text.strip() for col in columns])
    
    # Find the next page button
    next_button = driver.find_element(By.XPATH, "//a[@class='pagination export-hide datawrapper-o9C5O-xyhjtd svelte-1ya2siw']")

    # Check if there is a next page
    if next_button:
        # Click the next page button
        next_button.click()
    else:
        break  # Exit the loop if there is no next page

# Close the web driver
driver.quit()

# Convert the data_list to a Pandas DataFrame
df = pd.DataFrame(data_list, columns=['Column1', 'Column2', 'Column3', ...])

# Save the DataFrame to an Excel file
df.to_excel('output_data.xlsx', index=False)

But again, I got an error:

---------------------------------------------------------------------------
TimeoutException                          Traceback (most recent call last)
Cell In[1], line 18
     16 # Wait for the table to load (adjust the timeout as needed)
     17 wait = WebDriverWait(driver, 10)
---> 18 table = wait.until(EC.presence_of_element_located((By.XPATH, "//table[@class='medium datawrapper-o9C5O-41clw5 svelte-mcdlwv striped compact resortable']")))
     20 data_list = []
     22 while True:
     23     # Extract data from the current page

File ~\AppData\Local\anaconda3\Lib\site-packages\selenium\webdriver\support\wait.py:105, in WebDriverWait.until(self, method, message)
    103     if time.monotonic() > end_time:
    104         break
--> 105 raise TimeoutException(message, screen, stacktrace)

TimeoutException: Message: 

Could anyone give me a hand, please?
Thank you.
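A guess at the cause: class names like datawrapper-o9C5O-... suggest the table is a Datawrapper embed, and Datawrapper charts normally live inside an iframe. find_element and WebDriverWait only search the current document, so the wait times out without ever seeing the table. Switching into the iframe first might look like this (a sketch; the iframe and button locators are assumptions, not taken from the real page):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

url = "https://elpais.com/clima-y-medio-ambiente/2023-02-20/el-mapa-de-los-macroproyectos-de-energia-renovable-viaje-al-proximo-boom-solar-y-eolico-en-espana.html"

driver = webdriver.Chrome()
driver.get(url)
wait = WebDriverWait(driver, 10)

# switch into the embedded Datawrapper iframe first (locator is a guess)
wait.until(EC.frame_to_be_available_and_switch_to_it(
    (By.CSS_SELECTOR, "iframe[src*='datawrapper']")))

data = []
for _ in range(31):    # the 31 pages you mentioned
    wait.until(EC.presence_of_element_located((By.TAG_NAME, "table")))
    for tr in driver.find_elements(By.CSS_SELECTOR, "table tbody tr"):
        cells = [td.text.strip() for td in tr.find_elements(By.TAG_NAME, "td")]
        if cells:
            data.append(cells)

    # find_elements (plural) returns [] instead of raising when nothing
    # matches, so it can test whether a "next" button exists
    buttons = driver.find_elements(By.CSS_SELECTOR, "a.pagination")
    if not buttons:
        break
    buttons[0].click()
    time.sleep(1)      # crude; give the table time to redraw

driver.quit()

One more detail: in your script, if next_button: can never be False, because find_element raises NoSuchElementException when nothing matches rather than returning None. find_elements (plural) returns an empty list, which is why the sketch uses it for the end-of-pages check.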