Dynamic web scraper

Good morning,

I am currently working on a web scraper.
The idea is:
- the user enters an ICAO code (a 4-letter airfield ID),
- the script opens the web page of the airfield,
- looks for certain IDs stored in a list, and writes them to an Excel file at exact locations that I also stored in a list,
- then saves the Excel file,
- takes a screenshot of a table (again by its ID) on the website,
- closes the web page.

Everything works well, but my problem is that I have to do this on multiple pages of the same website (chosen dynamically from the ICAO the user entered). Some IDs are the same across pages, but the majority are not. Do you have an idea of how I could do this? I thought of storing the IDs in a separate file for each page.
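To picture the "one set of IDs per page" idea, here is a minimal sketch that keeps everything in a single JSON document keyed by ICAO instead of many files. The ICAO codes, element IDs, field names, cell addresses and the function name lookup_plan are all placeholders made up for illustration, not values from the real site.

```python
import json

# Hypothetical content of a file such as "field_ids.json".
# Every ICAO code, element ID and cell address below is a placeholder.
MAPPING_JSON = """
{
  "cells": {"magnetic_variation": "B4", "elevation": "B5"},
  "ids": {
    "AAAA": {"magnetic_variation": "id-0020", "elevation": "id-0031"},
    "BBBB": {"magnetic_variation": "id-0023", "elevation": "id-0035"}
  }
}
"""


def lookup_plan(mapping, icao):
    """Return (field, element_id, cell) triples for one airfield."""
    ids = mapping["ids"][icao.upper()]
    return [
        (field, ids[field], cell)
        for field, cell in mapping["cells"].items()
        if field in ids  # skip fields this airfield has no ID for
    ]


mapping = json.loads(MAPPING_JSON)
for field, element_id, cell in lookup_plan(mapping, "aaaa"):
    print(f"{field}: read element {element_id}, write to cell {cell}")
```

The Excel cell for a field is stored once, and only the element IDs vary per airfield. The obvious drawback is that someone still has to collect the IDs for every page by hand.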

The script is meant to help colleagues scrape data faster than searching, copying, pasting, etc.

I’m working with selenium and openpyxl.

I hope my message is clear. If you have any questions, please shoot!

Thank you for reading.

Can you give examples, please, and say what you want to change or fix?

Okay. For instance, one ID looks like //*[@id="AD2016011515362402580020"] on one airfield page and like //*[@id="AD2016011515362402580023"] on another airfield page for the same field, for instance “magnetic variation”.

What I would like is a way to select the same field on all the different pages. Someone suggested OCR to me, but I’m not sure it would work.

I hope it is clear.

Those two IDs seem to follow the same pattern, so matching them should be easy.
But then you talk about OCR, which only makes sense if the data is in an image rather than text.
Sorry, I still do not understand the problem you face.

All the IDs have the same pattern at the beginning, yes, but it is the same pattern for almost all elements on the website. Only the last two digits change.
For instance, magnetic variation for airfield1 is "AD2016011515362402580020"
and for airfield2 it is "AD2016011515362402580027".
But I can also see the exact same pattern for the city name of the airfield or the elevation.

My problem is that I’m looking for a way to automatically collect data from hundreds of pages that look exactly the same but use different IDs for the same element.

Forget about OCR. Someone told me about it, and I thought about taking a screenshot and then collecting the data from the screenshot. I only mentioned it to show which options I had already considered.

OK, so you can use a regex to match that pattern and pull all the IDs out of a page.
Do you have a few sample URLs of pages with this data on them?
Does the mapping of ID to city or airfield happen on the web page, or are you supposed to know it by other means?
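A minimal sketch of the regex idea, assuming the IDs are always "AD" followed only by digits and appear as double-quoted id attributes, as in the two examples quoted above. The sample string and the name find_ids are made up for illustration.

```python
import re

# Assumption: an ID is "AD" followed only by digits, inside id="...".
ID_PATTERN = re.compile(r'id="(AD\d+)"')


def find_ids(html):
    """Return every matching element ID found in the page text, in order."""
    return ID_PATTERN.findall(html)


# Made-up fragment; only the two IDs come from this thread.
sample = (
    '<div id="AD-2.EBBE">'
    '<td id="AD2016011515362402580020">...</td>'
    '<td id="AD2016011515362402580023">...</td>'
    "</div>"
)

print(find_ids(sample))
```

This only lists the IDs present on a page. It does not say which ID belongs to which field, so it still needs something else (for example the position in a table) to tell them apart.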

Here is a sample of URLs:
https://ops.skeyes.be/html/belgocontrol_static/eaip/eAIP_Main/html/eAIP/EB-AD-2.EBBE-en-GB.html#AD-2.EBBE
https://www.sia.aviation-civile.gouv.fr/dvd/eAIP_25_JAN_2024/FRANCE/AIRAC-2024-01-25/html/eAIP/FR-AD-2.LFOA-fr-FR.html#AD-2.eAIP.LFOA
Note that it can be the website of any EU country or of the US. As far as I know, every EU country has almost the same website.
The IDs appear on the website; that’s the only way to know them.

I hope I understood your questions correctly; English is not my mother tongue.

PS: do you have an example of how I can use a regex to match the pattern?

So, I tried changing things. Here is another way to access the data I need:
/html[1]/body[1]/div[2]/div[1]/div[12]/table[1]/tbody[1]/tr[1]/td[2]
But this is not very robust to changes, so I tried to anchor on some fixed IDs instead, but it doesn’t work:
/html[1]/body[1]//div[starts-with(@id, "AD-2.EB")]/div[ends-with(@id, "-AD-2.12")]/table[1]/tbody[1]/tr[1]/td[1]
Does anyone have an idea of what is wrong with this?
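One likely cause: ends-with() belongs to XPath 2.0, while browsers (and therefore selenium) evaluate XPath 1.0, which has starts-with() and contains() but no ends-with(). Below is a minimal sketch of the usual substring() workaround. The helper name ends_with is made up for illustration, and the resulting expression has not been checked against the actual page.

```python
def ends_with(attr, suffix):
    """Build an XPath 1.0 predicate equivalent to ends-with(@attr, suffix)."""
    return (
        f"substring(@{attr}, string-length(@{attr}) "
        f"- string-length('{suffix}') + 1) = '{suffix}'"
    )


# Same expression as in the post, with ends-with() swapped out.
xpath = (
    "/html[1]/body[1]//div[starts-with(@id, 'AD-2.EB')]"
    f"/div[{ends_with('id', '-AD-2.12')}]"
    "/table[1]/tbody[1]/tr[1]/td[1]"
)
print(xpath)
```

A looser alternative is contains(@id, '-AD-2.12'), which is shorter but also matches IDs where that text appears in the middle.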

Thanks in advance

This may be a bit off-topic, but for the record, if you want to build a database of airfields/airports, there are several available for free that directly expose their data source (so no need for scraping):

Hello Alex,
Thank you for your help. I’m not trying to build a database, but to fill in Excel templates so we can work faster. I checked your links, but I need very specific information that is not present in them.
Thanks again; it may help me in the future.

In fact, I found a solution that will (I hope) be portable across all the websites.
This is what I did:
/html[1]/body[1]//*[@id="AD-2.'+ ICAO + '"]//*[@id="'+ ICAO + '-AD-2.12"]/table[1]/tbody[1]/tr[1]/td[1]
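The same idea can be wrapped in a small helper so the ICAO, section, row and column are not hard-coded. This is only a sketch: the function name cell_xpath and its parameters are made up, and the ID pattern is taken from the expression above, so it may not hold on every country's site (the French sample URL, for instance, uses the anchor AD-2.eAIP.LFOA).

```python
def cell_xpath(icao, section, row=1, col=1):
    """Build the XPath of one table cell inside a section of an airfield page.

    Assumes IDs of the form "AD-2.<ICAO>" and "<ICAO>-<section>".
    """
    icao = icao.strip().upper()
    if len(icao) != 4 or not icao.isalpha():
        raise ValueError(f"not a 4-letter ICAO code: {icao!r}")
    return (
        f"//*[@id='AD-2.{icao}']"
        f"//*[@id='{icao}-{section}']"
        f"/table[1]/tbody[1]/tr[{row}]/td[{col}]"
    )


print(cell_xpath("ebbe", "AD-2.12"))
```

With selenium the result would be passed to driver.find_element(By.XPATH, ...), with By imported from selenium.webdriver.common.by.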

My first observation is that these are static web pages, so you can avoid the complexity of selenium and just use requests to load the page data.

This page https://ops.skeyes.be/html/belgocontrol_static/eaip/eAIP_Main/html/eAIP/EB-AD-2.EBBE-en-GB.html#AD-2.EBBE is nicely structured and you should be able to walk its logical HTML structure with something like beautifulsoup4. The IDs you mention are attributes of HTML tags on this page.
I assume it will be obvious to you how to extract the data you need once you find the <table> for an airport.
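A minimal sketch of walking the structure with beautifulsoup4, assuming the table of interest sits inside an element with a known id. The function name table_rows_by_id and the sample HTML are made up for illustration, and the sample is not a copy of the real page.

```python
from bs4 import BeautifulSoup


def table_rows_by_id(html, section_id):
    """Return the rows of the first table inside the element with this id.

    Each row is a list of cell texts; an empty list means nothing was found.
    """
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find(id=section_id)
    if section is None:
        return []
    table = section.find("table")
    if table is None:
        return []
    return [
        [cell.get_text(" ", strip=True) for cell in row.find_all(["td", "th"])]
        for row in table.find_all("tr")
    ]


# Made-up stand-in for a page. A real page could be loaded with
# requests.get(url, timeout=30).text instead.
sample = """
<div id="AD-2.EBBE"><div id="EBBE-AD-2.12">
<table><tbody>
<tr><td>label 1</td><td>value 1</td></tr>
<tr><td>label 2</td><td>value 2</td></tr>
</tbody></table>
</div></div>
"""

for row in table_rows_by_id(sample, "EBBE-AD-2.12"):
    print(row)
```

Once the rows are plain lists, picking a value by row label rather than by element ID avoids depending on the IDs that change from page to page.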

The second page https://www.sia.aviation-civile.gouv.fr/dvd/eAIP_25_JAN_2024/FRANCE/AIRAC-2024-01-25/html/eAIP/FR-AD-2.LFOA-fr-FR.html#AD-2.eAIP.LFOA is XML, and again you can just fetch it with requests.
I do not see the IDs on this page, so you will need to figure out the format of this page and write data extraction code that understands it.

I’d assume that you will end up with a small number of page designs that you need to detect, running the matching data extraction code on each.

Note: if some pages are created with JavaScript, then you would need to keep using selenium.
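One way to organise that, as a rough sketch: keep one extraction function per page design and pick it from the URL. Detecting the design by hostname is an assumption made for this example (inspecting the page content would work too), and the extractor functions are empty placeholders.

```python
from urllib.parse import urlparse


def extract_design_a(page_text):
    # Placeholder: real code would walk the HTML structure here.
    return {"design": "A", "length": len(page_text)}


def extract_design_b(page_text):
    # Placeholder: real code would parse the other page format here.
    return {"design": "B", "length": len(page_text)}


# Hypothetical registry: hostname -> extraction function for that design.
EXTRACTORS = {
    "ops.skeyes.be": extract_design_a,
    "www.sia.aviation-civile.gouv.fr": extract_design_b,
}


def pick_extractor(url):
    """Return the extraction function registered for this URL's host."""
    host = urlparse(url).hostname
    try:
        return EXTRACTORS[host]
    except KeyError:
        raise ValueError(f"no extractor registered for {host!r}") from None


extractor = pick_extractor("https://ops.skeyes.be/html/some-page.html")
print(extractor("<html></html>"))
```

Adding support for another country's site then means writing one more function and adding one line to the registry.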

I’ve done something similar with tables, but not from HTML. I looked at the HTML for the AIP for BELGIUM (section AD-2.EBBE), valid from 25 JAN 2024, and line 9 holds everything on a single line. Here’s how I would start. These are not all the steps you will need, but they describe how I look for data in tables in HTML.

  1. I would start by loading the plain-text HTML into a string variable called “original_html”.
  2. Set new_html = original_html to make a copy. We will work with new_html.
  3. In new_html, prefix every <table, </table and <tr with \n (a newline), so that each table row starts on its own line.
  4. Now split on \n to put new_html into a list called “list_html”. This gives one table row per list entry.
  5. Now you can loop through every entry in list_html. If the entry begins with <tr, you have the beginning of a table row and you can search for more data that way.
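The steps above can be sketched in a few lines of Python. The variable names follow the list; the function name split_table_rows and the sample HTML are made up for illustration.

```python
def split_table_rows(original_html):
    """Return the list entries that start with <tr, one per table row."""
    # Step 2: work on new_html (strings are immutable, so the original
    # is never modified).
    new_html = original_html
    # Step 3: put a newline in front of every <table, </table and <tr.
    for tag in ("<table", "</table", "<tr"):
        new_html = new_html.replace(tag, "\n" + tag)
    # Step 4: split on newlines to get one entry per table row.
    list_html = new_html.split("\n")
    # Step 5: keep the entries that begin a table row.
    return [entry for entry in list_html if entry.startswith("<tr")]


# Step 1: the page text; here a made-up single-line document.
original_html = (
    "<html><body><table>"
    "<tr><td>a</td></tr><tr><td>b</td></tr>"
    "</table></body></html>"
)

for row in split_table_rows(original_html):
    print(row)
```

This relies on the rows not containing newlines of their own, which fits a page where everything is on one line but is fragile on other layouts.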

Using beautifulsoup4 means you do not need to hack the string. You can process the HTML logically.

Thank you both for your answers. Last Friday I finally ended up with something that almost works, but I will explore your solutions to see if I can make it more automated.
I’ll come back to you with the approach I find works best.