Learing how to scrape/parse data from lists of urls. Need help

MAmOng0 · March 10, 2023, 11:53am

Hi
I am learning Python and need to parse data from a website like headhunter. I have got the sitemap(https://hh.ru/sitemap/main.xml) from the site with a lot of urls, every url contains even more urls, which lead you to the website. My task is to only find urls which will lead me to pages with a text like “working from home”.
I dont understand how to open and filter so much urls. please help and explain if you can.

cameron · March 11, 2023, 1:17am

I’d have a queue of URLs to visit i.e. a list, which initially contains
just your starting URL above. Then a simple loop which runs until the
list is empty:

 q = ['https://hh.ru/sitemap/main.xml']
 while q:
     url = q.pop(0)

Then use the urllib module to fetch the content of url. Pay
attention to the Content-Type of the fetched data.

A couple of test fetches suggest to me that URLs with a content type of
text/xml contain XML with lists of other URLs. If you get one of them
parse the XML data using the xml.etree.Elementtree.parse function and
append all the URLs to your list q.

The parse function is documented here:

If you get a text/html content type and the URL commences with
https://hh.ru/vacancies/ then it seems to be a job posting. If you get
one of these, I’d suggest parsing its data with the third party library
beautifulsoup4: beautifulsoup4 · PyPI

Then you can search its text for the terms you want, eg “remote work”.

Things to consider: keep a set containing URLs you’ve already visited.
You can start your loop by skipping such URLs so that loops in the index
do not cause you to walk the site forever:

     seen = set()
     q = ['https://hh.ru/sitemap/main.xml']
     while q:
         url = q.pop(0)
         if url in seen:
             continue
         seen.add(url)
         ... process the URL ...

The other thing it to avoid walking off the site. Just ignore URLs not
starting with https://hh.ru/:

     seen = set()
     q = ['https://hh.ru/sitemap/main.xml']
     while q:
         url = q.pop(0)
         if not url.startswith('https://hh.ru/'):
             continue
         if url in seen:
             continue
         seen.add(url)
         ... process the URL ...

Cheers,
Cameron Simpson cs@cskk.id.au