I am learning Python and need to parse data from a website like headhunter. I have got the sitemap (https://hh.ru/sitemap/main.xml) from the site, which contains a lot of URLs; each of those URLs contains even more URLs, which lead to pages on the website. My task is to find only the URLs that lead to pages containing text like “working from home”.
I don't understand how to open and filter so many URLs. Please help and explain if you can.
I’d have a queue of URLs to visit i.e. a list, which initially contains
just your starting URL above. Then a simple loop which runs until the
list is empty:
    q = ['https://hh.ru/sitemap/main.xml']
    while q:
        url = q.pop(0)
Then use the
urllib module to fetch the content of the URL, paying
attention to the
Content-Type of the fetched data.
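As a rough sketch of that fetch step: the helper below (the name `fetch` and the 30 second timeout are my own choices, not anything the site requires) opens a URL with `urllib.request.urlopen` and returns the content type together with the raw bytes.

```python
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Return (content_type, body_bytes) for the given URL.

    The content type is the bare MIME type, e.g. 'text/xml',
    without any '; charset=...' parameter.
    """
    with urlopen(url, timeout=timeout) as response:
        # response.headers is an email.message.Message, so
        # get_content_type() gives the lowercased MIME type.
        content_type = response.headers.get_content_type()
        body = response.read()
    return content_type, body
```

You would call it inside the loop as `content_type, body = fetch(url)` and then branch on `content_type`. Real sites can also raise `urllib.error.URLError` or refuse requests, so expect to wrap the call in a `try`/`except` once you run it for real.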
A couple of test fetches suggest to me that URLs with a content type of
text/xml contain XML with lists of other URLs. If you get one of them
parse the XML data using the
xml.etree.ElementTree.parse function and
append all the URLs to your list.
The parse function is documented in the xml.etree.ElementTree section of the Python standard library documentation.
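Here is a minimal sketch of pulling the URLs out of such an XML document. It assumes the data follows the usual sitemap layout, where each URL sits in a `<loc>` element inside an XML namespace; I have not checked every file on the site, so treat that as an assumption. The function name `sitemap_urls` is made up for this example.

```python
import io
import xml.etree.ElementTree as ET


def sitemap_urls(xml_bytes):
    """Return the text of every <loc> element in a sitemap document."""
    # ET.parse wants a filename or file object, so wrap the bytes.
    tree = ET.parse(io.BytesIO(xml_bytes))
    urls = []
    for element in tree.iter():
        # Namespaced tags look like '{namespace}loc'; keep only the
        # part after the closing brace before comparing.
        local_name = element.tag.rsplit('}', 1)[-1]
        if local_name == 'loc' and element.text:
            urls.append(element.text.strip())
    return urls
```

In the loop that becomes `q.extend(sitemap_urls(body))` whenever the content type is text/xml.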
If you get a
text/html content type and the URL commences with
https://hh.ru/vacancies/ then it seems to be a job posting. If you get
one of these, I’d suggest parsing its data with the third party library
beautifulsoup4, which is available from PyPI.
Then you can search its text for the terms you want, e.g. “remote work”.
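To show the idea of the text search without needing an install, here is a sketch that uses the standard library's `html.parser` as a stand-in for beautifulsoup4. The names `TextExtractor` and `page_mentions` are invented for this example, and the phrases you pass in are up to you. With beautifulsoup4 installed, `BeautifulSoup(html, 'html.parser').get_text()` should give you comparable text to search.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def page_mentions(html, phrases):
    """Return True if any phrase occurs in the page text (case-insensitive)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so 'remote   work' still matches.
    text = ' '.join(' '.join(parser.chunks).split()).casefold()
    return any(phrase.casefold() in text for phrase in phrases)
```

Note that `page_mentions` expects a `str`, so decode the fetched bytes first, e.g. `body.decode('utf-8', errors='replace')`.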
Things to consider: keep a
set containing URLs you’ve already visited.
You can start your loop by skipping such URLs so that loops in the index
do not cause you to walk the site forever:
    seen = set()
    q = ['https://hh.ru/sitemap/main.xml']
    while q:
        url = q.pop(0)
        if url in seen:
            continue
        seen.add(url)
        # ... process the URL ...
The other thing is to avoid walking off the site. Just ignore URLs not
starting with https://hh.ru/:
    seen = set()
    q = ['https://hh.ru/sitemap/main.xml']
    while q:
        url = q.pop(0)
        if not url.startswith('https://hh.ru/'):
            continue
        if url in seen:
            continue
        seen.add(url)
        # ... process the URL ...
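Putting the pieces together, one possible shape for the whole walk is sketched below. It is only a sketch under some assumptions: `crawl`, `fetch` and `is_match` are names I made up, `fetch(url)` is expected to return a `(content_type, body_bytes)` pair, `is_match(body)` decides whether a job page is interesting, and accepting `application/xml` alongside `text/xml` is my own guess rather than something I saw from the site. I also swapped the list for a `collections.deque`, since `popleft()` is cheaper than `pop(0)` on a long queue.

```python
from collections import deque
import xml.etree.ElementTree as ET

SITE_PREFIX = 'https://hh.ru/'
JOB_PREFIX = 'https://hh.ru/vacancies/'


def crawl(start_url, fetch, is_match):
    """Walk the sitemap from start_url and return matching job URLs."""
    seen = set()
    q = deque([start_url])
    matches = []
    while q:
        url = q.popleft()
        if not url.startswith(SITE_PREFIX):
            continue  # do not walk off the site
        if url in seen:
            continue  # do not visit the same URL twice
        seen.add(url)
        content_type, body = fetch(url)
        if content_type in ('text/xml', 'application/xml'):
            # An index page: queue every URL it lists.
            root = ET.fromstring(body)
            for element in root.iter():
                if element.tag.endswith('loc') and element.text:
                    q.append(element.text.strip())
        elif content_type == 'text/html' and url.startswith(JOB_PREFIX):
            # A job posting: keep it if the caller's test passes.
            if is_match(body):
                matches.append(url)
    return matches
```

Passing `fetch` and `is_match` in as arguments means you can try the loop against a small dictionary of fake pages before pointing it at the real site. When you do run it for real, be polite: add a short `time.sleep()` between fetches and check the site's robots.txt.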
Cameron Simpson email@example.com