Hi
I am learning Python and need to parse data from a website like headhunter. I have got the sitemap (https://hh.ru/sitemap/main.xml) from the site with a lot of URLs; every URL contains even more URLs, which lead you to the website. My task is to find only the URLs that lead to pages with text like “working from home”.
I don't understand how to open and filter so many URLs. Please help and explain if you can.
I’d have a queue of URLs to visit i.e. a list, which initially contains
just your starting URL above. Then a simple loop which runs until the
list is empty:
    q = ['https://hh.ru/sitemap/main.xml']
    while q:
        url = q.pop(0)
Then use the urllib module (specifically urllib.request.urlopen) to
fetch the content of url. Pay attention to the Content-Type of the
fetched data.
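A minimal sketch of that fetch step, using only the standard library.
The helper name fetch and the 30 second timeout are my own choices for
illustration:

```python
import urllib.request


def fetch(url, timeout=30):
    """Return (content_type, body_bytes) for url."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # get_content_type() returns just the MIME type in lower case,
        # without parameters such as "; charset=utf-8".
        content_type = response.headers.get_content_type()
        body = response.read()
    return content_type, body


# Example call (needs network access, so it is left commented out):
# content_type, body = fetch('https://hh.ru/sitemap/main.xml')
```

Some sites refuse requests carrying urllib's default User-Agent. I have
not checked whether this one does; if it does, passing a
urllib.request.Request with a headers dict instead of the bare URL is
the usual workaround.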
A couple of test fetches suggest to me that URLs with a content type of
text/xml contain XML with lists of other URLs. If you get one of them,
parse the XML data using the xml.etree.ElementTree.parse function and
append all the URLs to your list q.
The parse function is documented in the xml.etree.ElementTree section
of the Python standard library documentation.
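Here is a rough sketch of pulling the URLs out of one of those XML
documents. It assumes the files follow the usual sitemaps.org layout,
where each URL sits in a <loc> element; I have not verified that for
every file on this site. Because the data is already in memory as
bytes, the sketch uses ElementTree.fromstring rather than parse, and
the name extract_locs is just for illustration:

```python
import xml.etree.ElementTree as ET


def extract_locs(xml_bytes):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_bytes)
    urls = []
    for element in root.iter():
        # Namespaced tags look like "{namespace-uri}loc"; keep only the
        # part after the closing brace.
        local_name = element.tag.rsplit('}', 1)[-1]
        if local_name == 'loc' and element.text:
            urls.append(element.text.strip())
    return urls
```

Inside the loop you would then do q.extend(extract_locs(body)) for each
text/xml response.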
If you get a text/html content type and the URL commences with
https://hh.ru/vacancies/ then it seems to be a job posting. If you get
one of these, I’d suggest parsing its data with the third party library
beautifulsoup4 (available from PyPI).
Then you can search its text for the terms you want, eg “remote work”.
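A sketch of that text search, assuming the HTML has already been
decoded to a str. The names page_text and page_mentions are mine, and
the search terms are whatever you choose to pass in. Since
beautifulsoup4 is third party, the sketch falls back to a much cruder
extractor built on the standard library's html.parser when it is not
installed:

```python
from html.parser import HTMLParser

try:
    from bs4 import BeautifulSoup  # third party: pip install beautifulsoup4
except ImportError:
    BeautifulSoup = None


class _TextCollector(HTMLParser):
    """Crude stand-in used only when beautifulsoup4 is not installed."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def page_text(html):
    """Return the text content of an HTML document as one string."""
    if BeautifulSoup is not None:
        soup = BeautifulSoup(html, 'html.parser')
        return soup.get_text(' ', strip=True)
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return ' '.join(collector.parts)


def page_mentions(html, terms):
    """Return True if any term occurs in the page text, ignoring case."""
    text = page_text(html).casefold()
    return any(term.casefold() in text for term in terms)
```

For example, page_mentions(html, ['remote work', 'working from home'])
is true for a page containing either phrase. On a Russian-language site
you would pass the Russian phrases instead.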
Things to consider: keep a set containing URLs you’ve already visited.
You can start your loop by skipping such URLs so that loops in the index
do not cause you to walk the site forever:
    seen = set()
    q = ['https://hh.ru/sitemap/main.xml']
    while q:
        url = q.pop(0)
        if url in seen:
            # already visited, skip it
            continue
        seen.add(url)
        # ... process the URL ...
The other thing is to avoid walking off the site. Just ignore URLs not
starting with https://hh.ru/:
    seen = set()
    q = ['https://hh.ru/sitemap/main.xml']
    while q:
        url = q.pop(0)
        if not url.startswith('https://hh.ru/'):
            # off site, ignore it
            continue
        if url in seen:
            # already visited, skip it
            continue
        seen.add(url)
        # ... process the URL ...
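Putting the pieces together, the whole walk might look something like
the sketch below. It is only an outline under a few assumptions: fetch
and matches are hypothetical helpers you supply yourself (fetch(url)
returns the content type and the body bytes, matches(body) says whether
a page is of interest), and the XML files keep their URLs in <loc>
elements. It also swaps the list for a collections.deque, because
popping from the front of a long list gets slow:

```python
from collections import deque
import xml.etree.ElementTree as ET

SITE_PREFIX = 'https://hh.ru/'
JOB_PREFIX = 'https://hh.ru/vacancies/'


def crawl(start_url, fetch, matches):
    """Walk the sitemap from start_url; return matching job URLs.

    fetch(url) must return (content_type, body_bytes).
    matches(body_bytes) must return True for pages of interest.
    Both are supplied by the caller.
    """
    seen = set()
    q = deque([start_url])
    found = []
    while q:
        url = q.popleft()
        if not url.startswith(SITE_PREFIX) or url in seen:
            continue
        seen.add(url)
        try:
            content_type, body = fetch(url)
        except OSError as e:
            # urllib's URLError and HTTPError are OSError subclasses.
            print('skipping', url, e)
            continue
        if content_type == 'text/xml':
            # An index page: queue every URL it lists.
            for element in ET.fromstring(body).iter():
                local_name = element.tag.rsplit('}', 1)[-1]
                if local_name == 'loc' and element.text:
                    q.append(element.text.strip())
        elif content_type == 'text/html' and url.startswith(JOB_PREFIX):
            # Looks like a job posting: test its content.
            if matches(body):
                found.append(url)
    return found
```

Passing fetch and matches in as arguments means you can try the loop
against a small dict of fake pages before pointing it at the real site.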
Cheers,
Cameron Simpson cs@cskk.id.au