Python has a standard library module for parsing HTML named html.parser. We can subclass HTMLParser to parse HTML and define our own methods to handle tags and data. Two methods take an attrs parameter: handle_starttag and handle_startendtag. The attrs parameter is currently a list of tuples holding the attributes found in the tag. Suppose we have the HTML <a href="https://www.cwi.nl/">; then attrs will be [('href', 'https://www.cwi.nl/')]. I'd like to hear from all of you: what if we changed this to a dict instead, so that attrs would be {'href': 'https://www.cwi.nl/'}? The main reason I want this change is so that we can easily look up an attribute without traversing the list.
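For anyone who wants to see the current behaviour for themselves, here is a minimal sketch. The AttrCollector class and its seen attribute are hypothetical names made up for illustration; only HTMLParser, feed, close and handle_starttag come from the standard library.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical subclass that records what handle_starttag receives."""

    def __init__(self):
        super().__init__()
        self.seen = []  # (tag, attrs) pairs, in document order

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.seen.append((tag, attrs))


parser = AttrCollector()
parser.feed('<a href="https://www.cwi.nl/">')
parser.close()
print(parser.seen)  # [('a', [('href', 'https://www.cwi.nl/')])]
```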
Ignoring the challenge of breaking backwards compatibility here, there’s an issue with HTML itself: it’s a parse error to include the same attribute name more than once (e.g. <a href="..." href="...">), but the spec states this is a recoverable error and the first value should be kept.
You could argue that HTMLParser should do exactly that [1]. But I think it’s conceivable that someone using this class wants to see the repeated attribute in their code (let’s say they’re building an HTML validator!), which means a dictionary wouldn’t work.
[1] Which means it would need to be smart about building this dictionary.
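To make the trade-off concrete, here is a rough sketch of what "keep the first value" could look like, and what it hides. The first_wins helper is a hypothetical name, and the attrs list is written out by hand as an assumed example of what a handler would receive for a tag with a repeated href; it is not something HTMLParser provides.

```python
def first_wins(attrs):
    """Build a dict from (name, value) pairs, keeping the first value per name."""
    result = {}
    for name, value in attrs:
        # setdefault only stores the value if the name is not present yet
        result.setdefault(name, value)
    return result


# Assumed input: the pairs for <a href="..." href="..."> with two different URLs
attrs = [("href", "https://www.cwi.nl/"), ("href", "https://python.org/")]

as_dict = first_wins(attrs)
print(as_dict)                    # {'href': 'https://www.cwi.nl/'}
print(len(attrs), len(as_dict))   # 2 1
```

The list still shows that href appeared twice, which is what a validator would need; the dict no longer does.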
Honestly, I had never thought about this before. You're right, I agree with you.
What do you think about allowing a list as a value in attrs? Suppose we have the HTML <a href="https://www.cwi.nl/" href="https://python.org/">. Then attrs would become {"href": ["https://www.cwi.nl/", "https://python.org/"]}, and if we have the HTML <a href="https://www.cwi.nl/">, then we would have {'href': 'https://www.cwi.nl/'}. This would allow a validator to check for duplicates. What do you think?
If you have a list of tuples, you can easily create a dict:
attrs_dict = dict(attrs_list)
or (if you want to keep the first occurrence rather than the last one)
attrs_dict = dict(reversed(attrs_list))
or write your own code if you want to map strings to lists of values.
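As a rough sketch of the three options above, assuming a hand-written attrs_list with a repeated href (group_attrs is a hypothetical helper name, not part of html.parser):

```python
from collections import defaultdict

# Assumed example input with a duplicated attribute name
attrs_list = [("href", "https://www.cwi.nl/"), ("href", "https://python.org/")]

# dict() lets later pairs overwrite earlier ones, so the last value is kept
last_kept = dict(attrs_list)

# Reversing first means the original first pair is written last and survives
first_kept = dict(reversed(attrs_list))


def group_attrs(attrs):
    """Map each attribute name to the list of all its values, in order."""
    grouped = defaultdict(list)
    for name, value in attrs:
        grouped[name].append(value)
    return dict(grouped)


print(last_kept)                # {'href': 'https://python.org/'}
print(first_kept)               # {'href': 'https://www.cwi.nl/'}
print(group_attrs(attrs_list))  # {'href': ['https://www.cwi.nl/', 'https://python.org/']}
```

Always mapping to a list, even for a single value, means calling code does not have to check whether it got a string or a list.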
Thanks for your advice! Really helpful.