Change the attrs parameter in handle_starttag and handle_startendtag from a list of tuples to a dict

Python has a standard library module for parsing HTML named html.parser. We can subclass HTMLParser and define our own methods to handle tags and data. Two of those methods take an attrs parameter: handle_starttag and handle_startendtag. Currently attrs is a list of (name, value) tuples holding the attributes found in the tag. Suppose we have the HTML <a href="">; then attrs will equal [('href', '')]. I want to hear from all of you: what if we changed this to a dict instead, so that attrs would equal {'href': ''}? The main reason I want this change is so that we can easily look up an attribute by name without traversing the list.
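To make the request concrete, here is a minimal sketch of what a handler has to do today to look up one attribute by name. The LinkCollector class and its hrefs attribute are names made up for this illustration; only HTMLParser, feed and handle_starttag come from the standard library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical example class: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples,
        # so we convert it ourselves before doing a lookup by name.
        attr_map = dict(attrs)
        if tag == "a" and "href" in attr_map:
            self.hrefs.append(attr_map["href"])


parser = LinkCollector()
parser.feed('<a href="https://example.org/">x</a><a name="top">y</a>')
print(parser.hrefs)
```

The conversion is one line per handler, which is the convenience the proposal would build in.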

HTML Parser docs

Ignoring the challenge of breaking backwards compatibility here, there’s an issue with HTML itself: it’s a parsing error to include the same attribute name more than once (i.e. <a href="..." href="...">), but the spec states this is a recoverable error and the first value should be kept.

You could argue that HTMLParser should do exactly that [1]. But I think it’s conceivable that someone using this class wants to see the repeated attribute in their code (let’s say they’re building an HTML validator!), which means a dictionary wouldn’t work.

  1. which means it needs to be smart about building this dictionary ↩︎
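As a rough sketch of the validator case, the following relies on attrs keeping every occurrence of an attribute in source order, which is what the list-of-tuples form allows. DuplicateAttrChecker and its duplicates list are hypothetical names for this example, not part of html.parser.

```python
from collections import Counter
from html.parser import HTMLParser


class DuplicateAttrChecker(HTMLParser):
    """Hypothetical validator: records (tag, attribute) pairs that repeat."""

    def __init__(self):
        super().__init__()
        self.duplicates = []

    def handle_starttag(self, tag, attrs):
        # Count how often each attribute name appears in this tag.
        counts = Counter(name for name, _value in attrs)
        for name, count in counts.items():
            if count > 1:
                self.duplicates.append((tag, name))


checker = DuplicateAttrChecker()
checker.feed('<a href="/first" href="/second">link</a>')
print(checker.duplicates)
```

With a plain dict, the second href would have been merged away before the handler ever saw it, so this check could not be written.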

Honestly, I had never thought about this before. You're right, I agree with you.

What do you think about allowing a list as a value in attrs? Suppose we have the HTML <a href="" href="">. Then attrs would become {"href": ["", ""]}, and for <a href=""> we would still get {'href': ''}. This would let a validator check for duplicates. What do you think?

If you have a list of tuples, you can easily create a dict:

attrs_dict = dict(attrs_list)

or (if you want to keep the first occurrence rather than the last one)

attrs_dict = dict(reversed(attrs_list))

or write your own code if you want to map strings to lists of values.
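For the last option, one possible sketch is below. group_attrs is a hypothetical helper name, and it always maps a name to a list (even for a single value) so callers do not have to check whether they got a string or a list.

```python
def group_attrs(attrs):
    """Map each attribute name to the list of its values, in source order."""
    grouped = {}
    for name, value in attrs:
        grouped.setdefault(name, []).append(value)
    return grouped


# Sample input in the shape that handle_starttag receives.
attrs_list = [("href", "/first"), ("class", "nav"), ("href", "/second")]

grouped = group_attrs(attrs_list)

# For comparison: plain dict() keeps the last value,
# dict(reversed(...)) keeps the first one.
last_wins = dict(attrs_list)
first_wins = dict(reversed(attrs_list))

print(grouped)
print(last_wins["href"], first_wins["href"])
```

A validator can then flag any name whose list has more than one entry.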


Thanks for your advice! Really helpful.