Change attrs parameter in handle_starttag and handle_startendtag from list of tuples to dict

kira · August 23, 2023, 3:15am

Python's standard library includes html.parser for parsing HTML. We can subclass HTMLParser and define our own methods to handle tags and data. Two of those methods take an attrs parameter: handle_starttag and handle_startendtag. attrs is currently a list of tuples holding the attributes found in the tag. For example, given the HTML <a href="https://www.cwi.nl/">, attrs is [('href', 'https://www.cwi.nl/')]. I'd like to hear what you all think: what if we changed this to a dict, so that attrs would be {'href': 'https://www.cwi.nl/'}? The main reason I want this change is so we can easily look up an attribute without traversing the list.
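For anyone who wants to see the current behaviour, here is a minimal sketch. The class name AttrCollector and the seen list are made up for illustration; only HTMLParser, feed, close and handle_starttag come from the standard library.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper that records every start tag with its attrs."""

    def __init__(self):
        super().__init__()
        self.seen = []  # list of (tag, attrs) pairs, in document order

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.seen.append((tag, attrs))


parser = AttrCollector()
parser.feed('<a href="https://www.cwi.nl/">')
parser.close()

tag, attrs = parser.seen[0]
print(tag, attrs)

# Looking up one attribute today means scanning the list
href = next((value for name, value in attrs if name == "href"), None)
print(href)
```

The next(...) scan at the end is the lookup that a dict would turn into attrs["href"].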

HTML Parser docs

jamestwebber · August 23, 2023, 3:33am

Ignoring the challenge of breaking backwards compatibility here, there’s an issue with HTML itself: it’s a parsing error to include the same attribute name more than once (i.e. <a href="..." href="...">), but the spec states this is a recoverable error and the first value should be kept.

You could argue that HTMLParser should do exactly that [1]. But I think it’s conceivable that someone using this class wants to see the repeated attribute in their code (let’s say they’re building an HTML validator!), which means a dictionary wouldn’t work.

[1] which means it needs to be smart about building this dictionary
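To make the validator case concrete, here is a rough sketch of a parser that reports repeated attribute names. DuplicateAttrFinder and its duplicates list are hypothetical names for illustration. It relies on the current list-of-tuples attrs, where both occurrences are passed through; with a plain dict the repeat would no longer be visible.

```python
from collections import Counter
from html.parser import HTMLParser


class DuplicateAttrFinder(HTMLParser):
    """Hypothetical validator helper: collects tags with repeated attribute names."""

    def __init__(self):
        super().__init__()
        self.duplicates = []  # list of (tag, [repeated attribute names])

    def handle_starttag(self, tag, attrs):
        # Count how often each attribute name occurs in this tag
        counts = Counter(name for name, _ in attrs)
        repeated = sorted(name for name, n in counts.items() if n > 1)
        if repeated:
            self.duplicates.append((tag, repeated))


finder = DuplicateAttrFinder()
finder.feed('<a href="https://www.cwi.nl/" href="https://python.org/">')
finder.close()
print(finder.duplicates)
```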

kira · August 23, 2023, 3:37am

Honestly, I never thought about this before. You're right, I agree with you.

kira · August 23, 2023, 7:36am

What if attrs allowed a list as a value? Suppose we have the HTML <a href="https://www.cwi.nl/" href="https://python.org/">. Then attrs becomes {"href": ["https://www.cwi.nl/", "https://python.org/"]}, and if we have the HTML <a href="https://www.cwi.nl/"> then we get {'href': 'https://www.cwi.nl/'}. This would allow a validator to check for repeated attributes. What do you think?

storchaka · August 23, 2023, 9:26am

If you have a list of tuples, you can easily create a dict:

attrs_dict = dict(attrs_list)

or (if you want to keep the first occurrence rather than the last one)

attrs_dict = dict(reversed(attrs_list))

or write your own code if you want to map strings to lists of values.
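For the last option, one possible sketch uses collections.defaultdict. The function name attrs_to_multidict is made up for illustration, and unlike the mixed string-or-list shape suggested earlier in the thread, this version always maps a name to a list, so callers do not need to check the value type.

```python
from collections import defaultdict


def attrs_to_multidict(attrs_list):
    """Group (name, value) tuples into {name: [value, ...]}, keeping order."""
    grouped = defaultdict(list)
    for name, value in attrs_list:
        grouped[name].append(value)
    return dict(grouped)  # plain dict, so missing names raise KeyError


attrs_list = [
    ("href", "https://www.cwi.nl/"),
    ("href", "https://python.org/"),
    ("class", "external"),
]

print(attrs_to_multidict(attrs_list))
print(dict(attrs_list)["href"])            # last occurrence wins
print(dict(reversed(attrs_list))["href"])  # first occurrence wins
```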

kira · August 23, 2023, 11:15am

Thanks for your advice! Really helpful.
