Guidance on gh-91447 (findtext returns empty string on integer zero value)

I have a PR open for this issue, and I am wondering what the best way to move forward with it is.

The issue is that findtext’s implementation returns elem.text or "", which means that any “falsey” value (like 0) gives back an empty string. The documentation states: “Note that if the matching element has no text content an empty string is returned”.

It seems the intent is that an empty string should be returned only if elem.text is None, since that is closer to how “no text content” would be interpreted. I don’t think 0 or an empty list would typically be interpreted as “no text content”.
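To make the difference concrete, here is a minimal sketch comparing the two checks. The helpers text_with_or and text_with_none_check are hypothetical names used only for illustration; they are not part of ElementTree, they just mimic the elem.text or "" pattern and a None-only check.

```python
import xml.etree.ElementTree as ET


def text_with_or(elem):
    # Hypothetical helper: mimics the `elem.text or ""` pattern,
    # so every falsy value collapses to an empty string.
    return elem.text or ""


def text_with_none_check(elem):
    # Hypothetical helper: only a missing text (None) becomes "".
    return "" if elem.text is None else elem.text


root = ET.Element("root")
zero = ET.SubElement(root, "zero")
zero.text = 0                         # non-string, falsy value
empty = ET.SubElement(root, "empty")  # text stays None

for child in root:
    print(child.tag, repr(text_with_or(child)), repr(text_with_none_check(child)))
# zero '' 0
# empty '' ''
```

With the None-only check the integer 0 survives, while an element with no text still yields an empty string.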

The main concern on the PR is that the behavior has been this way for a long time, so there is hesitation about “fixing” it, since that could break someone’s code if they are, for whatever reason, relying on it.

Do we want to put in a fix for that behavior or treat this as purely a documentation issue and simply clarify this behavior? Also, please let me know if there is a better avenue for this sort of question. I tried the core-mentorship mailing list, but I got an automated reply (a couple weeks ago now) that the message is pending approval.

I have a feeling that the ElementTree implementation isn’t really prepared for the text field to be something other than a string or None. I found a C implementation of findtext here which appears to do the check you want (it returns an empty string only if the text attribute is None). So that appears to support your premise that it’s a general-purpose tree structure and the text field can contain any object type. It also suggests that there’s a difference in behavior between the C version and the Python version of the same code. (But it may be a different findtext function; I could not manage to detect a difference between the C and Python code.)

Then again, digging slightly deeper, I found a function element_get_text here which checks whether the text field is a list, and in that case joins its items, assuming they are all strings.
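As a rough illustration of why that matters, here is a Python sketch of that kind of logic. join_if_list is a hypothetical stand-in, not the actual C function; it only assumes, as described above, that a list is joined as if it held strings.

```python
def join_if_list(text):
    # Hypothetical stand-in for the behavior described above:
    # a list is assumed to hold string fragments and is joined.
    if isinstance(text, list):
        return "".join(text)
    return text


print(repr(join_if_list(["ab", "cd"])))  # 'abcd'
print(repr(join_if_list(None)))          # None

try:
    join_if_list([1, 2])  # a list of non-strings breaks the assumption
except TypeError as exc:
    print("TypeError:", exc)
```

Under that assumption, a list of strings works, but a list holding anything else fails inside the join.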

So I have a feeling that the use case of putting non-strings (other than None) in the text attribute isn’t really an intended feature, and just (almost) works by accident. If we start fixing the code to make it clear that it could also be anything else, I worry that we might get more bug reports about cases where this still doesn’t work properly, and then we’d be stuck with fixing those.

In this comment, the author of the issue links to older documentation for ElementTree/Element on the effbot website (https://archive.ph/Ca5FG). They use it as evidence that ElementTree was originally designed to store hierarchical data structures, not just XML.

When looking at what those docs have to say about the text attribute, it is described as:

The element type also provides a text attribute, which can be used to hold additional data associated with the element. As the name implies, this attribute is usually used to hold a text string, but it can be used for other, application-specific purposes.

It specifically calls out “other, application-specific purposes”, says the attribute usually holds text, and goes on to say:

The element type actually provides two attributes that can be used in this way; in addition to text, there’s a similar attribute called tail. It too can contain a text string, an application-specific object, or None. The tail attribute is used to store trailing text nodes when reading mixed-content XML files; text that follows directly after an element are stored in the tail attribute for that element:

It again mentions that this text attribute and tail attribute can be a text string, None, or an application-specific object.

Note that some implementations may only support string objects as text or tail values.

It concludes by saying that some implementations may only support strings for these attributes.

In addition, the current documentation explicitly states something similar in its description of the text and tail attributes (the xml.etree.ElementTree page of the Python 3.12.1 documentation).

Notably

These attributes can be used to hold additional data associated with the element. Their values are usually strings but may be any application-specific object.

and

Applications may store arbitrary objects in these attributes.

Because of that, I would be inclined to think non-string values are intended to be used and handled in the text attribute.
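As a minimal sketch of what the docs describe, this stores application-specific objects in both attributes. The payload dict is made up for illustration, and the sketch only covers attribute storage, not serialization.

```python
import xml.etree.ElementTree as ET

payload = {"id": 0, "tags": []}  # made-up application data

node = ET.Element("node")
node.text = payload  # a non-string value in text
node.tail = 0        # and another one in tail

print(node.text is payload)  # True
print(node.tail == 0)        # True
```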

I hope someone who knows ElementTree better will respond.