The current implementation of unicodedata.normalize() returns a new reference to the input string when the data is already normalized. That is fine for instances of the built-in str type, whose values are guaranteed to be immutable. However, it does not hold for instances of classes inheriting from str: their fields may be modified after instantiation. Given the function’s current implementation, this can cause unexpected sharing of mutable objects with user-defined str subclasses:
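The code example from the original post seems to be missing here; below is a minimal reconstruction of the described behavior. The class name `UserStr` is made up; the names `original` and `verified` come from the discussion below.

```python
import unicodedata


class UserStr(str):
    """A hypothetical user-defined subclass of str."""


original = UserStr("already-normalized ASCII text")
verified = unicodedata.normalize("NFC", original)

# On CPython versions with this optimization, already-normalized input
# is handed back as-is, so both names point to the same UserStr object:
print(verified is original)
print(type(verified).__name__)
```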
In the example above, both original and verified point to the same object: the data of original contains no characters that need normalizing, so unicodedata.normalize() returns a new reference to it. On the other hand, if the passed object does need normalization, the return value is an exact str, not the inheriting type. This behavior is inconsistent and confusing.
The solution would be to use the PyUnicode_FromObject() API instead of Py_NewRef() for the early returns in the normalize() implementation, so that the function always returns an instance of the built-in str type, in line with the built-in str methods (e.g. str.replace(), str.removeprefix(), etc.). One concern is that this change may induce a slight performance loss, due to an additional type check and function call in the early-return cases, but I think improving consistency and correctness is more important.
To be clear, PyUnicode_FromObject() just returns a new reference, without cloning the whole string, when the object is an instance of the built-in str type. So the optimization still applies to non-inherited str; the only additional overhead is a type check and a function call (assuming Py_NewRef() is always inlined).
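A Python-level sketch of the two early-return strategies (the helper names here are made up for illustration; the actual change would be in the C implementation):

```python
def early_return_current(s: str) -> str:
    # Current behavior (Py_NewRef): hand back the input object itself,
    # even when it is an instance of a str subclass.
    return s


def early_return_proposed(s: str) -> str:
    # Proposed behavior (PyUnicode_FromObject semantics): exact str
    # instances are returned as-is; subclass instances are copied
    # into a fresh exact str.
    return s if type(s) is str else str(s)


class Sub(str):
    pass


s = Sub("abc")
print(early_return_current(s) is s)           # True: same (subclass) object
print(type(early_return_proposed(s)) is str)  # True: always an exact str
```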
I don’t think such an optimization for subtypes of str is that important when bug-inducing inconsistency is involved.
Currently, the built-in str methods seem to consistently return an exact str object even when no modification is necessary. I wonder what the rationale behind the difference is, i.e. why str.removeprefix() always returns a built-in str object while unicodedata.normalize() does not. Could it be because one is a method and the other a function? (Although some functions, like abs(), work like “methods”.) Or do you think the built-in methods should apply the same optimization to subtypes of str as well? I’d like to know your thoughts.
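The difference can be observed directly (the subclass name is made up; the last print shows the behavior described in this thread, which may differ on versions where it has been changed):

```python
import unicodedata


class Tagged(str):
    pass


s = Tagged("abc")

# Built-in str methods return an exact str even when nothing changes:
print(type(s.removeprefix("x")) is str)  # True
print(type(s.replace("x", "y")) is str)  # True

# unicodedata.normalize() on already-normalized input, by contrast,
# returns the subclass instance itself (the inconsistency under discussion):
print(type(unicodedata.normalize("NFC", s)) is Tagged)
```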
Probably because they were written at different times by different authors with different priorities. str subclasses are already a pretty niche thing that most people don’t need, and the exact behavior of these methods matters even less most of the time.
The author of normalize decided that this optimization is worth it and/or easy to implement, probably because for some categories of strings it is essentially free: the representation of the string immediately tells you whether anything needs to be done (e.g. this is the case for pure-ASCII strings).
Not copying the string allows normalize to run in O(1) for a vast number of strings by avoiding the O(n) copy operation.
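For an exact str this is easy to observe; unicodedata.is_normalized() (available since Python 3.8) exposes the same quick check:

```python
import unicodedata

# Pure-ASCII text is always in NFC form, so the quick check succeeds
# and normalize() can skip the O(n) copy entirely.
s = "plain ascii text " * 1000

print(unicodedata.is_normalized("NFC", s))    # True
print(unicodedata.normalize("NFC", s) is s)   # True for an exact str
```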
I doubt changing the behavior of normalize is going to be popular unless you can provide really good arguments for it (“consistency” and abstract correctness are not going to be enough; you would probably need to show a decent amount of real-world code that is actively broken, with this as the central failure point).
Changing the behavior of the str methods seems more feasible to me, but it would be a pretty large change that would probably require a PEP.
So the end result is: call str() if you are working with mutable str subclasses and need to make sure a value is actually immutable now.
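A minimal illustration of that workaround (the subclass here is hypothetical):

```python
class MutableStr(str):
    # a str subclass that carries extra, mutable state
    def __init__(self, *args):
        self.note = "can be changed later"


value = MutableStr("payload")
frozen = str(value)            # always yields an exact, immutable str

print(type(frozen) is str)     # True
print(frozen == value)         # True: same character data
print(frozen is value)         # False: no longer the subclass instance
```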
I feel it is likely to be more difficult than expected to reach a consensus on the desired behavior of functions involving str subtypes. Perhaps the problem is not limited to unicodedata.normalize().
As you noted, subclassing the built-in str is a relatively niche thing, even though it is definitely a part of Python and some features in the standard library (e.g. StrEnum) rely on it. Not many people would care about the actual behavior of str subtypes. I personally don’t want to use them, especially now that I know about their inconsistent behavior in some edge cases. The workaround you suggested, making sure the object is converted to a genuine str before or after the function call, makes sense.
Still, even though the issue is minor, I think further discussion towards a unified principle on return types for functions involving str subtypes (or subtypes of other built-in immutables) is needed.
Please open an issue. This is a bug. The type of the result should not depend on the value. An optimization (and this is merely an optimization) should not change the type of the result.
Agreed, but changing this now, 20 years after the function was added to Python, is going to be difficult. I’d still suggest opening a ticket. Perhaps we can make this change by going through a deprecation period.
In general, object constructors and methods should only return self for strict instances of the types in question. The situation is a bit different for functions which work on objects: functions may choose to apply the optimization or not, regardless of whether they are dealing with strict instances of a type or with subclasses.
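That principle is visible in the str() constructor itself, which returns the object unchanged only for strict str instances (the subclass name is made up):

```python
class MyStr(str):
    pass


exact = "hello"
sub = MyStr("hello")

print(str(exact) is exact)  # True: strict instance, returned as-is
print(str(sub) is sub)      # False: subclass instance is copied
print(type(str(sub)))       # <class 'str'>: the copy is an exact str
```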
In your case and for now, it may be better to use the str() wrapper that @Rosuav mentioned.