Make unicodedata.normalize a str method

If folks need to normalize their strings, they can call:

import unicodedata
my_string = unicodedata.normalize('NFC', my_string)

Which is great – however, now that str is always Unicode (and has been for a LONG time), it would be nice if normalize was a str method, so you could simply do:

my_string = my_string.normalize('NFC')

or even more helpful:

a_string.normalize('NFC') == another_string.normalize('NFC')

I think this goes beyond simply saving some people some typing:

As a rule, many (most?) Python developers (or any developers!) aren’t all that aware of normalized forms in Unicode, and what they mean.

But it’s an important idea, and often critical in code, to work with normalized forms.
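For example, two strings that render identically can still compare unequal until you normalize them (a quick illustration using today's API):

```python
import unicodedata

s1 = "caf\u00e9"   # 'café' with the precomposed character U+00E9
s2 = "cafe\u0301"  # 'café' built from 'e' + combining acute accent U+0301

print(s1 == s2)  # False: same rendering, different code points
print(unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2))  # True
```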

So I think it would be very helpful if the concept, and the code, were more exposed – having to dig into unicodedata to get it makes it much less likely for people to find it without a deliberate search, i.e. they have to already know what they are looking for.

Whereas, if it is a str method, then folks are far more likely to notice it when looking at the str docs, and maybe ask themselves “what the heck is normalize?” – and that’s a good thing.

And the saved typing is nice, too :slight_smile:

and maybe is_normalized as well.

Thoughts?

9 Likes

This is a good point, but why not just move the function from unicodedata to string? This follows the general OO design principle that functions that can be implemented using the public interface of an object should not be methods, but rather bare functions. For a beautiful interface, see for example the Array API that roughly follows this principle.

Maybe the string method should only do full-fledged NFKC normalisation (the form Python uses for identifier comparisons), and point to the existing stdlib API for the other normalisation forms?

At the very least, the string method should default to NFKC since that’s the right choice for internal-to-Python use cases like getattr.

(Idea prompted by python - Identifier normalization: Why is the micro sign converted into the Greek letter mu? - Stack Overflow, which mentions the reference to NFKC in the language spec at 2. Lexical analysis — Python 3.13.0 documentation)
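The micro-sign example from that Stack Overflow question, shown with today's API:

```python
import unicodedata

micro = "\u00b5"  # MICRO SIGN (µ)
mu = "\u03bc"     # GREEK SMALL LETTER MU (μ)

print(micro == mu)  # False: distinct code points
# NFKC folds the compatibility character into the Greek letter,
# which is why both spellings name the same Python identifier:
print(unicodedata.normalize("NFKC", micro) == mu)  # True
```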

2 Likes

Is that the case here? Does it need anything internal (at least to be fast in C)?

But anyway, I’d say most of the methods on strings are similar – strings are immutable, so pretty much any method could act on one as a Sequence of characters, which is (part of) the public API.

2 Likes

Yes, that could be a fine default, but please keep the other forms as an option.

Other systems require different normalizations, and it’s my personal opinion that NFKC wasn’t the best choice for Python names (though I get it, and it’s not changing) – never mind all the other uses one might have for normalization.

NOTE: I thought of this because I’m working on standards for the netCDF file format – its spec says that all names should be NFC normalized, so that comparisons will work. But they don’t want to be heavy-handed enough to say that two different Unicode characters that “mean” the same thing should be normalized – as a science-focused format, for instance, if someone used “blackboard bold” in a name, they are doing that to distinguish it from regular characters. (Not that I recommend that, but still …)
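To make the NFC-vs-NFKC difference concrete: NFC leaves compatibility characters such as blackboard-bold letters alone, while NFKC folds them into their plain counterparts:

```python
import unicodedata

bb_r = "\u211d"  # ℝ, DOUBLE-STRUCK CAPITAL R

print(unicodedata.normalize("NFC", bb_r))   # 'ℝ' – unchanged
print(unicodedata.normalize("NFKC", bb_r))  # 'R' – folded to plain ASCII
```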

4 Likes

There was a security aspect to that decision. Confusable glyphs are bad enough, but genuinely identical glyph sequences representing different identifiers based on how they’re stored internally? Very, very, not fun to audit (or even debug).

You’ve persuaded me that allowing selection of the other normalisation forms via the method API would make sense, though.

1 Like

I suppose you meant “string method”.

The reason why we have unicodedata as a separate module is that it’s a rather big module and not something that we want to import at Python startup time.

3 Likes

Yes, that’s true, most string methods should probably have been free functions. But do we really want to convert free functions into methods unnecessarily?

No, I meant a free function in the package.

Ah, fair enough.

Ironically, the string methods originally were free functions in the
string module, but during the type-class unification they were shifted
to being methods of the string type.

I can’t remember all the reasons for that, but it does mean they’re
readily available without having to import them. Also, while in
principle they could work on any sequence of chars, for efficiency their
C implementations were and are tightly coupled to the internals of the
string type.

3 Likes

The method spelling also eliminated operand order ambiguity concerns in some of the operations that accept additional strings.

1 Like

I completely agree with this proposal – sometimes people don’t take into account that English is only one of the languages that exist in the world.

Would you mind elaborating on this? I don’t understand :smile:

I think the idea is that for a string manipulating method that takes more than one string as parameters, it can be non-obvious which one is which, e.g. string.replace(str1, str2, str3)

You want to replace the occurrences of str2 found in str1 with str3.

But it can be pretty unclear which is which.

Granted, if you give them reasonable names:

string.replace(original, old, new)

it’s pretty obvious, but not as obvious as if it’s a method, then it’s very clear which string is the “original”.

For the topic at hand:

unicodedata.normalize(form, unistr)

Those are both string inputs – so can be confusing, but:

str.normalize(form)

Is unambiguous.

NOTE: unicodedata.normalize(form, unistr) actually is backward from the “traditional” order, and the opposite of what a method would use, but oh well.

I’ve been coding Python since before string methods, and to my mind, the great benefit of having methods, rather than functions, is so you can easily chain them:

new_str = ",".join(old_str.capitalize().strip().split())

write that with functions, and you’ll see what I mean
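To make that point concrete, here’s the same pipeline spelled both ways – the method chain reads left to right, while the function spelling (using the unbound str methods as stand-ins for free functions) has to be read inside-out:

```python
old_str = "  hello world  "

# Method chaining reads left to right:
chained = ",".join(old_str.capitalize().strip().split())

# The equivalent nested function calls read inside-out:
nested = str.join(",", str.split(str.strip(str.capitalize(old_str))))

print(chained)  # hello,world
print(chained == nested)  # True
```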

This is also why I like pathlib so much – I got really tired of making multiple calls to os.path.*

All this is a bit beside the point now – str has a lot of methods, and that was decided a long time ago. If we’re going to move this one, it makes sense to make it a method.

I’m not sure if this is an argument against moving this function – or just why unicodedata isn’t all in the string module at this point.

But if it is, it seems simple enough to make it an optional import, along the lines of:

def normalize(self, form):
    """Add doctring here"""
    import unicodedata
    return unicodedata.normalize(form, self)

I basically treat that function like plugging in a USB stick: I’ll always get the arguments in the wrong order at least once, possibly twice.

2 Likes

Then let’s shift it to USB-C …

4 Likes

It could be confusing to use the unqualified name normalize for Unicode normalization.

When I normalize strings, unicodedata.normalize is usually just one building block, along with lowercasing/case folding, replacing runs of non-alphanumerics by dashes/underscores, handling empty strings, or removing diacritics (for languages where this isn’t offensive).
(If you go far enough, it starts making sense to call this slugify rather than normalize. The proper solution has so many knobs that it’s often easier to roll my own.)
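For illustration, a minimal sketch of such a pipeline – the name, the exact steps, and the "untitled" fallback here are all made up; real slugifiers have many more knobs:

```python
import re
import unicodedata

def slugify(s):
    """Hypothetical example pipeline: normalize, case fold, dashify."""
    s = unicodedata.normalize("NFKC", s).casefold()
    s = re.sub(r"[^a-z0-9]+", "-", s).strip("-")
    return s or "untitled"

print(slugify("Hello, World!"))  # hello-world
```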

Some identifier normalization doesn’t need to use unicodedata.normalize at all. For an example, see the PyPA spec for package names: non-ASCII names are declared invalid. (Which illustrates another point: your normalization needs to match your validation.)

Making unicodedata.normalize easier to find would be nice, but we should be careful to not suggest that it’s all you need.


Perhaps it would be useful to add something specifically for Python identifiers, for example:

str.as_identifier(*, allow_keywords=False)
Return the normalized form of a Python identifier, suitable for comparison.
Raise ValueError if the string is not a valid identifier (see str.isidentifier).
If allow_keywords is False (the default), also raise ValueError if the string is a Python keyword (see keyword.iskeyword).
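A rough pure-Python sketch of that proposed behaviour, written as a free function here – the name and semantics are only what the description above suggests, not an actual API:

```python
import keyword
import unicodedata

def as_identifier(s, *, allow_keywords=False):
    """Sketch of the proposed str.as_identifier, as a free function."""
    if not s.isidentifier():
        raise ValueError(f"{s!r} is not a valid identifier")
    # NFKC is what the language spec applies to identifiers
    normalized = unicodedata.normalize("NFKC", s)
    if not allow_keywords and keyword.iskeyword(normalized):
        raise ValueError(f"{s!r} is a Python keyword")
    return normalized

print(as_identifier("\u00b5") == "\u03bc")  # True: µ folds to Greek mu
```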

But maybe ast would be a better place for that.


IMO, implementing a standard that specifically asks for NFC normalization is not a good case for a built-in method. For this specifically, unicodedata is a pretty good home for the normalize function.

Hmm – as str is indeed always Unicode, would it be that confusing?

anyway, if you really think so, then we could call it unicode_normalize or something.

Agreed, there are a lot of ways one might normalize a string, but all the others are pretty use-case specific, and unlikely to be added to the str object. And as you say, if anything is added to the stdlib, it should probably be “slugify”, rather than “normalize”.

Absolutely – do you think having a “normalize” method on str would be suggesting that? I don’t.

ast doesn’t already have something like that? Then yes, it should :slight_smile:

implementing a standard that specifically asks for NFC normalization is not a good case for a built-in method. For this specifically, unicodedata is a pretty good home for the normalize function.

That wasn’t so much a case as an example – this is netCDF, used by scientists who are generally not trained as software developers and not up on the specifics of Unicode. It was just one data point of a use case where it would be helpful for normalize to be more easily discoverable.

In this particular case, I’ve submitted an issue to the netCDF4-python lib to have it auto-normalize variable and attribute names – and for that code (i.e. a lower-level library), unicodedata is a fine place for it to live.

But while it’s a “detail of implementation”, Unicode normalization (or the lack thereof) can bite people in the butt all too often – it would be good to raise its prominence.

1 Like

Javascript already uses this name: String.prototype.normalize() - JavaScript | MDN
As well as .NET: String.Normalize Method (System) | Microsoft Learn

1 Like

The chat has slowed – I think I’ve gotten positive-to-neutral feedback.

I don’t think this would require a PEP[*] – but it would require support from the core devs.

Is there a core dev interested enough to move this forward?

[*] If it does take a PEP, I for one don’t have the bandwidth for that – this just doesn’t matter that much to me.

1 Like