Make unicodedata.normalize a str method

Hi Chris, as mentioned before on this topic, adding a string method for this would require importing (or linking to) the Unicode database that’s part of the unicodedata module. Since this is a huge chunk of data, it was split out into a separate module. Adding a tighter binding would have Python be slower on startup and take up more RAM, even when the feature is not used.

As a result, I don’t believe this will fly.

We could probably have the method redirect to the unicodedata module’s function after importing it on first use, but this would hide the above side-effects in a rather non-obvious way.

Agreed that importing all of unicodedata every time is a non-starter.

We could probably have the method redirect to the unicodedata module’s function after importing it on first use,

as in importing the module only when the method is called:

def normalize(self, form):
    import unicodedata
    return unicodedata.normalize(form, self) 

(or its C equivalent)

but this would hide the above side-effects in a rather non-obvious way.

I suppose so – but there’s no way to normalize a unicode string without that import – who is this going to confuse / affect??

I’m guessing the intersection of folks that would think/care about the overhead of importing unicodedata and who wouldn’t think that calling normalize on a str might have performance implications is tiny :slight_smile:

I also wonder if the code could be refactored to need only the parts of the Unicode database that are required by that method – though that’s probably more work than it’s worth.

I suppose it could go in the string module, as someone suggested on this thread, but I’m not sure that’s enough of an improvement to be worth it. More discoverable, yes, but not much more. (I don’t remember the last time I poked around in the string module)

Thanks for your engagement – as I said, this isn’t important enough to me to make a major effort. There’s good reason to not do this, so this is probably the end of it.

2 Likes

Looking at the code, this would probably be possible, since the decomposition tables are not that large. However, the code is also using an optimization for checking whether a string is normalized already and that needs access to the whole database. Even that could be factored out into a separate table, but this would be a rather big project for little gains…

A documentation patch may be the better option to get more attention to the challenges of normalization in Unicode strings.

I thought you’d got neutral to negative feedback, to be honest:

  • The cost of importing the unicode tables, which you have been reminded of.
  • The fact that “normalisation” means a lot of things to a lot of people, and Unicode normalisation is only one such (although admittedly the only one that isn’t application or domain specific).
  • Add to that the fact that there are multiple forms of Unicode normalisation, and the best default (in the sense of it being the form Python uses internally) isn’t obviously the best form for application use.

I think that having to go to the unicodedata module for this is a useful signal that the user needs to have a better understanding of Unicode than the average developer tends to have. Plus, it’s not as if it’s that difficult to add the import to your code.

4 Likes

well, I guess I’m an optimist :slight_smile:

But yes, probably a dead idea.

That’s the problem – I think it’s really important that average developers do know that it exists, and that they may need to do it. As @malemburg said, maybe we need to put something in the docs to get there – but I have no idea where in the docs that might go.

4 Likes

hmm – this: Compare strings the right way

Is a pretty good reference – I may start pointing people to that …

-CHB

That shouldn’t be true if the feature is limited to NFKC normalisation (since the interpreter always needs that when compiling source code). (I had that benefit in mind when I suggested the restriction to only full normalisation, but I never stated that explicitly)

I’m inclined to agree with @encukou that something like str.as_identifier() or str.normalize_source() would be a better name for that, though (.normalize_source() would be a better name if it just did normalisation, with consumers doing their own isidentifier()/iskeyword()/isnumeric() checks after the normalisation step).

And then point over to unicodedata.normalize for any use cases which need a different level of text normalization.
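As a rough pure-Python sketch of what such a method might boil down to (the name is the hypothetical one from above; the real thing would presumably be implemented in C):

def normalize_source(text):
    """Hypothetical str.normalize_source(): apply the same NFKC
    normalisation the compiler applies to identifiers, leaving
    isidentifier()/iskeyword() checks to the caller."""
    import unicodedata
    return unicodedata.normalize("NFKC", text)

Callers wanting full identifier validation would then still run .isidentifier() / keyword.iskeyword() afterwards, as noted above.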

For .lower() vs .casefold(), I wonder if the str.lower docs should gain a counterpart to the second paragraph in [str.casefold](https://docs.python.org/3/library/stdtypes.html#str.casefold). Something like:

Note that lowercasing may not remove all case distinctions in a string. For example, the German lowercase letter 'ß' is equivalent to "ss". Since 'ß' is already lowercase, 'ß'.lower() == 'ß'. To reliably remove all case distinctions regardless of language, use casefold() ('ß'.casefold() == 'ss').
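For illustration (the difference only shows up with characters like 'ß' that have no simple lowercase counterpart of their case-folded form):

>>> 'Straße'.lower()
'straße'
>>> 'Straße'.casefold()
'strasse'
>>> 'Straße'.lower() == 'STRASSE'.lower()
False
>>> 'Straße'.casefold() == 'STRASSE'.casefold()
True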

Please read my comment above more as “stuff to think about”: friction, rather than opposition to the idea.

IMO, adding a method to a built-in type does need a PEP, but it can be a short one. I can write and sponsor it if there’s a viable proof of concept. As Marc-André says, this would be a rather big project for little gains, technically.

I am not too familiar with the compiler code, but had a look at pegen.c, which is the PEG parser used by the compiler AFAIK. This does indeed import unicodedata and then uses the normalize function to normalize identifiers using NFKC.

However, it only does this for identifiers which are not ASCII and so this doesn’t happen often in code I run, which is why I usually don’t see the module in sys.modules.

4 Likes

Yeah:

>>> import sys
>>> len(sys.modules)
79
>>> a = 1
>>> len(sys.modules)
79
>>> á = 1
>>> len(sys.modules)
80

or worse:

>>> import sys
>>> sys.path.clear()  # for simplicity -- there are cleaner ways to (temporarily) break imports
>>> a = 1
>>> á = 1
ModuleNotFoundError: No module named 'unicodedata'
>>> # Wat?

That isn’t great, but I guess the compiler & REPL can afford “heavy” dependencies with weird failure modes. IMO, for a built-in str method it would be unacceptable.
OTOH, if we add that built-in method, the compiler could get a bit cleaner!

1 Like

I haven’t looked at the code, but wouldn’t it be possible to make that optimization exclusive to unicodedata? It could then be faster to call str.normalize() if you know the string isn’t normalized.

def normalize(form, string):
    if _is_normalized(string, form): # check for ASCII, etc.
        return string
    else:
        return string.normalize(form)

A string method implementation could use just the decomposition tables together with a simplified “is it already normalized?” check that doesn’t need the entire database (e.g. a check for ASCII-only chars). This would make the project a little simpler, but still adds complexity, since the Unicode database data would no longer live in just one file.
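Roughly, in pure Python (with a hypothetical _decompose_and_compose standing in for the decomposition-table work; ASCII text is already normalized in every form, so it can be returned as-is):

def normalize(self, form):
    """Hypothetical str.normalize() that avoids the full Unicode database."""
    if self.isascii():  # cheap pre-check: ASCII is normalized in every form
        return self
    return _decompose_and_compose(form, self)  # needs only the decomposition tables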

Overall, I think str.normalize() would be a nice to have, but given all the complexities involved, I don’t think it’ll happen.

People who need normalization can simply use the unicodedata.normalize() function. And this may even be better, since normalization sounds like an easy thing to do, but there are many pitfalls associated with it and the results may not always be what you want them to be. E.g. NFKC will turn “ﬂuffy” (spelled with the “ﬂ” ligature, U+FB02) into “fluffy”, whereas NFC leaves it untouched. The reason is that the “K” part of the normalization makes some “compatibility” assumptions, which may or may not be what you want or expect.
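For instance (the first character of the input below is the “ﬂ” ligature, U+FB02):

>>> import unicodedata
>>> unicodedata.normalize('NFC', 'ﬂuffy')   # canonical form: ligature preserved
'ﬂuffy'
>>> unicodedata.normalize('NFKC', 'ﬂuffy')  # compatibility form: ligature expanded
'fluffy'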

Realizing that some heavy machinery goes into the normalization may make people aware of such potential unwanted effects.

Python uses NFKC for identifiers, which has some interesting implications:

>>> ﬂuffy = 2
>>> fluffy
2
>>> cﬃ = 3
>>> cffi
3

More on this: Unicode equivalence - Wikipedia

1 Like

W.r.t. hiding side effects, the precedents here are str.encode and bytes.decode, both of which import codecs when they are used (at least for some encodings).

That said, I’d expect that these methods are used a lot more than unicode normalisation.

1 Like

They both use the codecs module (and the encodings package), but codecs itself is always imported at startup and is small compared to unicodedata.

You could argue that the methods do hide side-effects in that they import the codec implementations from the encodings package (and other packages which register codec search functions), and yes, that’s per design :slight_smile:
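That lazy loading is easy to observe – a codec like idna typically isn’t part of the startup set and only gets imported on first use:

>>> import sys
>>> 'encodings.idna' in sys.modules
False
>>> 'example.com'.encode('idna')
b'example.com'
>>> 'encodings.idna' in sys.modules
True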

Absolutely, but I’m not sure if that’s an argument for or against this proposal. A hidden side effect for a common operation is worse than one for an uncommon operation, I’d think.

If str.normalize (were it to exist) is rarely used, then the side effect is less consequential.

And the module would only be loaded when the method was used. In that case, the alternative is for the user to import the unicodedata module themselves – which means they would know they were importing something, but I suspect most wouldn’t have any idea that it’s a substantial import they might care about (and often they needn’t care – most computers have a lot of memory these days…). And if you need to normalize a string, you don’t have another option.

The fact is that if you want to normalize even one string in a program, you need to load the unicodedata module (and it may have been loaded by some third-party package without your being aware).

So might a warning in the docs be enough to let people know, for those few that care?

NOTE: this does make me think that a way to temporarily load a module would be cool:

with import unicodedata as unicodedata:
    unicodedata.normalize('NFC', my_string)

Now THAT is a big change – but this is ideas, yes? :slight_smile:

That’s off topic, and I think the time it takes to import it is the main problem (which this makes worse).

A comment from @Rosuav made me wonder “Just how heavy of a dependency is unicodedata really?”

-X importtime suggests it isn’t that bad, but still on the order of 3-4% increase in startup module load time relative to 3.12.6 (with a warm disk cache):

$ python3 -X importtime -c "import unicodedata"
import time: self [us] | cumulative | imported package
import time:       134 |        134 |   _io
import time:        26 |         26 |   marshal
import time:       168 |        168 |   posix
import time:       246 |        572 | _frozen_importlib_external
import time:        61 |         61 |   time
import time:        80 |        140 | zipimport
import time:        24 |         24 |     _codecs
import time:       163 |        187 |   codecs
import time:       237 |        237 |   encodings.aliases
import time:       322 |        745 | encodings
import time:       120 |        120 | encodings.utf_8
import time:        58 |         58 | _signal
import time:        19 |         19 |     _abc
import time:        94 |        113 |   abc
import time:       106 |        219 | io
import time:        23 |         23 |       _stat
import time:        47 |         69 |     stat
import time:       430 |        430 |     _collections_abc
import time:        21 |         21 |       genericpath
import time:        64 |         84 |     posixpath
import time:       223 |        805 |   os
import time:        48 |         48 |   _sitebuiltins
import time:       187 |        187 |   encodings.utf_8_sig
import time:       466 |        466 |   _distutils_hack
import time:        63 |         63 |   sitecustomize
import time:        64 |         64 |   usercustomize
import time:      1207 |       2837 | site
import time:       164 |        164 | unicodedata

Testing with Python 3.13 instead shifted the exact numbers around a bit, but unicodedata still weighed in at around 3-4% of the total load time of the already imported startup modules.

(Writing this inspired a completely different train of thought, but I’ll put that in a separate reply)

I wonder if we’re looking at this at too low a level. If the goal is “make it easier to compare strings correctly”, perhaps @NeilGirdhar is on to something by suggesting adding an appropriate helper (or helpers) to the string module?

string already imports re, so having it also import unicodedata would be lost in the noise (tangent: it might be nice to see if that re import could be made lazy, though).

For example:

def normalize_text(text, /, *, form="NFKC"):
    """Normalize text using given Unicode normalisation form"""
    import unicodedata
    return unicodedata.normalize(form, text)

def as_python_source(text, /):
    """Normalize text as specified for the compilation of Python source code"""
    return normalize_text(text)

def for_comparison(text, /, *, form="NFKC", casefold=str.casefold):
    """Normalize text for comparison using given form and casefolding method"""
    normalized = normalize_text(text, form=form)
    if casefold is None:
        return normalized
    return casefold(normalized)

def compare_case_sensitive(a, b, /, form="NFKC"):
    """Compare normalized strings for equality, preserving case distinctions"""
    return normalize_text(a, form=form) == normalize_text(b, form=form)

def compare_case_insensitive(a, b, /, form="NFKC"):
    """Compare normalized strings for equality, ignoring all case distinctions and forms"""
    return for_comparison(a, form=form) == for_comparison(b, form=form)
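If helpers along these lines landed in string, usage might look like this (purely illustrative, using the hypothetical helpers above; the second call’s first argument starts with the “ﬂ” ligature):

>>> import string
>>> string.for_comparison('Straße')
'strasse'
>>> string.compare_case_sensitive('ﬂuffy', 'fluffy')
True
>>> string.compare_case_insensitive('STRASSE', 'straße')
True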
1 Like

This function name is misleading, since Python’s compiler only normalizes identifiers using “NFKC” and in addition does some extra checks on these, since not all Unicode code points are accepted in Python identifiers.

I’d remove it for clarity, since the details are complex and would probably better be exposed in a parser related module (perhaps there already is such an API somewhere).
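Just to illustrate that NFKC normalisation and identifier validity are separate checks:

>>> import unicodedata
>>> s = unicodedata.normalize('NFKC', '☃')   # the snowman is unchanged by NFKC
>>> s.isidentifier()
False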

Now, just to give you an idea of where such an idea would be heading…

Another angle to consider would be string collation and the conversions needed for sorting. We don’t have collation support in Python, though. Here are a few references:

If you want to go beyond just textual comparisons, you have to take i18n aspects into account and even go into areas of NLP (natural language processing), so that you can apply stemming, plural/singular normalizations, transliterations, etc.

But this is getting off-topic for the topic of adding a normalize method :slight_smile:

1 Like

To add to this: the unicodedata module is a 1.1MB module (on my machine), so it increases RAM usage when loaded. VSZ (virtual size) goes up by 1.1MB; RSS (resident size) not by as much, since that depends on what you actually use from the Unicode database.
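For anyone who wants a ballpark figure on their own system, the Unix-only resource module can show roughly how much peak RSS grows when the module is loaded (kilobytes on Linux, bytes on macOS; only a rough indicator, since, as noted, RSS depends on what you actually touch):

import resource

before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
import unicodedata  # imported purely to observe the memory effect
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS grew by roughly", after - before)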