Python's sorted(x, key=functools.cmp_to_key(locale.strcoll)) gives unexpected order for 'æøå' (Norwegian letters)

ÆØÅ

In the Norwegian alphabet, the Latin letters a-z are followed by æøå. Still, when I evaluate 'å' > 'ø', Python considers the statement to be false.

How can I change this? Setting LC_COLLATE to no_NO.UTF-8 changes nothing. I run Python 3.9.6 on an M1 Mac with macOS Ventura.

Regards Abuluntu

Have a look at the Sorting HOW TO in the Python 3.12.0 documentation; that may be what you are looking for.

Summarizing the doc and giving a more inductive explanation:

Before we can sort, we must be able to compare.

The default comparison between strings (using, for example, operators like <) does not care about the locale setting. To get a locale-aware comparison, we need the strcoll function from the same locale standard library module that you use to set the locale. It’s defined to implement a three-way comparison, like the old-style cmp function. (This way, a single ordinary function represents the logic for all the comparison operators.) To use that as a key for sorting, we need the cmp_to_key adapter from functools, which then allows for writing e.g. sorted(my_data, key=functools.cmp_to_key(locale.strcoll)).
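To make the above concrete, here is a minimal sketch of the pattern. The helper name locale_sorted is only for illustration, and the result depends on whatever LC_COLLATE is active on your system (the sample words are plain ASCII, so they should sort the same in most locales):

```python
import functools
import locale


def locale_sorted(items):
    """Sort strings using the collation rules of the current LC_COLLATE locale."""
    # strcoll(a, b) returns a negative, zero, or positive number (three-way
    # comparison); cmp_to_key wraps it so sorted() can use it as a key.
    return sorted(items, key=functools.cmp_to_key(locale.strcoll))


print(locale.strcoll("a", "b") < 0)
print(locale_sorted(["pear", "apple", "fig"]))
```

The Sorting HOW TO also mentions locale.strxfrm, which can be passed directly as key= and should give the same ordering as strcoll.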

Thank you! I appreciate your efforts. I have read this page, but seem to have missed something.

Thank you for elaborating. Trying to follow your explanation I get this result:

>>> x=['å', 'æ', 'ø']
>>> sorted(x, key=functools.cmp_to_key(locale.strcoll))
['å', 'æ', 'ø']

Expected result is ['æ', 'ø', 'å'].

I think the default comparison is by the Unicode code point of each character:

>>> s = "æøå"
>>> sorted(s)
['å', 'æ', 'ø']
>>> sorted(s, key=ord)
['å', 'æ', 'ø']
>>> print(*map(ord, s))
230 248 229

Did you try it after calling locale.setlocale()?
(I cannot test this, as my system does not support this locale.)

Hmm… possibly. The docs say that strcoll() compares two strings according to the current LC_COLLATE setting. As expected, the values returned from locale.strcoll() change when I change the LC_COLLATE value.

@kknechtel Fair question, but yes:

>>> locale.setlocale(locale.LC_ALL, '')
'C/UTF-8/C/C/C/C'
>>> locale.strcoll('ø','å')
19
>>> locale.setlocale(locale.LC_COLLATE, 'no_NO.UTF-8')
'no_NO.UTF-8'
>>> locale.strcoll('ø','å')
31
>>> x=['å', 'æ', 'ø']
>>> sorted(x, key=locale.functools.cmp_to_key(locale.strcoll))
['å', 'æ', 'ø']

AFAIK (after a quick check in the CPython code), strcoll uses the general C library function wcscoll, if that function is available, so this issue should be reproducible in a little C program.
(And it is reproducible. I just tried it with a slightly modified version of the program from the Stack Overflow post linked below.)

See also:
https://github.com/python/cpython/blob/7dd3c2b80064c39f1f0ebbc1f8486897b3148aa5/Lib/test/test_locale.py#L349

So, it appears that wcscoll indeed doesn’t work as expected on macOS, but Python cannot do much about that?
See: c - The wcscoll function, is marked as poisoned, what do I do? - Stack Overflow (old post, also happens to use no_NO as locale; the wcscoll function is indeed poisoned in the macOS SDK).

If that is so, then the Python docs should have made note of this - at least in a footnote, since this is a pretty confusing issue for users.


Thanks a bunch! 👏🏼

The problem affects any locales using multibyte collations according to the man page of wcscoll.
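If you want to check whether your own platform is affected, something like the following rough probe might help. The function name norwegian_collation_works is made up for this example, and the default locale name is the one used earlier in this thread; it may not exist on your system, in which case the probe returns None:

```python
import locale


def norwegian_collation_works(locale_name="no_NO.UTF-8"):
    """Return True/False if the locale can be set, or None if it is unavailable."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None  # this locale is not installed on this system
    try:
        # In Norwegian, 'ø' comes before 'å', so strcoll should be negative.
        return locale.strcoll("ø", "å") < 0
    finally:
        # Restore the previous collation setting.
        locale.setlocale(locale.LC_COLLATE, saved)


print(norwegian_collation_works())
```

A result of False would match what was reported above for macOS: the locale is accepted, but the collation order is not the Norwegian one.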

Yep, saw that… 😕

I also tried the codecs module, tinkering with the encoding. But none of that leads to a correct sort order for Norwegian. The codecs work, but none of the special ones reflect the Bokmål order… So, codecs doesn’t help.

But… the Stack Overflow post actually also gave a hint about a work-around. Those old posts are still valuable! This work-around also happens to be available in Python. Since I didn’t know about this issue and work a lot with different languages, I decided to dig it out a bit more.

You can install PyICU (Python wrapper for Unicode tools) (only <= Python 3.11 unfortunately). Just do a regular pip install (after doing a brew install of icu if you don’t yet have that - see the PyPI page).
(I also did the export PATH as indicated on the PyPI page, but this may not be needed for simple use in Python.)

Then:

>>> from icu import Locale, Collator
>>> loc = Locale("nb")
>>> loc.getDisplayName()
'Norwegian Bokmål'
>>> col = Collator.createInstance(loc)  # a rule-based collator
>>> x = ['å', 'æ', 'ø']
>>> sorted(x, key=col.getSortKey)
['æ', 'ø', 'å']

🎉
Also, note that this code does not care about any of the locale-related settings in the OS environment - everything is controlled by the icu calls in Python.

(See: test/test_Collator.py · main · main / pyicu · GitLab for other uses)
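If installing ICU is not an option, a dependency-free fallback is to spell out the alphabet yourself. This is only a sketch and not what ICU does internally: the names NORWEGIAN_ALPHABET and norwegian_key are made up here, it ignores case, and it does not handle special cases such as "aa" being treated like "å" in some names.

```python
# Hypothetical helper: rank each character by its position in the
# Norwegian alphabet instead of by Unicode code point.
NORWEGIAN_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
_RANK = {ch: i for i, ch in enumerate(NORWEGIAN_ALPHABET)}


def norwegian_key(text):
    """Build a sort key that orders æ, ø, å after z (case-insensitive)."""
    # Characters outside the alphabet get the highest rank and are then
    # ordered among themselves by code point.
    return [(_RANK.get(ch, len(_RANK)), ch) for ch in text.lower()]


print(sorted(["å", "æ", "ø"], key=norwegian_key))
print(sorted(["år", "ære", "øl", "zebra", "Apple"], key=norwegian_key))
```

For anything beyond simple word lists (accents, digits, mixed languages), the ICU collator above is the more robust choice.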


Thank you so much! It worked like a charm!!
