Python's sorted(x, key=functools.cmp_to_key(locale.strcoll)) gives an unexpected order for 'æøå' (Norwegian letters)


In the Norwegian alphabet, the Latin letters a-z are followed by æ, ø and å. Still, when I evaluate 'å' > 'ø', Python considers the statement to be false.

How can I change this? Setting LC_COLLATE to no_NO.UTF-8 changes nothing. I run Python 3.9.6 on an M1 Mac with macOS Ventura.

Have a look at Sorting HOW TO — Python 3.12.0 documentation that may be what you are looking for.

Summarizing the doc and giving a more inductive explanation:

Before we can sort, we must be able to compare.

The default comparison between strings - using, for example, operators like < - does not care about the locale setting. To get such a comparison, we need the strcoll function from the same locale standard library module that you use to set the locale. It’s defined to implement a three-way comparison, like the old-style cmp function. (This way, a single ordinary function represents the logic for all the comparison operators). To use that as a key for sorting, we need the cmp_to_key adapter from functools, which then allows for writing e.g. sorted(my_data, key=functools.cmp_to_key(locale.strcoll)).
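As a rough sketch of the approach described above: the helper name locale_sorted and its locale_name parameter are made up for illustration; only locale.setlocale, locale.strcoll and functools.cmp_to_key come from the standard library. The helper switches LC_COLLATE temporarily and restores it afterwards. Whether the resulting order matches a language's alphabet depends on how the platform's C library implements collation for that locale.

```python
import functools
import locale


def locale_sorted(items, locale_name=""):
    """Sort strings using the collation rules of locale_name.

    An empty locale_name means "the user's default locale".
    Raises locale.Error if the requested locale is not installed.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strcoll is a three-way comparison; cmp_to_key adapts it to a sort key.
        return sorted(items, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # always restore


# The "C" locale exists everywhere and compares plain ASCII by code value.
print(locale_sorted(["b", "a", "C"], "C"))
```

For a Norwegian sort one would pass a name such as "no_NO.UTF-8", provided the system has that locale installed.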

Thank you! I appreciate your efforts. I have read this page, but seem to have missed something.

Thank you for elaborating. Trying to follow your explanation I get this result:

>>> x=['å', 'æ', 'ø']
>>> sorted(x, key=functools.cmp_to_key(locale.strcoll))
['å', 'æ', 'ø']

Expected result is ['æ', 'ø', 'å'].

I think the comparison is done on the Unicode code points of the characters:

>>> s = "æøå"
>>> sorted(s)
['å', 'æ', 'ø']
>>> sorted(s, key=ord)
['å', 'æ', 'ø']
>>> print(*map(ord, s))
230 248 229
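To make the contrast concrete, here is a minimal sketch of a locale-independent sort key. It is not a standard library feature: NORWEGIAN_ALPHABET and norwegian_key are hypothetical names, the alphabet is typed out by hand, and the key only handles case-insensitive, letter-by-letter ordering (anything outside the alphabet simply sorts after it, by code point).

```python
# Hypothetical helper: the lowercase Norwegian alphabet written out by hand.
NORWEGIAN_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
_RANK = {ch: i for i, ch in enumerate(NORWEGIAN_ALPHABET)}


def norwegian_key(text):
    """Map each character to its alphabet position.

    Characters outside the alphabet get a rank past the last letter,
    ordered among themselves by code point.
    """
    return [_RANK.get(ch, len(_RANK) + ord(ch)) for ch in text.lower()]


letters = "æøå"
print(sorted(letters))                     # ['å', 'æ', 'ø'] (code point order)
print(sorted(letters, key=norwegian_key))  # ['æ', 'ø', 'å'] (alphabet order)
```

This ignores everything a real collation algorithm handles (accents, digraphs, punctuation), so it is only meant to show where the default order comes from.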

Did you try it after

(I cannot test this, as my system does not support this locale.)

Hmm… possibly. The docs say that strcoll() compares two strings according to the current LC_COLLATE setting. As expected, the values returned by locale.strcoll() change when I change the LC_COLLATE value.

@kknechtel Fair question, but yes:

>>> locale.setlocale(locale.LC_ALL, '')
>>> locale.strcoll('ø','å')
>>> locale.setlocale(locale.LC_COLLATE, 'no_NO.UTF-8')
>>> locale.strcoll('ø','å')
>>> x=['å', 'æ', 'ø']
>>> sorted(x, key=functools.cmp_to_key(locale.strcoll))
['å', 'æ', 'ø']

AFAIK (after a quick check in the CPython code) strcoll uses the general C library function wcscoll, if that function is available, so this issue should be reproducible in a small C program.
(And it is reproducible. I just tried it with a slightly modified version of the program from the Stack Overflow post linked below.)


So it appears that wcscoll indeed doesn’t work as expected on macOS, and that Python cannot do much about it?
See: c - The wcscoll function, is marked as poisoned, what do I do? - Stack Overflow (an old post that also happens to use no_NO as the locale; the wcscoll function is indeed poisoned in the macOS SDK).
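A quick way to check from Python whether a given system is affected might look like the sketch below. The function name collation_honours_locale is invented for this example, and the default locale name is simply the one used in this thread. It returns None when the locale is not installed, so it can be run anywhere.

```python
import locale


def collation_honours_locale(locale_name="no_NO.UTF-8"):
    """Report whether strcoll puts 'ø' before 'å' under locale_name.

    Returns True or False, or None if the locale cannot be set.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None  # locale not available on this system
    try:
        # Norwegian order is æ, ø, å, so a working collation returns < 0 here.
        return locale.strcoll("ø", "å") < 0
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


print(collation_honours_locale())
```

A result of False for a Norwegian locale would suggest that the C library is falling back to plain code point comparison, as discussed above.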

If that is so, then the Python docs should have made note of this - at least in a footnote, since this is a pretty confusing issue for users.

Thanks a bunch!

The problem affects any locales using multibyte collations according to the man page of wcscoll.

Yep, saw that…

I also tried the codecs module, tinkering with the encoding, but none of that leads to a correct sort order for Norwegian. The codecs work, but none of the special ones reflect the Bokmål order. So codecs doesn’t help.

But… the Stack Overflow post also gave a hint about a workaround. Those old posts are still valuable! This workaround also happens to be available in Python. Since I didn’t know about this issue and work a lot with different languages, I decided to dig into it a bit more.

You can install PyICU (a Python wrapper for the ICU Unicode library) (only <= Python 3.11, unfortunately). Just do a regular pip install (after a brew install of icu if you don’t have it yet; see the PyPI page).
(I also did the export PATH step indicated on the PyPI page, but this may not be needed for simple use in Python.)


>>> from icu import Locale, Collator
>>> loc = Locale("nb")
>>> loc.getDisplayName()
'Norwegian Bokmål'
>>> col = Collator.createInstance(loc)  # a rule-based collator
>>> x = ['å', 'æ', 'ø']
>>> sorted(x, key=col.getSortKey)
['æ', 'ø', 'å']

Also, note that this code does not care about any of the locale-related settings in the OS environment - everything is controlled by the icu calls in Python.
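For reuse, the same calls can be wrapped in a small function. This is only a sketch: norwegian_sorted is a made-up name, and returning None when PyICU is missing is a choice made here so the snippet runs on machines without it. The ICU calls themselves (Locale, Collator.createInstance, getSortKey) are the ones shown above.

```python
def norwegian_sorted(words):
    """Return words sorted by ICU's Norwegian Bokmål collation.

    Returns None if PyICU is not installed (an arbitrary choice for this sketch).
    """
    try:
        from icu import Collator, Locale  # third-party: pip install PyICU
    except ImportError:
        return None
    collator = Collator.createInstance(Locale("nb"))
    # getSortKey turns each string into bytes that compare in collation order.
    return sorted(words, key=collator.getSortKey)


print(norwegian_sorted(["ål", "zebra", "øl", "æra"]))
```

With PyICU available, words starting with æ, ø and å should come after "zebra", in that order.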

(See: test/ · main · main / pyicu · GitLab for other uses)


Thank you so much! It worked like a charm!!

