locale.strxfrm() confusion: what is it doing and why is it doing it?

Just for grins I thought I’d look at a random libref section to see if I could make small improvements to the text. I executed:

(echo -n https://docs.python.org/3.12/library/
 curl -s https://docs.python.org/3.12/library/index.html \
 | egrep 'toctree-l2' \
 | shuffle \
 | head -1 \
 | sed -e 's/.*href="//' -e 's/".*//') \
| xargs open

(shuffle is a little homegrown script that shuffles stdin.)

At the moment, I’m working my way through the locale module docs. I came across this:

locale.strxfrm(string)
    Transforms a string to one that can be used in locale-aware comparisons. For example,
    strxfrm(s1) < strxfrm(s2) is equivalent to strcoll(s1, s2) < 0. This function can be used
    when the same string is compared repeatedly, e.g. when collating a sequence of strings.

That seemed straightforward, but why was it necessary? I gave it a whirl at the REPL:

>>> locale.strxfrm("abcdef")
'½ÅÆÈÉÎ\x01½ÅÆÈÉÎ'
>>> locale.strxfrm("mnop")
'ÙÚÜã\x01ÙÚÜã'

WTF? Can someone explain this to me? What does strxfrm transform the input string to? For reference, I’m running 3.12.0a0 in this locale:

>>> locale.setlocale(locale.LC_ALL, "")
'en_US.UTF-8'

As for what it’s for, sorting comes to mind.

(shuffle is a little homegrown script that shuffles stdin.)

I’ve got one of them too, with the same name!

WTF? Can someone explain this to me? What does strxfrm transform the input string to? For reference, I’m running 3.12.0a0 in this locale:

>>> locale.setlocale(locale.LC_ALL, "")
'en_US.UTF-8'

In this locale it may mean little.

As I read it, it transforms a string from some locale (which may have
special collation order rules) into another string whose naive lexical
comparison orders it correctly with other strings transformed the same
way. Which would let you use it as a key function to order things like
indices in the locale’s native ordering:

  sorted(locale_strings, key=locale.strxfrm)
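
To make that concrete, here is a minimal sketch of the two ways to sort under the current collation rules. The helper names sort_with_strxfrm and sort_with_strcoll and the word list are made up for illustration. The idea is that strxfrm is called once per string to build a key, while strcoll is called once per comparison, so the key version should do less locale work on long lists. The docs say the two orderings agree, though I would not rule out platform quirks in some locales.

```python
import functools
import locale


def sort_with_strxfrm(strings):
    """Sort using one strxfrm() call per string as the sort key."""
    return sorted(strings, key=locale.strxfrm)


def sort_with_strcoll(strings):
    """Sort using strcoll() as a pairwise comparison function."""
    return sorted(strings, key=functools.cmp_to_key(locale.strcoll))


if __name__ == "__main__":
    try:
        # Pick up the user's collation rules from the environment.
        locale.setlocale(locale.LC_COLLATE, "")
    except locale.Error:
        # Environment names a locale that is not installed; stay in "C".
        pass

    words = ["banana", "Apple", "cherry", "apple"]  # illustrative data
    print(sort_with_strxfrm(words))
    print(sort_with_strcoll(words))
```

The transformed strings are only meant to be compared with each other under the same locale, so their contents are not expected to be readable.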

Cheers,
Cameron Simpson cs@cskk.id.au

Thanks. I guess what surprised me was that the transformation of a string of plain ASCII characters didn’t look anything like the original string. It didn’t even contain any characters from the original string.

When I tried it I got the original string back until I used setlocale.
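
That matches my understanding that the result depends on the collation locale in effect when strxfrm is called. Below is a small sketch for comparing the output under different locales without leaving the process in a changed state. xfrm_in_locale is a hypothetical helper, not part of the locale module, and "en_US.UTF-8" is only an example name that may not be installed on a given machine.

```python
import locale


def xfrm_in_locale(s, locale_name):
    """Return locale.strxfrm(s) with LC_COLLATE temporarily set to locale_name."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return locale.strxfrm(s)
    finally:
        # Always restore the previous collation locale.
        locale.setlocale(locale.LC_COLLATE, saved)


if __name__ == "__main__":
    for name in ("C", "en_US.UTF-8"):  # second name is an assumption
        try:
            print(name, repr(xfrm_in_locale("abcdef", name)))
        except locale.Error:
            print(name, "is not available on this system")
```

What the transformed string looks like is up to the platform's C library, so the exact bytes should not be relied on across systems.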