Locale.strxfrm() confusion --- what is it doing & why is it doing it?

smontanaro · October 22, 2022, 12:03am

Just for grins I thought I’d look at a random libref section to see if I could make small improvements to the text. I executed:

(echo -n https://docs.python.org/3.12/library/
 curl -s https://docs.python.org/3.12/library/index.html \
 | egrep 'toctree-l2' \
 | shuffle \
 | head -1 \
 | sed -e 's/.*href="//' -e 's/".*//') \
| xargs open

(shuffle is a little homegrown script that shuffles stdin.)

At the moment, I’m working my way through the locale module docs. I came across this:

locale.strxfrm(string)
    Transforms a string to one that can be used in locale-aware comparisons. For example,
    strxfrm(s1) < strxfrm(s2) is equivalent to strcoll(s1, s2) < 0. This function can be used
    when the same string is compared repeatedly, e.g. when collating a sequence of strings.

That seemed straightforward, but why was it necessary? I gave it a whirl at the REPL:

>>> locale.strxfrm("abcdef")
'½ÅÆÈÉÎ\x01½ÅÆÈÉÎ'
>>> locale.strxfrm("mnop")
'ÙÚÜã\x01ÙÚÜã'

WTF? Can someone explain this to me? What does strxfrm transform the input string to? For reference, I’m running 3.12.0a0 in this locale:

>>> locale.setlocale(locale.LC_ALL, "")
'en_US.UTF-8'

MRAB · October 22, 2022, 1:37am

As for what it’s for, sorting comes to mind.

cameron · October 22, 2022, 1:50am

(shuffle is a little homegrown script that shuffles stdin.)

I’ve got one of them too, with the same name!

WTF? Can someone explain this to me? What does strxfrm transform the input string to? For reference, I’m running 3.12.0a0 in this locale:
>>> locale.setlocale(locale.LC_ALL, "")
'en_US.UTF-8'

In this locale it may mean little.

As I read it, it transforms a string from some locale (which may have
special collation order rules) into another string whose naive lexical
comparison orders it correctly with other strings transformed the same
way. Which would let you use it as a key function to order things like
indices in the locale’s native ordering:

  sorted(locale_strings, key=locale.strxfrm)
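
A slightly fuller sketch of that idea, for what it's worth. Sorting with strxfrm as the key should give the same order as sorting with a strcoll-based comparison, but each string is transformed only once instead of being re-collated on every comparison. The word list is made up, and the "C" locale is pinned only so the snippet runs anywhere; pass "" instead to get your own locale's ordering.

```python
import functools
import locale

# Assumption: "C" is used only because it exists on every platform.
# Use "" here to pick up the user's default locale instead.
locale.setlocale(locale.LC_COLLATE, "C")

# Hypothetical sample data, not from the thread.
words = ["pear", "Apple", "banana", "apple"]

# One strxfrm call per element, then plain string comparisons.
by_key = sorted(words, key=locale.strxfrm)

# One strcoll call per comparison made by the sort.
by_cmp = sorted(words, key=functools.cmp_to_key(locale.strcoll))

print(by_key)
print(by_cmp)
```

Both lists should come out in the same order, whatever locale is active.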

Cheers,
Cameron Simpson cs@cskk.id.au

smontanaro · October 22, 2022, 4:04pm

Thanks. I guess what surprised me was that transforming a string of plain ASCII characters produced something that looked nothing like the original. It didn’t even contain any characters from the original string.

MRAB · October 22, 2022, 5:05pm

When I tried it I got the original string back until I used setlocale.
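
A rough way to see that difference side by side. The helper name strxfrm_under is made up for this sketch, and en_US.UTF-8 (the locale from the first post) may not be installed everywhere, so the helper returns None when the locale can't be used. The actual keys depend on the platform's C library, so no output is shown.

```python
import locale


def strxfrm_under(locale_name, text):
    """Return locale.strxfrm(text) with LC_COLLATE set to locale_name.

    Hypothetical helper: returns None if the locale is not available
    or the transformed key cannot be produced on this platform.
    """
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return locale.strxfrm(text)
    except (locale.Error, ValueError):
        return None


c_key = strxfrm_under("C", "abcdef")
en_key = strxfrm_under("en_US.UTF-8", "abcdef")  # assumption: may be missing

print(repr(c_key))
print(repr(en_key))

# Put collation back to "C" so later code is not affected.
locale.setlocale(locale.LC_COLLATE, "C")
```

Either way, the keys are only meant to be compared with each other, not read.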
