Locale.strcoll doc - doesn't mention problems on Darwin

The documentation for locale.strcoll (locale — Internationalization services — Python 3.12.0 documentation) doesn’t mention that locale.strcoll doesn’t work as expected for certain locales on Darwin. (See for instance this thread about Norwegian: Python's sorted(x, key=functools.cmp_to_key(locale.strcoll)) comparison on 'æøå' unexpected (norwegian letters)). It would be nice, I think, if this could at least be mentioned in a footnote, perhaps also with a reference to the work-around of using PyICU for string collations.

Not sure if this should be considered just a minor doc issue or a bug in locale.strcoll itself. It does make the doc pretty confusing when trying to use some other locales, and for regular users it may be pretty hard to find out what’s behind the unexpected behavior. A simple footnote would go a long way toward helping them.
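
For reference, the PyICU work-around mentioned above could look roughly like the sketch below. It assumes the third-party PyICU package is installed; the helper name `icu_sorted` and the locale identifier `nb_NO` are my own choices for illustration, not something the docs or the linked thread prescribe.

```python
def icu_sorted(words, locale_name="nb_NO"):
    """Sort words with ICU collation rules instead of locale.strcoll.

    Hypothetical helper: requires the third-party PyICU package.
    """
    # Imported lazily so this module still loads when PyICU is missing.
    import icu

    collator = icu.Collator.createInstance(icu.Locale(locale_name))
    # getSortKey returns a bytes key; comparing keys reproduces ICU's ordering.
    return sorted(words, key=collator.getSortKey)
```

With PyICU available, `icu_sorted(['å', 'æ', 'ø'])` should give the Norwegian order the linked thread expects, and it does not depend on the platform's libc.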

It seems from the linked thread that the bug occurs in some internal C libraries that Python is wrapping, that are not part of the Python distribution itself. I guess that the Python devs won’t consider it their responsibility to track down information about those kinds of bugs and propagate it in the Python documentation. However, maybe the documentation could be clearer and more explicit about the reliance on those libraries and exposure to any bugs they might contain.

As it stands, we are only told

The locale module opens access to the POSIX locale database and functionality. The POSIX locale mechanism allows programmers to deal with certain cultural issues in an application, without requiring the programmer to know all the specifics of each country where the software is executed.

The locale module is implemented on top of the _locale module, which in turn uses an ANSI C locale implementation if available.

which is not a lot to go on when something like this comes up.

If Apple provides another API that does work, then I’m sure we would consider using that for macOS builds. And if they’ve officially deprecated the use of the API on their platform, we would definitely add a message to the documentation. @nad and @ronaldoussoren are our experts here.

If it’s just a known issue that they’re intending to fix, we wouldn’t necessarily note that in our docs or make code changes to deal with it, but it depends on the circumstances (like how long it’s been broken for, what messaging there’s been around fixes, etc.).

We generally don’t recommend third-party modules as alternatives unless we feel they’re at least as trustworthy as CPython itself. Anything that goes into our docs lends our credibility to other projects, so we can’t just link to something because it seems to work right now. Adding a new dependency is a bigger discussion, but not impossible, and ICU is a pretty good candidate for a couple of cases.

I don’t have a clear picture of the historical background, but I’m pretty sure this is not caused by an Apple bug. As far as I know this has been in the system for over 10 years (!) and locale-related modifications in the macOS SDK seem to have been made very deliberately. I also wonder if the underlying implementation of wcscoll (=simply calling wcscmp, ignoring locale) is macOS specific – I have the impression it is not (see for instance newlib libc, also FreeBSD). But regardless, if locale.strcoll doesn’t work on certain platforms for certain locales, shouldn’t Python then look for a solution in its code?

What strikes me in the CPython implementation is that unicode strings are always converted to wide-char strings (which seems a bit antiquated perhaps? platform-dependent?) and then wcscoll is called. (At that point errno isn’t checked; I don’t know if this is important or not, but errno could in principle be set, according to the POSIX spec). Is wcscoll correct on all platforms (using any libc) if a locale has multibyte collation units?
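
To make the errno point concrete, here is a rough ctypes sketch of the "clear errno, call, then check errno" pattern that POSIX describes for wcscoll. It is only an illustration and not how CPython does it: it assumes a POSIX platform where `ctypes.CDLL(None)` exposes the C library, and `checked_wcscoll` is a made-up name.

```python
import ctypes
import os

# Assumption: POSIX only. CDLL(None) gives access to the C library already
# loaded into the process; use_errno=True makes ctypes track errno for us.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.wcscoll.argtypes = [ctypes.c_wchar_p, ctypes.c_wchar_p]
_libc.wcscoll.restype = ctypes.c_int


def checked_wcscoll(a, b):
    """Call wcscoll(3) and raise OSError if it reports an error via errno."""
    ctypes.set_errno(0)            # clear errno before the call
    result = _libc.wcscoll(a, b)   # str arguments are passed as wchar_t*
    err = ctypes.get_errno()       # errno as left by the call
    if err:
        raise OSError(err, os.strerror(err))
    return result
```

The comparison uses whatever LC_COLLATE is active in the process, so it only shows the error-checking pattern; it does not work around the collation problem.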

As to trustworthiness – I don’t think there can be any doubt that the Unicode Consortium is at least as reputable as the PSF :slight_smile: So, I’d assume there is no problem with ‘icu’ in that regard. (The PyICU wrapper is a separate entity - it doesn’t seem to have a direct relation to the Unicode Consortium, but seems pretty solid to me. I do understand the hesitation to refer to third-party modules in the docs - but there are precedents for this. For instance references to the requests and regex packages - recommended for certain use cases - and rightfully so.)

Seems to me that one way to fix this in Python itself could indeed be to use ‘icu’ as a dependency, and then use ‘icu’ methods (or data) to re-implement locale.strcoll in a really platform-independent way.

I think it’s also reasonable to say: We cannot/don’t want to fix this. But in that case, some kind of note in the docs would really be helpful, since currently the documentation is misleading.

AFAIK the wcscoll(3) API is not deprecated on macOS, but as mentioned earlier in this thread doesn’t work with multi-byte LC_CTYPE locales, which means it doesn’t work for most users because the default LC_CTYPE is AFAIK UTF-8 (at least on all my systems, running various macOS versions).

The documentation for strcoll(3) doesn’t mention this restriction, but seems to suffer from the same issue.
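
To see whether a given system falls into the multi-byte LC_CTYPE case described above, something like the following sketch can be used. The function name and the "starts with utf" test are my own rough heuristic, not an official check.

```python
import locale


def ctype_summary():
    """Return (LC_CTYPE setting, preferred encoding, looks_multibyte)."""
    # Calling setlocale() with only a category queries the current setting.
    ctype = locale.setlocale(locale.LC_CTYPE)
    # False: report the encoding without changing the process locale.
    encoding = locale.getpreferredencoding(False)
    # Heuristic assumption: treat any UTF-* encoding as multi-byte.
    looks_multibyte = encoding.lower().replace("-", "").startswith("utf")
    return ctype, encoding, looks_multibyte
```

If the last element is true on macOS, the limitation discussed in this thread probably applies to that process.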

Code used to reproduce:


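import functools
import locale

# Select Norwegian collation rules; prints the locale name that was applied.
print(locale.setlocale(locale.LC_COLLATE, 'no_NO.UTF-8'))

x = ['å', 'æ', 'ø']
print(f"In: {x}")

# Sort using the platform's locale-aware comparison function.
print(f"Out: {sorted(x, key=functools.cmp_to_key(locale.strcoll))}")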

This prints:

no_NO.UTF-8
In: ['å', 'æ', 'ø']
Out: ['å', 'æ', 'ø']

The thread linked to earlier mentions that the expected output is ['æ', 'ø', 'å'].
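
A small check along these lines could tell a program at runtime whether locale.strcoll produces the order it expects. This is only a sketch; `collation_matches` is a hypothetical helper name, and the expected order has to be supplied by the caller.

```python
import functools
import locale


def collation_matches(expected, locale_name):
    """Check whether locale.strcoll sorts `expected` into its given order.

    Returns True or False, or None if the locale is not available.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        key = functools.cmp_to_key(locale.strcoll)
        return sorted(expected, key=key) == list(expected)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore on the way out
```

Based on the output above, I would expect `collation_matches(['æ', 'ø', 'å'], 'no_NO.UTF-8')` to be False on an affected macOS system.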

The output is unchanged if I apply a crude patch to the locale module that switches to strcoll(3) for locale.strcoll.


diff --git a/Modules/_localemodule.c b/Modules/_localemodule.c
index fe8e4c5e30..375cbdc6b6 100644
--- a/Modules/_localemodule.c
+++ b/Modules/_localemodule.c
@@ -349,21 +349,19 @@ _locale_strcoll_impl(PyObject *module, PyObject *os1, PyObject *os2)
 /*[clinic end generated code: output=82ddc6d62c76d618 input=693cd02bcbf38dd8]*/
 {
     PyObject *result = NULL;
-    wchar_t *ws1 = NULL, *ws2 = NULL;
+    char *ws1 = NULL, *ws2 = NULL;
 
     /* Convert the unicode strings to wchar[]. */
-    ws1 = PyUnicode_AsWideCharString(os1, NULL);
+    ws1 = PyUnicode_AsUTF8(os1);
     if (ws1 == NULL)
         goto done;
-    ws2 = PyUnicode_AsWideCharString(os2, NULL);
+    ws2 = PyUnicode_AsUTF8(os2);
     if (ws2 == NULL)
         goto done;
     /* Collate the strings. */
-    result = PyLong_FromLong(wcscoll(ws1, ws2));
+    result = PyLong_FromLong(strcoll(ws1, ws2));
   done:
     /* Deallocate everything. */
-    if (ws1) PyMem_Free(ws1);
-    if (ws2) PyMem_Free(ws2);
     return result;
 }
 #endif

It might be possible to get the correct behaviour using CoreFoundation functions (like CFStringCompareWithOptionsAndLocale), but that has two problems: first of all the extra cost of converting Python strings to CoreFoundation strings, and more importantly causing more problems when using os.fork because Apple’s Cocoa frameworks are known to be problematic when using os.fork without immediately exec-ing a different program.

It’s probably better to just document this limitation.
