casefold() is intended to remove all case distinctions in a string; for example, the German letter ‘ß’ is folded to "ss". By the same logic, in my opinion, it should also transform the German umlauts ‘ä’, ‘ö’ and ‘ü’ to their international equivalents ‘ae’, ‘oe’ and ‘ue’.
The casefold method follows the Unicode rules. If you want to change what they are, talk to the Unicode consortium.
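To make that concrete, here is a quick sketch using only the stdlib (the sample words are just my own illustrations) of what the default folding does and does not touch:

```python
# Default Unicode case folding as exposed by str.casefold()
samples = ["Straße", "STRASSE", "Äpfel", "äpfel"]
folded = [s.casefold() for s in samples]

# ß is folded to "ss", so both spellings of the street compare equal
same_street = folded[0] == folded[1]

# Ä is only lowercased to ä, so these two compare equal as well ...
same_apples = folded[2] == folded[3]

# ... but the umlaut is not expanded to "ae"
expanded = folded[2] == "aepfel"

print(folded, same_street, same_apples, expanded)
```

So the folding removes case distinctions only; it does not transliterate.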
Thanks for that information. I didn’t know about those Unicode rules, and from my point of view it looked as if .casefold() didn’t do the whole job, transforming only ß but not ä, ö and ü. Still, it’s good to have a discussion and learn new things.
Also note
- umlauts are the German interpretation, but the vowels ä, ë, ï, ö, ü, ÿ are used differently in other languages (diaeresis)
- the equivalence to ae, oe, ue is also specifically German, not an international standard
At the time the casefold rules were first written, uppercase ß didn’t exist, and it still isn’t in common usage. For a long time the official recommendation was to use SS when writing a word in uppercase. This was then adopted by Unicode. At this point it is very unlikely to change because of backwards compatibility.
AFAIK this recommendation never existed for ä, ö, ü, which always had uppercase variants: Ä, Ö, Ü.
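A small sketch of that difference, as far as I understand the default mappings (stdlib only, variable names are my own):

```python
# Uppercasing with the default Unicode mappings
street_upper = "straße".upper()   # ß has no one-to-one uppercase here, it becomes "SS"
umlauts_upper = "äöü".upper()     # the umlauts map one-to-one to Ä, Ö, Ü

# The capital sharp s (U+1E9E) does exist as a code point;
# it lowercases to ß and case-folds to "ss" like ß does
capital_sharp_s = "\u1e9e"
sharp_s_lower = capital_sharp_s.lower()
sharp_s_folded = capital_sharp_s.casefold()

print(street_upper, umlauts_upper, sharp_s_lower, sharp_s_folded)
```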
I am not sure why you think this would be correct behavior? casefold is not crossword rules.
I understood .casefold() to be a way of making strings comparable across national character sets, and in everyday international use, umlauts are converted into their base vowel followed by an e.
The Unicode database in Python’s stdlib is not locale-aware. It is limited to the UCD (Unicode Character Database). If you need locale-specific folding and sorting, then you have to use a library such as ICU (International Components for Unicode).
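If ICU is more than you need and you only care about the German ae/oe/ue convention, one possible workaround is a small translation table applied before casefold(). This is just a sketch, not what ICU does: `german_fold` and `_GERMAN_UMLAUTS` are hypothetical names of my own, and the mapping is an assumption that only makes sense for German text.

```python
import unicodedata

# Hypothetical table: German-specific transliteration, NOT part of Unicode case folding
_GERMAN_UMLAUTS = str.maketrans({
    "\u00e4": "ae",  # ä
    "\u00f6": "oe",  # ö
    "\u00fc": "ue",  # ü
    "\u00c4": "ae",  # Ä
    "\u00d6": "oe",  # Ö
    "\u00dc": "ue",  # Ü
})


def german_fold(text: str) -> str:
    """Hypothetical helper: expand German umlauts, then apply default case folding."""
    # Compose first so that "u" + combining diaeresis becomes the single character ü
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(_GERMAN_UMLAUTS).casefold()


print(german_fold("Müller") == german_fold("MUELLER"))
```

Keep in mind that this rewrites every ä, ö and ü, so it would also turn a Turkish word like "gül" into "guel", which is exactly why such a rule does not belong in a locale-independent casefold().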
In particular, str.casefold doc says “The casefolding algorithm is described in section 3.13.3 ‘Default Case Folding’ of the Unicode Standard.”