When you do a “case operation” like .lower(), CPython internally distinguishes between ASCII and non-ASCII strings. For ASCII strings, ascii_upper_or_lower is used, which allocates roughly str_len bytes for the operation. For non-ASCII strings, case_operation is used instead: it allocates sizeof(Py_UCS4) * 3 * str_len bytes, performs the mapping in that wide buffer, converts the result to an object of the right width (sizeof(Py_UCS[1/2/4]) * str_len), and then frees the temporary sizeof(Py_UCS4) * 3 * str_len buffer.
The non-ASCII variant is used as soon as even one character falls outside the basic ASCII range (ord(char) > 127). That creates situations like this: imagine a string with 500 million ASCII characters and one non-ASCII character. For a .lower() operation CPython will allocate 4 * 3 * 500_000_000 bytes, which is around 6 GB of memory, just to lowercase a 500 MB string.
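You can observe the difference from pure Python with tracemalloc, which traces CPython's PyMem allocations (including the temporary wide buffer). This is a rough measurement sketch; peak_lower_bytes is just a hypothetical helper name, and the exact numbers will vary by build:

```python
import tracemalloc

def peak_lower_bytes(s):
    """Peak traced allocation (in bytes) while lowercasing s."""
    tracemalloc.start()
    s.lower()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

n = 1_000_000
ascii_only = "A" * n
one_non_ascii = "A" * (n - 1) + "\u00c9"  # a single 'É' forces the wide path

# Expect roughly n bytes peak for the ASCII path, but on the order of
# 12 * n for the non-ASCII path (the sizeof(Py_UCS4) * 3 buffer).
print(peak_lower_bytes(ascii_only))
print(peak_lower_bytes(one_non_ascii))
```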
Now for the question: what is the reasoning behind this huge memory allocation in the non-ASCII case, and can anything be done about it?
Change in Length. Case mappings may produce strings of different lengths than the original. For example, the German character U+00DF ß latin small letter sharp s expands when uppercased to the sequence of two characters “SS”. Such expansion also occurs where there is no precomposed character corresponding to a case mapping, such as with U+0149 ŉ latin small letter n preceded by apostrophe. The maximum string expansion as a result of case mapping in the Unicode Standard is three. For example, uppercasing U+0390 ΐ greek small letter iota with dialytika and tonos results in three characters.
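Python's str.upper() implements these full case mappings, so the expansion is easy to verify directly:

```python
# Full Unicode case mappings can change a string's length:
assert "\u00df".upper() == "SS"        # ß -> "SS" (two characters)
assert "\u0149".upper() == "\u02bcN"   # ŉ -> ʼN (no precomposed uppercase exists)
assert len("\u0390".upper()) == 3      # ΐ -> three code points (worst case)
```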
I guess you could do the operation separately for smaller substrings, making for smaller allocations each time, and concatenate the results.
You could also do some sort of guess for the final length (x times the original length, with x < 3) and reallocate if needed.
Yet another option would be to walk the string once to determine the final length, allocate it, and re-walk to do the actual mapping.
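The last strategy (walk once for the length, then again to fill) can be sketched in pure Python; lower_two_pass is a hypothetical name, and note that applying .lower() per character ignores Python's one context-sensitive mapping (Greek capital sigma at word end), so this is a sketch of the idea, not a drop-in replacement:

```python
def lower_two_pass(s):
    # Pass 1: determine the exact output length. Full case mapping may
    # expand, e.g. "\u0130" (İ) lowercases to two code points.
    n = sum(len(ch.lower()) for ch in s)
    # Pass 2: fill a preallocated buffer of exactly the right size.
    # (In C this is what would avoid the 3x worst-case temporary buffer.)
    out = [""] * n
    i = 0
    for ch in s:
        for lc in ch.lower():
            out[i] = lc
            i += 1
    return "".join(out)
```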
There are many strategies. In general, core devs are conservative about accepting changes that increase complexity and may also impact performance. If you want to propose something like this to be done by default for very large strings, I think you would need to come up with an implementation and some benchmarks.
If you only want to solve your problem in production right now, you could write your own implementation of either strategy in pure Python using bytearrays.
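For the chunking strategy specifically, you don't even need bytearrays: plain chunked str.lower() already bounds the temporary wide buffer to the chunk size. A sketch (lower_chunked is a hypothetical helper, not anything in the stdlib):

```python
def lower_chunked(s, chunk_size=1 << 20):
    # Lowercase in fixed-size chunks so the temporary UCS4 buffer
    # CPython allocates internally is bounded by chunk_size instead
    # of the full string length.
    # Caveat: .lower() has one context-sensitive mapping (Greek capital
    # sigma at the end of a word maps to final sigma), so a chunk
    # boundary landing right after a sigma could in principle change
    # the result compared to a single .lower() call.
    return "".join(
        s[i:i + chunk_size].lower() for i in range(0, len(s), chunk_size)
    )
```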
I’ve certainly done .casefold() on huge strings, which would presumably also allocate 12 bytes per character. It’s a good prerequisite to searching case-insensitively for something. That said, though, I’ve never really cared about that sort of temporary usage.
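Worth noting that casefold can also expand strings, which is exactly why it (rather than .lower()) is the right tool for caseless matching:

```python
# ß casefolds to "ss", so a caseless search matches both spellings:
assert "\u00df".casefold() == "ss"
haystack = "Stra\u00dfe und mehr"   # "Straße und mehr"
needle = "STRASSE"
assert needle.casefold() in haystack.casefold()
```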
I think that’s ultimately the answer here. If you can save the memory costs without adversely impacting more common cases, and without introducing unacceptable code complexity, I don’t see why such a PR would be rejected. But without an actual proposed implementation, nothing is likely to happen, because this simply isn’t a common enough issue to prompt anyone to look at it.
And if the problem isn’t important enough to you to justify writing a PR and waiting for a version of Python with the improvement included, then there’s your answer - nobody else was sufficiently motivated either…
That’s the quick fix, but it will still leave you with about twice the required memory at the time of the final concatenation. While I agree that having to deal with huge strings in memory is usually a design issue, it would be nice to know that the low-level string operations are as efficient as they can be, mainly in time, but also in memory.
Edit: Not saying this is an important issue, but I do remember being caught out by the memory consumption when processing a big blob of text and thinking “it fits, so I’ll just do it in memory” for a quick and dirty script.
Yes, this question emerged from a real-life use of .lower() on a big piece of data to perform a case-insensitive search. The fix was to take a simple slice, since we didn’t actually need to search the whole string. I was just wondering whether this was some kind of memory misallocation or whether there was logic behind it.