High memory consumption for non-ASCII case operations (like `.lower`, `.upper`, etc.)?

When you do a “case operation” like .lower, internally there is a difference between processing ASCII and non-ASCII strings. When the string is pure ASCII, ascii_upper_or_lower is used, which allocates around str_len bytes for the operation. But when it’s not ASCII, case_operation is used, which allocates sizeof(Py_UCS4) * 3 * str_len bytes for the operation, performs it, and then converts the result to an object of the right size (sizeof(Py_UCS[1/2/4]) * str_len), freeing the sizeof(Py_UCS4) * 3 * str_len buffer allocated before.

Now, the non-ASCII variant is used even if there is just one character outside the basic ASCII table (ord(char) > 127). This creates situations like the following: imagine a string with 500 million ASCII characters and one non-ASCII character. For a .lower operation CPython will allocate 4 * 3 * 500,000,000 bytes, which is around 6 GB of memory, for a .lower on a ~500 MB string.
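A rough way to observe this yourself (assuming tracemalloc sees the internal PyMem buffer; the exact numbers will vary by build and string size):

```python
import tracemalloc

s = "a" * 10_000_000 + "é"   # mostly ASCII, one non-ASCII character

tracemalloc.start()
s.lower()                     # takes the non-ASCII (case_operation) path
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Expect roughly 12 bytes per character for the temporary UCS4 buffer,
# plus the result string itself.
print(f"peak during .lower(): {peak / 1e6:.1f} MB")
```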

Now for the question: what is the reasoning behind this huge memory allocation for the non-ASCII case, and can we do something about it?

>>> len("ß")
1
>>> "ß".upper()
'SS'
>>> len("ß".upper())
2

This is just one example. There are many Unicode characters which get transformed into several characters when changing case.

See https://www.unicode.org/versions/Unicode15.1.0/ch05.pdf#G21180 page 44 (emphasis mine):

Change in Length. Case mappings may produce strings of different lengths than the original. For example, the German character U+00DF ß latin small letter sharp s expands when uppercased to the sequence of two characters “SS”. Such expansion also occurs where there is no precomposed character corresponding to a case mapping, such as with U+0149 ŉ latin small letter n preceded by apostrophe. **The maximum string expansion as a result of case mapping in the Unicode Standard is three.** For example, uppercasing U+0390 ΐ greek small letter iota with dialytika and tonos results in three characters.

I guess you could do the operation separately for smaller substrings, making for smaller allocations each time, and concatenate the results.
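A minimal sketch of that chunking idea (function name and chunk size are arbitrary; this is only valid for per-code-point operations like .lower/.upper/.casefold, not .title, which depends on word boundaries):

```python
def lower_in_chunks(s: str, chunk_size: int = 1 << 20) -> str:
    # Each .lower() call only needs a ~4 * 3 * chunk_size byte temporary buffer.
    parts = [s[start:start + chunk_size].lower()
             for start in range(0, len(s), chunk_size)]
    # The final join still briefly holds the parts plus the result (~2x).
    return "".join(parts)
```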

You could also do some sort of guess for the final length (x times the original length, with x < 3) and reallocate if needed.

Yet another option would be to walk the string once to determine the final length, allocate it, and re-walk to do the actual mapping.

There are many strategies. In general, core devs are conservative when it comes to accepting changes that increase complexity and potentially impact performance. If you want to propose that something like this be done by default for very large strings, I think you would need to come up with an implementation and some benchmarks.

If you only want to solve your problem in production right now, you could write your own implementation of either strategy in pure Python using bytearrays.

Another approach might be to use str.translate with a translation dictionary.
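Something along these lines, assuming you build and cache the table once (building it iterates the whole code space, so it isn’t free, and whether translate avoids the big temporary buffer internally is something worth benchmarking):

```python
# Map every code point whose lowercase form differs; str.translate accepts
# a dict of ordinals -> replacement strings.
LOWER_TABLE = {cp: chr(cp).lower()
               for cp in range(0x110000)
               if chr(cp).lower() != chr(cp)}

print("Straße ÉTÉ".translate(LOWER_TABLE))   # 'straße été'
```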

Most codepoints map to 1 codepoint, a few to 2 codepoints, and very few to 3 codepoints.

Is this an actual problem or just a theoretical concern?
In other words, is there a real-world situation where one calls .lower on a 500 MB string?

I’ve certainly done .casefold on huge strings, which would presumably also allocate 12 bytes per character. It’s a good prerequisite to searching case insensitively for something. That said, though, I’ve never really cared about that sort of temporary usage.

I think that’s ultimately the answer here. If you can save the memory costs without adversely impacting more common cases and without introducing unacceptable code complexity, I don’t see why such a PR would be rejected. But without an actual proposed implementation, nothing is likely to happen, because this simply isn’t a common enough issue to prompt anyone to look at it.

And if the problem isn’t important enough to you to justify writing a PR and waiting for a version of Python with the improvement included, then there’s your answer - nobody else was sufficiently motivated either…

That’s the quick fix, but it will still leave you with about twice the required memory at the time of the last concatenation. While I agree that having to deal with huge strings in memory is usually a design issue, it would be nice to know that the low-level string operations are as efficient as they can be, mainly in time, but also in memory.

Edit: Not saying this is an important issue, but I do remember being caught out by the memory consumption when processing a big blob of text and thinking “it fits, so I’ll just do it in memory” for a quick and dirty script.

You wouldn’t concatenate yourself but rely on either io.StringIO or "".join. Both have downsides, but so would an implementation inside the case operations themselves.
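For example, the io.StringIO variant might look like this (same caveats as the chunked "".join sketch above):

```python
import io

def lower_via_stringio(s: str, chunk_size: int = 1 << 20) -> str:
    buf = io.StringIO()
    for start in range(0, len(s), chunk_size):
        buf.write(s[start:start + chunk_size].lower())
    return buf.getvalue()
```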

We may be able to take a two-pass approach:

  1. Calculate max(ord(lower(char))) and the result length.
  2. Do the conversion directly into the new string body.

This might be a bit slower than the current code, but it avoids the temporary RAM completely.
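A pure-Python illustration of that shape (only to show the two passes; the per-character list is itself expensive, so the real benefit would only materialize at the C level, writing into the new string body directly):

```python
def lower_two_pass(s: str) -> str:
    # Pass 1: exact result length (the C version would also track the max code point).
    out_len = sum(len(ch.lower()) for ch in s)
    # Pass 2: fill a preallocated buffer of the final size.
    out = [""] * out_len
    i = 0
    for ch in s:
        for mapped in ch.lower():
            out[i] = mapped
            i += 1
    return "".join(out)
```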

Just to add some context: The overallocation happening here is to make these operations fast for shorter strings, which are much more common in practice than the larger strings you are discussing.

For larger strings, you are probably better off using special implementations written in e.g. Cython, which first scan the text to determine a more accurate pre-allocation size.

Heh, thanks for the reminder 🙂 I completely abandoned my work on the C code for that, even though it still seems entirely doable to me.

Yes, this question emerged from a real-life use of .lower on a big piece of data to perform a case-insensitive search. The fix was to do a simple slice, as we didn’t actually need to search the whole string. I was just wondering whether this was some kind of memory misallocation or whether there was logic behind it.

Thanks everyone for this great explanation.

Does casefold have the same issue?

If this is for “caseless matching”, perhaps casefold would be better?
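For example, casefold normalizes more aggressively, which is what you want for caseless comparisons:

```python
# casefold maps more aggressively than lower ('ß' -> 'ss'),
# which is why it is recommended for caseless matching.
print("Straße".lower())                              # 'straße'
print("Straße".casefold())                           # 'strasse'
print("STRASSE".casefold() == "Straße".casefold())   # True
```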

Yes, this kind of allocation (4 * 3 * str_len) should apply to .title, .capitalize, .casefold, .swapcase, .lower and .upper.

Have you tried using the re module for this? It has an option for case-insensitive search.
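For example (placeholder data), re.IGNORECASE lets the regex engine handle case per match attempt instead of lowering the whole haystack up front:

```python
import re

haystack = "... a very large Text blob ..."   # placeholder for the real data
needle = "text blob"

match = re.search(re.escape(needle), haystack, flags=re.IGNORECASE)
print(match.span() if match else "not found")
```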
