A case for a long string representation?

This is a specific scenario, but there are many like it.

A soft-concurrent program (threads or asyncio) is visiting several Web sites. The memory pool is fragmented, but total memory remains low and constant. Then several huge responses are downloaded.

In that case, a statement like this becomes O(MxN) in memory (N = size of the response, M = number of methods called):

body = text.strip().replace('\x00', '').encode('utf8')

(that’s a real-life example, but think about the case in which each call creates a slightly larger result).

The reason is that each chained method call allocates a new full-size result while the string it was called on is still alive, and in a fragmented pool the blocks freed along the way cannot be reused for the next, differently sized allocation.

The currently available ways of avoiding the situation come at the expense of losing the convenience of the many methods and functions available for str and bytes.

One possible solution that keeps the convenience of str and bytes would be for Python to switch to a “long string (bytes)” representation once the length is above a certain threshold. The long string would be stored in smaller blocks. For some operations the single large block may need to be reconstructed, but memory use should stay within O(2xN) per “logical string”, with one of the two N consisting of small blocks.
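A minimal sketch of the idea (LongStr and CHUNK_SIZE are hypothetical names of my own, and only a couple of methods are shown):

CHUNK_SIZE = 64 * 1024  # assumed block size; in practice it would be tuned

class LongStr:
    def __init__(self, text):
        # Split the payload into CHUNK_SIZE pieces, so no single
        # allocation ever needs a contiguous block of size N.
        self._chunks = [text[i:i + CHUNK_SIZE]
                        for i in range(0, len(text), CHUNK_SIZE)]

    def __len__(self):
        return sum(len(c) for c in self._chunks)

    def __str__(self):
        # Some operations need the contiguous form back; this is the
        # O(2xN) reconstruction mentioned above.
        return ''.join(self._chunks)

    def replace(self, old, new):
        # Chunk-by-chunk; only correct when `old` cannot straddle a
        # chunk boundary (e.g. a single character such as '\x00').
        result = LongStr('')
        result._chunks = [c.replace(old, new) for c in self._chunks]
        return result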

(research done by Heorhii Zatserklianyi)

body = text.strip().replace('\x00', '').encode('utf8')

Could you process the incoming text stream through a buffer window, chunk by
chunk? After all, you have a potential lead-in which you might want to skip
over (i.e. the blanks removed by str.strip) and an optional trail of blanks to
chop off at the end; apart from this, the only removal you need to apply,
based on your code, is the NUL byte.
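Something along these lines, perhaps (a rough sketch; the names are mine, clean_stream is not a real API):

def clean_stream(chunks):
    """Yield cleaned pieces of an iterable of str chunks: strip leading
    and trailing blanks and drop NULs, without ever building the whole
    string in memory."""
    seen_content = False   # are we past the leading blanks?
    pending_ws = ''        # trailing blanks held back until we know
                           # whether more content follows
    for chunk in chunks:
        chunk = chunk.replace('\x00', '')   # one NUL cannot straddle chunks
        if not seen_content:
            chunk = chunk.lstrip()
            if not chunk:
                continue
            seen_content = True
        stripped = chunk.rstrip()
        if stripped:
            yield pending_ws + stripped
            pending_ws = chunk[len(stripped):]
        else:
            pending_ws += chunk
    # pending_ws (the trailing blanks) is dropped, as str.strip would do

chunks = ['   hel', 'lo\x00 wo', 'rld   ']      # stand-in for a real stream
body = ''.join(clean_stream(chunks)).encode('utf8')
print(body)   # b'hello world'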

Or are you proposing a fluent programming interface, .step1().step2(), which can
be chained? I think that’s trickier as… if you were looking, for example, to
remove aaabc, you’d want to make sure you don’t split the sequence at the wrong
place, e.g. aa|(buffer-flush, window-advance)|abc, with the pattern not being
detected in the second window due to the partial match.
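The usual way around that boundary problem, sketched below with a hypothetical helper of my own, is to hold back the last len(pattern) - 1 characters of each window so a straddling occurrence gets rescanned together with the next chunk:

def remove_pattern(chunks, pattern):
    """Remove `pattern` from a stream of str chunks without missing
    occurrences that straddle a chunk boundary (assumes len(pattern) >= 2)."""
    keep = len(pattern) - 1          # overlap large enough for any straddler
    carry = ''
    for chunk in chunks:
        window = (carry + chunk).replace(pattern, '')
        # The held-back tail is shorter than the pattern, so it cannot
        # itself contain a full occurrence; it is rescanned next round.
        carry, emit = window[-keep:], window[:-keep]
        if emit:
            yield emit
    if carry:
        yield carry

print(''.join(remove_pattern(['xxaa', 'abcyy'], 'aaabc')))   # xxyy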

Marco Ippolito

You might be interested in a library like chunked buffer.
Matti

It’s not possible to do:

buffer = ChunkedBuffer(text)
body = buffer.strip().replace('\x00', '').encode('utf8')

That’s the whole point of my query.

I think you are talking about a rope data structure.

I’m not certain that your analysis of memory use as O(MxN) is strictly
accurate. I think that this assumes that the garbage collector doesn’t
run until the entire statement has completed, and I don’t know if that
is correct. Your argument is that in a chain of methods:

body = text.strip().replace('\x00', '').encode('utf8')

we end up with temporary memory usage of (approximately) four times the
original:

  1. the original text string;
  2. the temporary string returned by the strip() method;
  3. the temporary string returned by the replace() method;
  4. and the string returned by the encode() method.

But this assumes that each of these strings all coexist. Suppose the
garbage collector runs after replace() is called, but before encode() is
called. Then the temporary string from step 2 will be garbage collected
and its memory freed.

If you have a longer chain of methods, then I think it is more likely
that at least some of the temporary strings will be garbage collected,
and so O(MxN) is the worst-case scenario, and one more likely to apply
when you have only a few methods and M is small.

Likewise, sometimes the methods may not return a new string. For
example, step 2 above – the strip() method – may return the original
string in Python 3.9:

>>> s = 'abc def ... x/y/z'
>>> u = s.strip()
>>> s is u
True

so this will also tend to reduce the memory use, lowering the average
cost compared to the worst-case cost.

Anyway, this is a quibble, but I think an important one. Moving to a
rope data structure, or something similar, adds a lot of complexity, and
will have memory and performance costs. Even if ropes have better Big Oh
performance than strings, the extra complexity may still make them
slower than a naive contiguous array string. Either way, shifting to a
rope data structure is only worthwhile if it brings performance or
memory benefits. If we over-estimate the cost of a sequence of method
calls, then we exaggerate the benefit.

Nevertheless, this is an interesting idea that is worth further
investigation.

Presumably the cut-over point where the implementation swaps from a
naive string to a rope will have to be determined at run time, according
to the amount of memory available to the machine.
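If the threshold really had to come from the machine’s memory, it might look something like this sketch (make_string and the sizing rule are entirely hypothetical, LongStr is the sketch type from earlier in the thread, and os.sysconf is POSIX-only, hence the fallback):

import os

def _default_threshold():
    try:
        total = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')
        return max(total // 1024, 1 << 20)   # e.g. ~0.1% of RAM, >= 1 MiB
    except (AttributeError, ValueError, OSError):
        return 1 << 20                       # fallback: 1 MiB

THRESHOLD = _default_threshold()

def make_string(text):
    # Plain str below the cut-over, the rope/LongStr representation above it.
    return LongStr(text) if len(text) >= THRESHOLD else text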

Thanks for that pointer. I didn’t know that such a structure (which could be one of the solutions) had that name.

I think it’s an accurate worst case. Remember that the scenario is one of concurrency with a short period in which many long strings must be handled.

Also consider that, in this scenario, some of the transformations may produce slightly longer strings, which means that O(MxN) may happen even if the garbage collector kicks in between transformations.
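A toy illustration of that point: when every step yields a slightly larger string, the block freed after step k is too small to hold the result of step k+1, so the allocator keeps requesting fresh O(N) blocks:

import sys

text = 'a' * 10_000_000
sizes = [sys.getsizeof(text)]
step1 = text.replace('a', 'ab')      # ~2N characters
sizes.append(sys.getsizeof(step1))
step2 = step1.replace('b', 'bc')     # ~3N characters
sizes.append(sys.getsizeof(step2))
print(sizes)   # strictly increasing: no earlier block can be reused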

Having the transformation from str to longstr happen transparently would be wonderful, but I think there’s an opportunity to have longstr (and longbytes) be in the stdlib so the user can decide when it should apply.

len(dir(str)) is 80, but an implementation could cheat by defaulting longstr.method(...) implementations to longstr(str(self).method(...)) for the less commonly used methods (the conversion to str probably produces a string of the original size).
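That fallback could be generated, roughly like this (a sketch on top of the hypothetical LongStr from above):

def _delegate(name):
    def method(self, *args, **kwargs):
        # O(N) reconstruction, then the ordinary str method.
        result = getattr(str(self), name)(*args, **kwargs)
        return LongStr(result) if isinstance(result, str) else result
    method.__name__ = name
    return method

# The less commonly used methods fall back to the contiguous round-trip.
for _name in ('title', 'zfill', 'casefold', 'expandtabs'):
    setattr(LongStr, _name, _delegate(_name))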

Then, while thinking about this, the case can also be made for longlist.

Having the data structure’s name, the search was easy. It seems that some methods mutate the structure, and that a lot of str methods are missing, but the problem seems solved here:

In [1]: from pyropes import Rope

In [2]: print(set(dir(str)) - set(dir(Rope)))
{'splitlines', 'partition', 'join', 'startswith', '__getnewargs__', 'lstrip', 'rsplit', 'maketrans', 'rindex', 'rpartition', '__mod__', '__contains__', 'istitle', 'title', 'rstrip', 'isspace', 'format_map', 'rfind', 'casefold', '__rmul__', 'removesuffix', 'endswith', 'ljust', 'removeprefix', 'index', 'expandtabs', 'strip', 'rjust', 'zfill', '__rmod__', 'center', 'translate', 'encode', 'format', 'count', 'replace'}

In CPython, the “garbage collector” is only used for objects which hold references to other objects (and then only to break circular references). The reference counting takes care of deallocating an object once its reference count goes to zero.

In the above case, the intermediate results only live on the stack and have a ref count of 1, so after each method runs, the string it was called on is deallocated (unless, like text, it is still referenced elsewhere). Leaving interning and small string caching aside, this will result in O(N) memory use, not O(N*M).
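That claim is easy to probe empirically in CPython with tracemalloc’s peak counter; if reference counting frees each intermediate as soon as the next method returns, the peak should be a small multiple of N rather than MxN (the payload here is arbitrary):

import tracemalloc

text = ' ' + 'x\x00' * 5_000_000 + ' '   # roughly a 10 MB payload

tracemalloc.start()
body = text.strip().replace('\x00', '').encode('utf8')
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f'N ~ {len(text):,} chars, peak ~ {peak:,} bytes')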

Things are different in implementations which use true garbage collection, e.g. Jython.

I could try the full analysis of the sequence of transformations in the original post, or figure out a nastier one, but I’ll leave that as homework :)

Because the string after each transformation has a different, large size, O(MxN) applies as the worst-case scenario.

Remember that, because reference counting cannot reclaim the string being transformed until the method finishes, there’s an O(2xN) minimum on each transformation. Then, a method’s internals can easily fragment the already-freed blocks so that no blocks of size O(N) remain available.

The memory fragmentation that happens is normal, and fine for small strings. For very large strings, the sequence of transformations will bring O(XxN) memory into the pool for a single line of code, for some factor X, and I claim that X is ~M in the worst-case scenario.

It depends on what you call “memory use”. If unallocated, but currently still reserved, memory managed by pymalloc (or, for larger strings, by the system’s malloc) counts as “memory in use”, then you do have O(N*M) memory use. However, that’s not what you normally use as a metric for an algorithm.

If you want fast and memory-efficient in-place manipulation of strings, you’re probably better off with an implementation that mimics string methods on a tree structure (such as ropes or other balanced trees). Python’s string objects are immutable, which has many advantages for their use by the language, e.g. non-changing hash values, sharing, efficient reuse, interning, etc.

It’s the total malloc’ed memory that produces the problems.

And algorithms are involved. That’s why the post is titled as it is.