Integer "memory leak"

Apparently I can make for example the int 257 take arbitrarily much memory:

import tracemalloc as tm

s = '0'*10**7 + '257'
tm.start()
i = int(s)
print(i, tm.get_traced_memory()[0])

Output with Python 3.10.2 shows it takes 4.4 MB:

257 4429264

Are there other ways besides using int() to cause this? I tried for example adding two very long numbers whose sum is 257, but that didn’t cause it.

How about this one:

Python 3.10.7 (main, Sep 10 2022, 07:37:51) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tracemalloc as tm
>>> x = 1 << 30000000
>>> y = x - 257
>>> tm.start()
>>> i = x - y
>>> print(i, tm.get_traced_memory()[0])
257 4009531

Interesting. I confirm that. I had tried subtraction, but slightly differently: If I instead use y = x + 257 and i = y - x, it’s only 28 bytes.

(Though I’m testing with Py 3.8 pre-release at TIO now )

Yep. Take a look at the function x_sub in longobject.c, which makes an effort to strip leading “digits” (i.e., limbs) of x and y while they match. In the case y = x + 257, the leading digits match all the way down to the lowest digit. In the case y = x - 257, there’s a mismatch in lengths right at the start.

Here’s another example:

Python 3.10.7 (main, Sep 10 2022, 07:37:51) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tracemalloc as tm
>>> x = 10**10**7
>>> y = x + 257
>>> tm.start()
>>> i = y % x
>>> print(i, tm.get_traced_memory()[0])
257 4438801

In other words, if you’re looking for a general principle that Python ints shouldn’t waste space, then I’m afraid that there isn’t one. There’s an effort not to gratuitously waste space, but that’s not quite the same.

3 Likes

Is that the size of i? Or is it the amount of memory that the C runtime and Python runtime allocated?

:>>> x = int('0'*10**7 + '257')
:>>> sys.getsizeof(i)
28

i is only 28 bytes.

I expect that the rest of the memory that tracemalloc is reporting is used to process the int(s) and has been put into look aside lists for fast reuse later. I recall hearing about
such lists in python and C runtimes - but have not read the python code to confirm.

Yeah, unfortunately int.__sizeof__ is not perfectly accurate in these cases, as it computes only based on Py_SIZE(self), which is set to the “virtual” size, i.e., big enough to only include the last nonzero digit, regardless of the actual size of the underlying allocated memory.

2 Likes

As far as I can tell, it’s allocated and not reused (as long as the int lives).

If I create a list of a hundred such ints, allocation accordingly rises to 443 MB:

import tracemalloc as tm

a = ['0'*10**7 + '257'] * 100
tm.start()
b = list(map(int, a))
print(f'{tm.get_traced_memory()[0]:,}')

Output (Try it online!):

442,927,376

That seems very odd behaviour. Which I guess is what you are getting at.

Yes, that’s why I called it “leak”, as it keeps temporarily used memory allocated. In quotes because it does get freed when the ints get deleted and because it’s apparently not an oversight.

int() and calculations use long_normalize:

/* Normalize (remove leading zeros from) an int object.
   Doesn't attempt to free the storage--in most cases, due to the nature
   of the algorithms used, this could save at most be one word anyway. */

static PyLongObject *
long_normalize(PyLongObject *v)

Sounds like it was decided that larger savings are too rare or insignificant to care about.

What seems odd is that there are so many leading zeros.
I guess that is down to the string to int conversion algorithm guessing the size required for the result assuming no leading “0” in the input.
And thus avoiding a reallocate of the int data,

FWIW, I’d be happy to accept a PR that modified the str to int conversions to take leading zeros into account before allocating space for the result. It would still end up overestimating the number of limbs needed in some cases, but with some care we shouldn’t ever be overestimating by more than one limb for reasonably-sized numbers.