Testing formula in list_resize(…)

Hi! In “Objects / listobject.c”, list_resize(…) function is defined (which “over-allocates proportional to the list size, making room for additional growth”).

The body of the function includes the formula

new_allocated = ((size_t)newsize + (newsize >> 3) + 6) & ~(size_t)3;

… and the “growth pattern”:

* The growth pattern is: 0, 4, 8, 16, 24, 32, 40, 52, 64, 76, ...

I thought that it would be interesting to write a program which would test these two (interconnected) things, …

import sys

MAX_APPENDINGS = 130
W = 5                   # column width
WORD_SIZE = 8           # bytes
FACTOR = 4              # for testing alignment

a = []
HEADER_SIZE = prev_size = sys.getsizeof(a)
n_cells = 0

print('Theory | Actual')

for i in range(MAX_APPENDINGS):
    a.append(0)
    if (size := sys.getsizeof(a)) != prev_size:
        print(f'{' ' * W if not i or theor == n_cells
                    else theor:{W}}{n_cells:{W}}')
        theor = n_cells + (n_cells >> 3) + 6 & ~3
        n_cells = (size - HEADER_SIZE) // WORD_SIZE
        prev_size = size
        if theor % FACTOR or n_cells % FACTOR:
            raise Exception(
                f'Some numbers are not multiples of {FACTOR}')

… and here is its output:

Theory | Actual
         0
         4
         8
   12   16
        24
        32
        40
   48   52
        64
        76
   88   92
       108
  124  128

In this table, left column, “Theory”, shows results of applying a formula, and the right one, “Actual”, — the result of the immediate working of the program. Numbers in “Theory” are printed only in cases when they do not coincide with the numbers in “Actual”.

So as we can see,

— on the one hand, the first ten numbers in “Actual” coincide with the sequence quoted above (“growth pattern”), which demonstrates that in certain respects our program works correctly, but

— on another hand, although “Theory” often coincides with “Actual”, sometimes it does not. Why?

Since in those cases when the numbers do not coincide, we see that Actual [i] == Theory [i] + 4, the first version which came to my mind is that this is related to the alignment, but when I looked at the numbers more carefully, it turned out not to be the case (although I’m not sure about that), because:

— all the numbers are multiples of 4, but not all of them are multiples of 8. This is tested in the last 3 lines of the program: if FACTOR == 4, exception is not generated, but if FACTOR == 8, exception is generated;

— if our hypothesis is correct, we should expect that the “Actual” numbers are better “aligned” than the “Theory” numbers, — in other words, we expect that in those cases when Actual [i] != Theory [i], Actual [i] is multiple of 8 whereas Theory [i] is not. But this is not always the case. Sometimes this is the case [in the cases of (12, 6) and (124, 128)], but some this is not the case [in the cases of (48, 52) and (88, 92)].

Finally, here are some comments on the code: size and prevsize are sizes of the list measured in bytes, whereas n_cells and theor are, respectively, actual and theoretical numbers of cells. Each cell consists of WORD_SIZE bytes. The list object consists of the header of the constant size (HEADER_SIZE) and of the dynamic part consisting of n_cells cells. n_cells is not the same as the len() of list, because resizing of list involves **over-**allocation of memory, so as to make these resizings occurring seldom. The last several sentences reflect my understanding of the internals of the lists in Python, which may be imprecise, but upon which the program is based.

So to repeat, the question is that I don’t fully understand how the formula words, or, in other words, why the numbers from the “Theory” column of the table do not always coincide with the ones from the “Actual”.

n_cells isn’t newsize. Use len(a):

theor = len(a) + (len(a) >> 3) + 6 & ~3
2 Likes

Thank you, Stefan; indeed, now the program works correctly.

… however, I think that your reply can be slightly improved:

since at those moments when theor is being computed, len(a) == n_cells + 1 (but, of course, this is not true at the other moments), we can replace, in the formula proposed by you, len(a) with n_cells + 1, and then we can simplify the resulting formula in the following way:

taking into account that n_cells is always even [not to say that it is multiple of 4 (which is achieved by … & ~3)], we get …

n_cells + 1 >> 3 == n_cells >> 3

… and, of course,

n_cells + 1 + (…) + 6 … == n_cells + (…) + 7 …

So the final formula is:

theor = n_cells + (n_cells >> 3) + 7 & ~3

… which differs from the original one just by replacement of 6 with 7.

This formula is slightly better that the one proposed by you for the following two reasons: 1) it computes slightly faster, 2) it is simpler from the mathematical point of view: we can analyze the behavior of list_resize(…) by means of a simple recurrent formula, without the need to actually construct any list objects, and so, for instance, we get the “growth pattern” in the following simple way:

n = 0
for _ in range(9):
    print(n := n + (n+1 >> 3) + 7 & ~3,
        end = ' ')

I think that’s worse, not better. Speed doesn’t matter here, and clarity suffers. You now made two modifications to the actual formula that cancel out, and you had to write some proof that they do cancel out.

I agree. If our purpose is just to test something experimentally, then it’s better to refrain from subsequent mathematical manipulations, because, even if they really simplify something, they may contain errors. So from this perspective, my previous message was off-topic, sorry.

This message does not constitute a question, — just a correction of my last but one message. I erred there when I wrote: “n_cells is always even”. While this is true in the program written in the first message, but this program does not perform the most general testing, — here is why.

In that program, I started from an empty list and then appended a zero at each step to it. In such circumstances, resizings of the list happened only at those moments when n_cells == len(a) - 1 == 0, 4, 8, 16, 24, 32, 40, 52, 64, 76, ...

But I could perform testing of this formula in another way: by starting not from an empty list, but from a list of an arbitrary length. This would be a more general way of testing, because in this case, n_cells will not make jumps, — it will be equal to all the consecutive numbers: n_cells == 0, 1, 2, 3, 4, 5, 6, 7, ...

In these circumstances, n_cells will not always be even, and so, the only transformation that we can apply to the formula, is to replace newsize with n_cells + 1 (which is slightly more convenient when our purpose is not changing the CPython source code, but just a theoretical analysis of the formula). So here is the most general test of the slightly transformed formula:

import sys

size_of_header = sys.getsizeof([])
size_of_item = sys.getsizeof([None]) - size_of_header

for initial_len in range(10**4):
    a = [None]*initial_len
    new_size_theor = (initial_len + (initial_len + 1 >> 3) + 7 & ~3) \
                        *size_of_item + size_of_header
    a.append(None)

    if sys.getsizeof(a) != new_size_theor:
        print('not equal')
        break
else:
    print('ok')

Output:
ok