[pickle] Original data size is greater than the deserialized one using pickle protocol 5

Hi there,

I am trying to serialize/deserialize an object using pickle protocol 5. Having looked at the data size of the object before serialization and after deserialization, I wonder why the original data size is greater than the deserialized one. Is that an issue or expected behavior? Note that this is a simplified example: in the original code, serialization takes place in one process and deserialization in another, to exchange data between processes.

import numpy as np
import pickle as pkl
import sys


buffers = []

def callback(pickle_buffer):
    # Collect the buffer out-of-band; returning a falsy value tells
    # pickle not to serialize it into the pickle stream itself.
    buffers.append(pickle_buffer)
    return False

array = np.array([1, 2, 3] * 10000)

packed_data = pkl.dumps(array, protocol=5, buffer_callback=callback)
unpacked_data = pkl.loads(packed_data, buffers=buffers)

print(sys.getsizeof(array))
# 240112
print(sys.getsizeof(unpacked_data))
# 112
print(all(np.equal(array, unpacked_data)))
# True

Thanks in advance!

When I try your code, I get this error:

>>> packed_data = pkl.dumps(array, protocol=5, buffer_callback=callback)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'buffer_callback' is an invalid keyword argument for dumps()

So I can’t replicate your results.

But if they are correct, my guess is that it probably has something to do with numpy array views vs. copies.

It is strange that you get that error. What is your Python version? Note that the pickle docs say: “Changed in version 3.8: The buffer_callback argument was added.”

I just noticed that even without pickle protocol 5, the original data size is larger than the deserialized one. Also, if it had something to do with numpy array views, then making a change in the original array would be reflected in its view. However, this is not the case:

import numpy as np
import pickle as pkl
import sys


array = np.array([1, 2, 3] * 10000)

packed_data = pkl.dumps(array)
unpacked_data = pkl.loads(packed_data)

array[0] = 1111111111111

print(array)
# [1111111111111             2             3 ...             1             2
#              3]
print(unpacked_data)
# [1 2 3 ... 1 2 3]

print(sys.getsizeof(array))
# 240112
print(sys.getsizeof(unpacked_data))
# 112
print(all(np.equal(array, unpacked_data)))
# False

sys.getsizeof() is very limited – it gets the size of the Python object, but does not include the size of objects contained in that object:

In [32]: l = [1,2,3]

In [33]: junk = l * 100

In [34]: l2 = [junk]

In [35]: sys.getsizeof(l)
Out[35]: 88

In [36]: sys.getsizeof(l2)
Out[36]: 64

so the second list is “smaller”, even though it holds a large object.

numpy arrays store a pointer to the data block that they work with; it looks like sys.getsizeof does include the size of the data block:

In [40]: arr1 = np.zeros(10)

In [41]: arr2 = np.zeros(1000)

In [42]: sys.getsizeof(arr1)
Out[42]: 192

In [43]: sys.getsizeof(arr2)
Out[43]: 8112

but not if the array is a “view” onto another array:

In [47]: arr3 = arr2[:]

In [48]: sys.getsizeof(arr3)
Out[48]: 112

getsizeof must be reporting the data size only when the array “owns” its data:

In [52]: arr2.flags
Out[52]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

In [53]: arr3.flags
Out[53]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

I’m surprised that the unpickled array does not appear to own its data, but that seems to be the case – check the flags.
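To make that concrete, the flags of a freshly unpickled array can be checked directly – a minimal sketch:

import numpy as np
import pickle as pkl

unpacked = pkl.loads(pkl.dumps(np.zeros(10)))

print(unpacked.flags['OWNDATA'])
# False – the unpickled array does not own its buffer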

But anyway, it’s working – the values of the array are correct – so does this matter?

Oh, that is interesting. If we make a change in a view of the original array, as in your example, it is reflected in the original array. However, if we make a change in the unpickled array, it is not reflected in the original array:

import numpy as np
import pickle as pkl
import sys

# case 1
arr1 = np.zeros(10)
arr2 = arr1[:]
arr2[0] = 555
# the change is reflected in both arrays
print(arr1)
# [555.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
print(arr2)
# [555.   0.   0.   0.   0.   0.   0.   0.   0.   0.]

# case 2
arr3 = np.zeros(10)
packed_data = pkl.dumps(arr3)
unpacked_data = pkl.loads(packed_data)
unpacked_data[0] = 555
# the change is not reflected in the original array
print(arr3)
# [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
print(unpacked_data)
# [555.   0.   0.   0.   0.   0.   0.   0.   0.   0.]

I am wondering, then: what object owns the data of the unpickled array? Doesn’t this seem to be an issue?

It’s puzzling, yes, but have you had any problems?

Hmm, if the array doesn’t think it owns the data block, then what does? And if the answer is nothing, then there may be a memory leak here, as nothing would ever delete the data when the array is done with it.
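One way to probe that: numpy exposes whatever object does hold the buffer through the array’s .base attribute, just as it does for views. A sketch – exactly what base is for an unpickled array is a numpy implementation detail and may vary between versions:

import numpy as np
import pickle as pkl

unpacked = pkl.loads(pkl.dumps(np.zeros(10)))

print(unpacked.base is not None)
# True – some object owns the buffer, and the array keeps a reference
# to it, so the memory is released together with the array
print(type(unpacked.base))
# implementation detail – possibly the bytes object that came out
# of the pickle stream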

Yes, as noted at the beginning of my post, in the original code serialization takes place in one process and deserialization in another, to exchange data between processes. And I want to know the data size on both the sender side and the receiver side. However, it looks like I can’t rely on sys.getsizeof for that.
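The serialized payload itself can at least be measured: the in-band stream has a length, and each out-of-band PickleBuffer exposes its memory as a memoryview via .raw(). A sketch:

import numpy as np
import pickle as pkl

buffers = []
array = np.array([1, 2, 3] * 10000)
packed_data = pkl.dumps(array, protocol=5, buffer_callback=buffers.append)

# sender side: in-band stream plus out-of-band buffers
print(len(packed_data) + sum(b.raw().nbytes for b in buffers))

# receiver side: nbytes is the size of the array's data block,
# whether or not the array owns it
unpacked = pkl.loads(packed_data, buffers=buffers)
print(unpacked.nbytes)
# 240000 (30000 elements of 8 bytes each)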

Where do you think would be the right place to report the issue? Would it be numpy?

Then sys.getsizeof is not the right tool – if you want to know how big a numpy array is, use array.size for the number of elements, or array.nbytes for the size of the data in bytes.
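For example:

import sys
import numpy as np

arr = np.zeros(1000)
view = arr[:]

print(arr.size, view.size)
# 1000 1000 – number of elements, the same for a view
print(arr.nbytes, view.nbytes)
# 8000 8000 – bytes of data, the same for a view
print(sys.getsizeof(arr), sys.getsizeof(view))
# 8112 112 – counts the data block only when the array owns it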

But it seems you are trying to debug numpy when there’s no indication of a bug :-) – a numpy array will be the same size after being unpickled unless numpy pickling is broken, and I don’t think you’ve had any other indication of an issue, have you?

Yes, the numpy list is the place to ask about this – I don’t think anything is broken, but it is interesting.

Nope.

I will ask them. Thanks a lot for the discussion.