pickle.dumps gives different results for variables with the same content

I call the same function three times. The byte string that pickle.dumps produces for the first return value is always different from the other two, which are identical to each other.
Note that the contents of these variables are the same; the differences only appear after dumps.
I discovered this by computing the sha1 of the pickled bytes.
The problem is 100% reproducible in my program, but my attempts to reduce it to a simple sample have failed.
I hope the discussion group can give me some advice.

Below is the code where I see the problem:

            dataset = self.load_without_cache(**kwargs)
            dataset2 = self.load_without_cache(**kwargs)
            dataset3 = self.load_without_cache(**kwargs)

            d1_sample = dataset[0][0][0].ss_sample
            d2_sample = dataset2[0][0][0].ss_sample
            d3_sample = dataset3[0][0][0].ss_sample

            for name in dir(d1_sample):
                d1_mem = getattr(d1_sample, name)
                d2_mem = getattr(d2_sample, name)
                d3_mem = getattr(d3_sample, name)

                d1hash = sha1()
                d1_byte = pickle.dumps(d1_mem)
                d1hash.update(d1_byte)

                d2hash = sha1()
                d2_byte = pickle.dumps(d2_mem)
                d2hash.update(d2_byte)

                d3hash = sha1()
                d3_byte = pickle.dumps(d3_mem)
                d3hash.update(d3_byte)
                print(f"{name}'s d1 == d2? ---> {d1_mem == d2_mem}")
                print(f"{name}'s d2 == d3? ---> {d2_mem == d3_mem}")

                print(f"{name}'s d1hash == d2hash? ---> {d1hash.hexdigest() == d2hash.hexdigest()}")
                print(f"{name}'s d2hash == d3hash? ---> {d2hash.hexdigest() == d3hash.hexdigest()}")

                if d1hash.hexdigest() != d2hash.hexdigest():
                    print(f"{name}'s d1 and d2 hashes differ")

                if d2hash.hexdigest() != d3hash.hexdigest():
                    print(f"{name}'s d2 and d3 hashes differ")
                print('')

I iterate through all attributes of one of the samples, and some of them have the same content but different bytes after dumps.

d1_sample and d2_sample are different after dumps, while d2_sample and d3_sample are always the same.
But the content inside is the same: comparing with == returns True.
(screenshot of the output)

I hope the discussion group can give me some advice to help me troubleshoot the problem.

Have you looked at the bytes that are produced and looked for a difference?

There is information about the pickle format in the pickle (Python object serialization) page of the Python 3.12.0 documentation, and the pickletools (Tools for pickle developers) page describes tools for inspecting pickle files.
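As a rough sketch of that kind of inspection, the snippet below disassembles two pickles with pickletools.dis and prints them side by side. The helper name disassemble and the sample dicts are made up for illustration; in the real program you would pass the two attributes whose hashes differ.

```python
import io
import pickle
import pickletools


def disassemble(obj) -> str:
    """Return the pickletools disassembly of obj's pickle as text."""
    out = io.StringIO()
    pickletools.dis(pickle.dumps(obj), out=out)
    return out.getvalue()


# Illustrative inputs only: equal values (0 == 0.0) of different types.
left = disassemble({"value": 0})
right = disassemble({"value": 0.0})

# Print both listings line by line and flag the lines that differ.
for line_left, line_right in zip(left.splitlines(), right.splitlines()):
    marker = "  " if line_left == line_right else "!="
    print(f"{marker} {line_left:<45} | {line_right}")
```

The lines flagged with != show which opcode or argument is responsible for the different bytes, which is usually more informative than the hash alone.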

Why is that a problem for you?

And are the dicts completely identical, including order and data types, also deeper in the structure if it isn’t flat?

class X:
    pass

>>> x = X()
>>> y = X()
>>> x.__getattribute__ == y.__getattribute__
False
>>> pickle.dumps(x.__getattribute__) == pickle.dumps(y.__getattribute__)
True  # not equal, still the same pickle string

In the above, x.__dict__ and y.__dict__ are identical (empty dicts), but

>>> x.a = x
>>> y.a = x
>>> x.a == y.a 
True    # of course
>>> pickle.dumps(x.a) == pickle.dumps(y.a)
True
>>> x.__dict__ == y.__dict__
True
>>> pickle.dumps(x.__dict__) == pickle.dumps(y.__dict__)
False  # equal objects, yet different pickle strings

So, I don’t know what is happening in your code, since you didn’t show what ss_sample is or what its attributes are (or how __eq__ is defined for various attributes), but it’s pretty easy to create examples where two objects that are equal (__eq__) pickle to different byte strings, or two that are not equal pickle to the same byte string.

In general this is not really relevant or problematic for pickling/unpickling, I think. As far as I know it could only be relevant when an object contains a self-reference (or when __eq__ has a special definition, or lacks one but really needs it). For those kinds of objects it could sometimes be better to define a custom pickler (see the pickle docs), even though in general that is not necessary either, since pickle handles the self-reference correctly.
So, I have the same question as @pochmann: Why is what you noticed a problem for you?
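Object identity matters even without a self-reference, because pickle keeps a memo of objects it has already written and emits a short back-reference when it meets the same object again. The sketch below shows this on CPython with two equal lists; the variable names are made up, and nothing here confirms that this is what happens in the code in this thread.

```python
import pickle

# Build two equal strings at run time so they are separate objects.
shared = "".join(["long", "-", "string"])
copy = "".join(["long", "-", "string"])

same_object = [shared, shared]   # second item is pickled as a memo reference
two_objects = [shared, copy]     # second item is written out in full

print(same_object == two_objects)
print(pickle.dumps(same_object) == pickle.dumps(two_objects))
```

The first comparison is on values and the second on bytes, so the lists compare equal while their pickles differ.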


Some more examples of equality not matching pickle equality:

from pickle import dumps as p

a, b = object(), object()
print(a == b, p(a) == p(b))

a = b = float('nan')
print(a == b, p(a) == p(b))

a, b = 0, 0.0
print(a == b, p(a) == p(b))

a, b = {0, 8}, {8, 0}
print(a == b, p(a) == p(b))

Output (Attempt This Online!):

False True
False True
True False
True False

I understand your example.
But the main point I want to discuss is the following situation:

from pickle import dumps
from pathlib import Path
from hashlib import sha1


class Test:
    def __init__(self, name, age, path) -> None:
        self.name = name
        self.age = age
        self.path = path


path = Path('./test')
name = 'test'
age = 11

t1 = Test(name, age, path)
t2 = Test(name, age, path)

t1hash = sha1()
t1hash.update(dumps(t1))
print(f't1 hash is {t1hash.hexdigest()}')

t2hash = sha1()
t2hash.update(dumps(t2))
print(f't2 hash is {t2hash.hexdigest()}')

I create two instances of a class with the same parameters. Should their sha1 values be the same, different, or random?

My output screenshot shows that, of three instances created with the same method, 1 and 2 are different but 2 and 3 are the same.
To be precise, only 1 differs from the others; 2, 3, 4, … all have the same sha1 value.

As I replied to Hans Geuns-Meyer, the specific content I want to discuss is not whether such a situation exists, but whether the sha1 values of objects instantiated with the same parameters should be the same. QWQ

Thank you very much for your suggestions. I’ll try pickletools to see if it can help me.
I may still need to ask you if I have any other questions.

In your last example the sha1 values are identical, because they get the same input. The sha1 is a deterministic function. But I don’t think there is a general answer to your question. If two instances (of just any kind of class) are created with the same parameters, or even if they are equal (x == y), then this does not imply that their pickled representations are identical.

None of that, I’d say. Pickling is for serializing and de-serializing, not for hashing. Why are you using it for hashing?

The logic of this part of the code is as follows. I have a dataset made up of a lot of small files, and the program is multi-process. So I want to read the data and dump it in the main process, then load it in the other processes. The content that is read also differs between configurations, so I use the sha1 as the file name to make sure the right cache is read.
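If the file name only needs to identify the configuration, one option is to hash the configuration itself instead of the pickled data. This is a minimal sketch, assuming the settings fit in a JSON-serializable dict; cache_key and the config keys are hypothetical names, not part of the code in this thread.

```python
import json
from hashlib import sha1


def cache_key(config: dict) -> str:
    """Hypothetical helper: derive a file name from the configuration."""
    # sort_keys and fixed separators give the same text for equal dicts,
    # regardless of the order in which the keys were inserted.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return sha1(canonical.encode("utf-8")).hexdigest()


a = cache_key({"split": "train", "max_len": 128})
b = cache_key({"max_len": 128, "split": "train"})
print(a == b)
```

The key then depends only on the values in the configuration, not on how the loaded objects happen to be laid out in memory.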

It is an interesting question to ask: why is the pickle content not stable?
That is separate from whether this is a reasonable property to depend on.

Depending on what you mean by stable, that has either already been answered (well, at least some reasons have been shown) or has not been demonstrated to be the case.

Ok, I think I somewhat understand what you’re doing, but it’s still not clear to me what problem you have. To ensure correct reading, read the file in binary mode, compute its sha1 and compare with the file name? What goes wrong there? Or do you unpickle the file, re-pickle it, and check the sha1 of that, and that fails?
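For the first of those options, a sketch of the check could look like this. It hashes the raw bytes on disk and never unpickles anything; file_sha1 and the temporary file are made up for illustration.

```python
import tempfile
from hashlib import sha1
from pathlib import Path


def file_sha1(path: Path) -> str:
    """Hash the raw bytes of a file, reading it in binary mode."""
    digest = sha1()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large cache files are not loaded at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    payload = b"example cache contents"  # stands in for the pickled dataset
    cache_file = Path(tmp) / sha1(payload).hexdigest()
    cache_file.write_bytes(payload)
    matches = file_sha1(cache_file) == cache_file.name
    print(matches)
```

Because the bytes on disk do not change between writing and reading, this comparison does not depend on whether re-pickling the loaded object gives the same bytes.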

Ok, with the following code, I will describe in detail what problem I found.

        if not nor_cache_path.exists() and global_param.local_rank == 0:
            dataset = self.load_without_cache(**kwargs)
            self._cache_to_file(nor_cache_path, dataset)
        distributed.barrier()
        return self.load_from_cache(nor_cache_path)

If the cache does not exist yet, the main process reads and builds the dataset, then caches the entire dataset using dumps.
Then all processes (including the main process) read the cache file and return.
The problem is that, in the main process, the sha1 of the data I dumped is different from the sha1 of the object obtained by load.

So I did some digging.
At first I thought it was a pickle problem. So I called load_without_cache() twice directly and calculated the sha1 of the dataset built each time. Then I discovered that it was not a problem with pickle: the sha1 values of the two return values were themselves different.