Will pickle change the hash value of the variable?

I wanted to use pickle to cache the data I load, but I found that when I use pickle.load to load the dumped variables, their sha1 values change.

You can reproduce the problem as follows. First, run GenerateDate.py to generate some data. Then run load.py to observe the problem.

In load.py, the sha1 value of the originally built variable (ori) is printed, as well as the sha1 values of the variables loaded with pickle.load in the two processes (load_data, load_data_more, load_data_2). You will find that these values differ.

For example, running it on my computer gives:

rank 0, ori hash is 2a0e60672abab8ba4816d2cfeb228797fa5ae2de
rank 1, load hash is 797e95e3048c36cca070c8f9778c908ece7fc59f
rank 0, load hash is 7d48bfc9a7e4282845a5eff0cfd00e29c10818b7
rank 0, more load hash is 7d48bfc9a7e4282845a5eff0cfd00e29c10818b7
rank 1, more load hash is 7d48bfc9a7e4282845a5eff0cfd00e29c10818b7
rank 1, 2 load hash is 7d48bfc9a7e4282845a5eff0cfd00e29c10818b7
rank 0, 2 load hash is 7d48bfc9a7e4282845a5eff0cfd00e29c10818b7

When I use the sha1sum program to calculate the sha1 of the dumped file (cache.pkl), it is the same as the sha1 of ori.

I want to know why.

GenerateDate.py

import pickle
import random
from pathlib import Path
from typing import Sequence

import numpy as np


def pickle_once(frame_names: Sequence, frame_data: np.ndarray, target_path: Path):
    target_path.mkdir(parents=True, exist_ok=True)

    with open(target_path / 'frame_names.pkl', 'wb') as f:
        pickle.dump(frame_names, f)
    with open(target_path / 'frame_data.pkl', 'wb') as f:
        pickle.dump(frame_data, f)


pose_root = Path(r'./pose_data')
silh_root = Path(r'./silh_data')



# Pose data: 100 subjects x 3 sequences, 17x3 keypoints per frame.
for i in range(100):
    for j in range(3):
        data_path = pose_root / f"{i:04}" / f"seq-{j}"
        frame_num = random.randint(10, 100)
        data = np.random.randn(frame_num, 17, 3)
        frame_name = [f'{_:05}' for _ in range(frame_num)]
        pickle_once(frame_name, data, data_path)


# Silhouette data: 100 subjects x 3 sequences, 64x44 masks per frame.
for i in range(100):
    for j in range(3):
        data_path = silh_root / f"{i:04}" / f"seq-{j}"
        frame_num = random.randint(10, 100)
        data = np.random.randn(frame_num, 64, 44)
        frame_name = [f'{_:05}' for _ in range(frame_num)]
        pickle_once(frame_name, data, data_path)

load.py

import multiprocessing as mp
import pickle
import re
from multiprocessing import Event
from pathlib import Path
from typing import Dict, List, Any
from hashlib import sha1

import numpy as np


class Sample:
    def __init__(self, sample_id, properties) -> None:
        self.id = sample_id
        self.properties = properties

    def __repr__(self):
        return f"ID {self.id}, properties {self.properties}"


class SequenceSample(Sample):
    def __init__(self, sample_id, properties, path, cache=False):
        super().__init__(sample_id, properties)
        self.path = path
        self._data = None
        self._names = None

        if cache:
            self._names = self.frame_names
            self._data = self.__load__()

    @property
    def data(self):
        """

        Returns:

        """
        if self._data is not None:
            return self._data
        return self.__load__()

    @property
    def frame_names(self):
        if self._names is None:
            with open(self.path / 'frame_names.pkl', 'rb') as f:
                self._names = pickle.load(f)
        return self._names

    def __load__(self):

        with open(self.path / 'frame_data.pkl', 'rb') as f:
            return pickle.load(f)


class PairSample(Sample):
    def __init__(self, sample_id, properties, pose_path, silh_path, cache: bool = False):
        super().__init__(sample_id, properties)
        self.pose_sample = SequenceSample(sample_id, properties, pose_path, cache)
        self.silh_sample = SequenceSample(sample_id, properties, silh_path, cache)
        self._common_frame_name = None
        self._common_frame_data = None

        if cache:
            self._common_frame_name = self.frame_names
            self._common_frame_data = self.data

    @property
    def data(self):
        if self._common_frame_data is not None:
            return self._common_frame_data
        return self.__load__()

    @property
    def frame_names(self):
        if self._common_frame_name is not None:
            return self._common_frame_name

        pose_name = self.pose_sample.frame_names
        silh_name = self.silh_sample.frame_names
        self._common_frame_name = sorted(list(set(pose_name) & set(silh_name)))
        return self._common_frame_name

    def __load__(self):
        pose_idx = np.asarray([self.pose_sample.frame_names.index(_) for _ in self.frame_names])
        silh_idx = np.asarray([self.silh_sample.frame_names.index(_) for _ in self.frame_names])
        return self.pose_sample.data[pose_idx], self.silh_sample.data[silh_idx]


class Person:

    def __init__(self, person_id):
        self.id = person_id
        self.samples = []

    def append(self, element):
        self.samples.append(element)

    def __getitem__(self, item):
        return self.samples[item]

    def __len__(self):
        return len(self.samples)

    def __iter__(self):
        return iter(self.samples)

    def __repr__(self):
        return f"ID {self.id}, samples number {len(self.samples)}"


def load_without_cache(pose_root, silh_root) -> List[Person]:
    if not isinstance(pose_root, Path):
        pose_root = Path(pose_root)
    if not isinstance(silh_root, Path):
        silh_root = Path(silh_root)

    people_dict: Dict[str, Person] = {}

    pose_seq_paths = sorted(list(pose_root.glob("*/*")))
    for pose_seq_path in pose_seq_paths:
        relative_path = pose_seq_path.relative_to(pose_root)
        silh_seq_path = silh_root / relative_path
        if not silh_seq_path.exists():
            continue

        pid, properties = relative_path.parts

        seq_properties = {}
        for prop in properties.split('_'):
            key, value = re.match(r"(\w+)-(\S+)", prop).groups()
            seq_properties[key] = value

        pair_sample = PairSample(pid, seq_properties, pose_seq_path, silh_seq_path)
        if len(pair_sample.frame_names) == 0:
            continue

        if pid in people_dict:
            person = people_dict[pid]
        else:
            person = Person(pid)
            people_dict[pid] = person
        person.append(pair_sample)

    people = list(people_dict.values())
    people = sorted(people, key=lambda x: x.id)
    return people


def get_hash(data: Any):
    sha = sha1()
    tar_data = pickle.dumps(data)
    sha.update(tar_data)
    return sha.hexdigest()


def cache_to_file(cache_path, data):
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
        f.flush()


def load_from_cache(cache_path: Path):
    with open(cache_path, "rb") as f:
        dataset = pickle.load(f)
    return dataset


def proc_func(rank: int, event: Event):
    cache_dir = Path("./cache")
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path: Path = cache_dir / "cache.pkl"

    if rank == 0:
        dataset = load_without_cache(r'./silh_data', r'./silh_data')
        print(f"rank {rank}, ori hash is {get_hash(data=dataset)}")
        cache_to_file(cache_path, dataset)
        event.set()
    event.wait()
    load_data = load_from_cache(cache_path)
    print(f"rank {rank}, load hash is {get_hash(data=load_data)}")

    load_data_more = load_from_cache(cache_path)
    print(f"rank {rank}, more load hash is {get_hash(data=load_data_more)}")

    load_data_2 = load_from_cache(cache_path)
    print(f"rank {rank}, 2 load hash is {get_hash(data=load_data_2)}")
    return load_data


if __name__ == '__main__':
    mp.set_start_method('spawn')
    barrier = Event()
    print(barrier.is_set())
    procs = []
    for i in range(2):
        proc = mp.Process(target=proc_func, args=(i, barrier))
        procs.append(proc)
        proc.start()

    for proc in procs:
        proc.join()
        proc.close()


  • Pickle doesn’t change any hashes. Pickling and hashing are unrelated. You asked this same question about a month ago. It is possible to have two different pickled binary representations (and thus two different sha1 values based on them) that still deserialize to the same complex object (see the example after this list).
  • If you want to use hashing to tell whether two objects are the same, it seems better to add __eq__ and/or __hash__ dunder methods to all the custom classes and use those. The hash of the pickled binary string has no value for this – using it may lead to incorrect conclusions about object identity.
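
For example, a minimal illustration (not taken from the code in this thread): two equal dicts built in different insertion orders pickle to different byte strings, yet both round-trip to equal objects:

import pickle

d1 = {"a": 1, "b": 2}
d2 = {"b": 2, "a": 1}

print(d1 == d2)                              # True: the objects are equal
print(pickle.dumps(d1) == pickle.dumps(d2))  # False: the byte strings differ
print(pickle.loads(pickle.dumps(d2)) == d1)  # True: round-trip preserves equality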

Yes, I asked the same question last month, but I didn’t provide enough of a demonstration at that time. This time I have provided a reproducible example.
I understand what you mean, but I’d rather know what makes them different.

In my example, even for the same pkl file, the sha1 values obtained from the first load and the second load are different. Is this normal? And why are they all the same afterwards?
If the same file is read, it should produce the same hash value.
I don’t know much about the internals of pickle, so I have a lot of doubts.

If I use pickle to cache variables, what should I use to verify the identity of the pkl file? (From what I understand, it seems a hash value is used for that?)

Thank you very much!! QWQ


I suspect you just have a race. Why don’t you print out the actual values here and see what they are?


It will have.

So, Iā€™m not really able to evaluate all your code ā€“ itā€™s just too much. What I did see is that your initial load without cache and the load with cache do seem to generate the same objects even though the binary pickled string of the dataset and the load_data may be different.
If the pickle dumps (the pickled binary strings) are different, then the sha1ā€™s of those dumps will of course also be different. Thatā€™s why I said you should not rely on that.

What I suggest is:

  • add __eq__ to all your custom classes, so that you can easily compare the actual equality of the top-level list that is loaded without cache and the list that is loaded after unpickling the saved file (see the sketch after this list)
  • you can then verify that those lists are indeed identical – that they have identical content (without an __eq__ method you will not really be able to do this, or it will be very messy code)
  • you can then also verify that the actual pickled strings may be different – but that’s unimportant, as long as unpickling reconstructs “the same” object as you had originally
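
A sketch of what those __eq__ methods could look like, assuming the class definitions from load.py above (the helper names sample_eq and sequence_sample_eq are just for illustration). Note that SequenceSample.data is a NumPy array, so a plain == comparison would be element-wise:

import numpy as np

from load import Sample, SequenceSample  # the classes defined above

def sample_eq(self, other):
    # Plain-attribute comparison for Sample (Person and PairSample can
    # be extended along the same lines).
    return (type(self) is type(other)
            and self.id == other.id
            and self.properties == other.properties)

def sequence_sample_eq(self, other):
    # frame_names is a list of strings; data is a NumPy array, so use
    # np.array_equal instead of == (which is element-wise for arrays).
    return (sample_eq(self, other)
            and self.frame_names == other.frame_names
            and np.array_equal(self.data, other.data))

Sample.__eq__ = sample_eq
SequenceSample.__eq__ = sequence_sample_eq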

That also occurred to me – but I think something else is going wrong. The different pickles also occur when you take out all the multiprocessing (which I did, since I got annoyed about not seeing exactly what caused the diff – I still don’t really see how or where it is caused, but the code is also just too complex, so I gave up on that part :slight_smile: )
It would be nice if @Wang-MieMie could provide a small code example to illustrate the original issue.


I tweaked load.py to do:

dataset_1 = load_without_cache(r'./silh_data', r'./silh_data')
dataset_2 = load_without_cache(r'./silh_data', r'./silh_data')
print(dataset_1 == dataset_2)

It printed False.


Sorry, there is no way I can make the code simpler, because when I use only a single PairSample or SequenceSample, this does not happen.
And if I let load_without_cache return only a single Person (return people[0]), the output is similar to the following:

rank 0, ori hash is d8dda8d636bb89c706f8865b8d49d87a77da98ed
rank 0, load hash is d8dda8d636bb89c706f8865b8d49d87a77da98ed
rank 1, load hash is d8dda8d636bb89c706f8865b8d49d87a77da98ed
rank 0, more load hash is 56bdfd39d6c977bc954d32a58468cff71b6139f7
rank 0, 2 load hash is 56bdfd39d6c977bc954d32a58468cff71b6139f7
rank 1, more load hash is 56bdfd39d6c977bc954d32a58468cff71b6139f7
rank 1, 2 load hash is 56bdfd39d6c977bc954d32a58468cff71b6139f7

Although in this case, the first load and the later loads still differ. QAQ

Such complex changes only occur when I return a List[Person]. (ori != first load != second load)

Sorry, Iā€™m not participating in the competition (although it is also available), I just want to use pickle to speed up my loading of data (because the data set consists of millions of very small files, which I want to read once and build The dataset is then pickled into a whole file to load.)

I meant a data race, not a competition, but based on the reply from @hansgeunsmeyer it seems like that’s not the issue.

I think that by itself is misleading, since those are lists of objects that don’t have (well-defined) __eq__ methods, so equality falls back on object identity (i.e. id(obj)). If those methods are added, the datasets will be equal.

Similar to:

class X:
    def __init__(self, a):
        self.a = a

x = X([1,2,3])
y = X([1,2,3])

>>> x.a == y.a
True
>>> x == y
False
>>> X.__eq__ = lambda self, other: isinstance(other, X) and other.a == self.a
>>> x == y
True

In this case, if you don’t define __eq__, you will also have that

pickle.loads(pickle.dumps(x)) != x

So, perhaps this is the core of the problem here? Is that what you were suggesting, Matthew?
It’s still not a complete answer, I think.

To come back to that question… To verify the correctness of the pickled file contents, I think the best method is to unpickle them and then assert that the unpickled object equals the original object. For this you have to define __eq__ methods in all your classes, otherwise you will get misleading results. You should also not look at the low-level binary pickle strings themselves, since those might be different even for “the same” object. (There is no reason to think there is any issue with pickling itself, though.)
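
A minimal sketch of that check, assuming __eq__ methods have been added to the classes in load.py:

import pickle

from load import load_without_cache  # the loader defined above

original = load_without_cache('./silh_data', './silh_data')
restored = pickle.loads(pickle.dumps(original))
assert restored == original  # compares the objects, not the byte strings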

To quickly check whether the files themselves are corrupt or not, you could use hashes – for instance, use the sha1 of a file’s binary contents as its filename and compare those later. But that’s a completely separate concern, unrelated to pickling. You could do something similar with any kind of binary file (or any kind of text file, too).
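
That file-level check could look like this sketch, which hashes the raw bytes on disk (what sha1sum does), in contrast to get_hash above, which re-pickles the in-memory object first:

from hashlib import sha1

def file_sha1(path):
    # Hash the bytes actually stored in the file, reading in chunks.
    sha = sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            sha.update(chunk)
    return sha.hexdigest()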


I think it may be that methods decorated with @property are called during pickling. In my example, frame_names would be the one called: frame_names contains no explicit function call, so it executes completely, while data does contain an explicit function call, which pickle detects, so it terminates the call.

Of course, this is just a guess, and it cannot explain why the sha1 differs across multiple loads.

No, that’s not the problem.

I think I found the issue. It’s a semi-bug caused by the use of pathlib.Path:

  • If you change the path member variable in SequenceSample to be a string instead of a Path, then load_without_cache and load_from_cache will always produce the same pickle dumps. (You then also need to tweak the load functions to convert that string back into a Path, or to use strings as filenames.)
  • If you don’t make that change but add __eq__ methods, you can see that the loaders really do return identical lists (as I said before), although in that case the pickled dumps will be different.

So this should do it:

class SequenceSample(Sample):
    def __init__(self, sample_id, properties, path, cache=False):        
        super().__init__(sample_id, properties)
        self.path = str(path)   # << only change here
        self._data = None
        self._names = None
        if cache:
            self._names = self.frame_names
            self._data = self.__load__()

    @property
    def frame_names(self):
        if self._names is None:
            with open(Path(self.path, 'frame_names.pkl'), 'rb') as f:
                self._names = pickle.load(f)
        return self._names

    def __load__(self):
        with open(Path(self.path, 'frame_data.pkl'), 'rb') as f:
            return pickle.load(f)

So now I wonder: is this a real bug in pathlib.Path, or is there still something else going on? I haven’t been able to reduce this behavior (generation of different serialized pickle dumps, even though those dumps deserialize to the same objects) to a minimal code example yet.

Perfect! Finally solved this strange problem.

But I don’t understand what you mean by ‘real bug’. Does it mean that pickling a Path itself triggers this problem?

I currently have no way to simplify this program further; when I try to simplify it more, the problem disappears.

Should we raise an issue on GitHub?

Yes, I don’t know whether this is somehow an issue in pathlib.Path itself. The issue would be that member variables that are Paths serialize to different pickled serializations (in some cases). It’s not a major issue, since unpickling works fine, but it’s unexpected and an annoyance. It’s also not reproducible with simpler classes, yet very reproducible in your code. But I think we cannot/shouldn’t raise an issue on GitHub – or even open a separate thread – unless we can find a more minimal code snippet that exhibits this behavior.


Does this mean that the problem is caused by the combination of Path and certain code fragments during pickling, and is not a problem with Path itself?
I am very unclear about the specific workings of pickle; in fact, I do not understand the Python source code. QAQ
This may require a detailed analysis of pickle’s process to find out where the problem lies.
In fact, when I printed out the pickled binary strings, I found that there were only a few small differences.

Right – there is nothing wrong with the Path variables themselves. Also, there is nothing really wrong with pickling. It’s just that the pickled serializations of “the same” objects can be slightly different when those objects are container-like objects that include Path member variables. (And I believe in other scenarios too, though I cannot think of any right now.)

When using pickle, the basic contract is just that unpickling a pickled object gives you back “the same” object. It does not guarantee that the pickled serializations of that object are always identical as binary strings. In that sense this is just an annoyance, not a “real” bug.
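
One concrete illustration (not necessarily the cause in this thread): the pickle protocol version alone already changes the bytes while preserving the round trip:

import pickle

x = {'a': 1, 'b': [1, 2, 3]}
p2 = pickle.dumps(x, protocol=2)
p5 = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)

print(p2 == p5)                              # False: different wire formats
print(pickle.loads(p2) == pickle.loads(p5))  # True: equal objects either way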


I have never quite understood why you say that the pickled binary strings can be different. In my view, pickle produces a binary string A and saves it to the file; when read back, it should still be A.
What I was discussing is that the file written and the file read should be exactly the same, and pickle just converts that file into a Python variable.

The hash value is just there to verify that the file is the same when saving and reading.

I know what you’re saying: variables with the same literal value may have different hash values.
However, in my case they are the same variable.

Iā€™m very sorry, I forgot that I verified the hash after re-pickling. When verifying files directly, they were always the same as ori.


We can make an analogy to compression: you can compress data in two different ways, giving you two files that look different, but when you decompress them you get the exact same data. The exact bytes of the compressed version are not important.

The reason pickle might produce different files comes down to other factors, but it’s the same situation: the files differ, but they represent the same thing. You can use the file contents as a key for caching, but it will be less efficient than using the real value.
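
To make the analogy concrete, a small sketch with zlib: two compression levels give different bytes that decompress to identical data:

import zlib

data = b'some repetitive data ' * 1000
fast = zlib.compress(data, level=1)
best = zlib.compress(data, level=9)

print(fast == best)                                            # False: bytes differ
print(zlib.decompress(fast) == zlib.decompress(best) == data)  # True: same data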
