Consistent hash()

I am coding some classes and I wish to define the __hash__ method in order to use it to create some consistent IDs for the objects of these classes so that if their data structure (some other instance variables) do not change, then the ID is the same.
By using hash() though I understood by trials and errors that it changes across different python runs. Is there a pythonic solution to handle this use case?

For instance, I’m trying this but I am still debugging some run-time errors due to the complexity of handling either int or str or bytes with the different functions.

from typing import List, Optional, Union
from pydantic import BaseModel, Field, computed_field
import hashlib
import base64

class File(BaseModel):
    filename: str
    content: Optional[Union[str, bytes]] = None
    encoding: Optional[str] = None
    id_attachment: Optional[str] = Field(default=None, init=False)

    @computed_field
    @property
    def extension(self) -> str:
        return os_split_extension(self.filename)[-1].lower()
    
    def __hash__(self):
        if isinstance(self.content, str):
            # Decode content if it's base64 encoded
            content_bytes = base64.b64decode(self.content)
        else:
            content_bytes = self.content or b''

        # Concatenate filename and content bytes
        hash_input = self.filename.encode() + content_bytes
        return int.from_bytes(hashlib.sha256(hash_input).digest(), byteorder='big')

class EmailAddress(BaseModel):
    nickname: Optional[str] = None
    address: str

    def __str__(self):
        return self.address
    
    def __hash__(self):
        # Concatenate nickname and andress
        hash_input = (self.nickname or "").encode() + self.address.encode()
        return int.from_bytes(hashlib.sha256(hash_input).digest(), byteorder='big')

I don’t really like my code because it feels like it is not clean and prone to errors. How can I be consistent here? Can I exploit some libraries or pythonic ways to handle this better?

Also I wish to build other models from these ones, for instance. Should I somehow save the hash in order to build the hash of the composite model by getting the hashes of the instance variable models?

I would recommend the attrs library.

If you are ok with freezing the class, you can have a hash that should be invariant between sessions.

If you don’t want to freeze the class, the most reasonable hash depends on the pointer to the object, which will not be independent between sessions.

A hack I can think of is to create a frozen copy of your object, and then calculate the hash of that object. But it seems like a bad idea to implement such a hack in __hash__. That’s just asking for bugs and errors.

Try setting the PYTHONHASHSEED environment variable.

1 Like

Thank you for inputs, I am reading about attrs but I am not sure wether you are suggesting to use it instead of pyndatic or to mix them together (how?).

By the way, my end goal is to be able to compute those IDs based on their content: if the content is the same, the ID should be the same.

For instance, assuming Header and Body are complex Pyndatic models where I assumed I already implement a consistent __hash__ function, before the question I tried to solve by coding something like this:

class MailObject(BaseModel):
    header: Header
    body: Body
    mail_id: Optional[str] = Field(default=None, init=False)

    def __hash__(self):
        hash_input = str((hash(self.header), hash(self.body))).encode()
        combined_hash = hashlib.sha256(hash_input).digest()
        return int.from_bytes(combined_hash, byteorder='big')

    def __init__(self, **data):
        super().__init__(**data)
        # Compute mail_id using the __hash__ method
        self.mail_id = str(self.__hash__())

If I should not use __hash__ what should you use instead to solve my goal?

attrs would be used as follows:

import attrs

@attrs.frozen
class Point:
    x: float
    y: float

p = Point(1, 2)
p.__hash__()

Sorry I don’t actually know for certain whether it will conflict with BaseModel.

Also your problem with hashes changing between sessions appears to actually be an intended feature, that you can switch off by by setting the PYTHONHASHSEED environment variable to a constant like 0

I didn’t know this, and I haven’t ever done it myself.

See the below stackoverflow for a full explanation.

Your hash requirements a bit different than the ones Python’s __hash__ protocol are designed for, so mixing the two might not be a great idea.

A good design might be to implement your own my_hash free function that handles iterating over and encoding the model fields to create a consistent hash. You could also create a mixin class or decorator for injecting a my_hash method into your model classes.

As far as the hashing method goes, here’s some related SO questions that may be of some help.

1 Like