Dataclasses: add a cache_hash option, similar to attrs'

The attrs package provides a feature to cache the generated hash of its dataclass-like objects. This is, I believe, the only feature dataclasses is still missing compared to attrs, so I propose adding such a feature to the dataclasses module as well.

Even for modestly sized dataclass objects and low numbers of calls to hash(), this can provide a meaningful speedup.
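
For reference, attrs' version of the feature looks roughly like this (a sketch using the modern attrs API; the class and fields are made up):

import attrs

# cache_hash=True stores the hash on the instance after the first call.
# @attrs.frozen implies eq=True and frozen=True, so __hash__ is generated.
@attrs.frozen(cache_hash=True)
class Point:
    x: int
    y: int

p = Point(1, 2)
hash(p)  # computed on the first call...
hash(p)  # ...served from the cache afterwards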

Consider the following test code:

from dataclasses import dataclass
from time import monotonic

# No hash caching
@dataclass(frozen=True)
class UncachedHash:
    f0: int
    f1: int
    f2: int
    f3: int
    f4: int

# With hash caching
# If such a feature were added, it would look like:
# @dataclass(frozen=True, cache_hash=True)
@dataclass(frozen=True)
class CachedHash:
    f0: int
    f1: int
    f2: int
    f3: int
    f4: int

    def __hash__(self):
        if hasattr(self, "_hash"):
            return self._hash

        # Similar to the code generated by dataclasses._hash_add
        self_tuple = (self.f0, self.f1, self.f2, self.f3, self.f4)

        object.__setattr__(self, "_hash", hash(self_tuple))
        return self._hash

for it in (1, 2, 5, 10, 100, 1000, 10000, 100000):
    a = UncachedHash(0, 1, 2, 3, 4)
    b = CachedHash(6, 7, 8, 9, 10)

    print(it, end="\t")

    t_uncached = 0.0
    t_cached = 0.0

    # Time `it` hash() calls, repeated 100 times, without caching
    for _ in range(100):
        time_start = monotonic()
        for _ in range(it):
            hash(a)
        time_end = monotonic()

        t_uncached += time_end - time_start

    for _ in range(100):
        # Clear the cache so every repetition measures one cold hash()
        # followed by (it - 1) cached calls
        if hasattr(b, "_hash"):
            object.__delattr__(b, "_hash")

        time_start = monotonic()
        for _ in range(it):
            hash(b)
        time_end = monotonic()

        t_cached += time_end - time_start

    print(t_uncached, "\t", t_cached)

On an M1 Mac, this produces the following timings:

Iterations    Time Uncached [s]    Time Cached [s]
1             0.00004171           0.00005876
2             0.00006178           0.00008825
5             0.0001222            0.0001278
10            0.0002145            0.0001951
100           0.001908             0.001596
1000          0.01827              0.01402
10000         0.1611               0.1342
100000        1.633                1.339

That is, caching the hash starts to pay off after roughly 5-10 calls to hash().

I am happy to draft a PR if this is of interest.

Of course, the cached hash must not be included in the pickled state, since hash values (e.g. of strings, due to hash randomization) can change across Python invocations.
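
A minimal sketch of what that could look like, assuming the cached value is stored under a _hash attribute as in the example above:

def __getstate__(self):
    # Drop the cached hash from the pickled state so that it is
    # recomputed after unpickling (string hashes, for instance,
    # change between runs because of hash randomization).
    state = dict(self.__dict__)
    state.pop("_hash", None)
    return state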

Would this be allowed only on immutable data classes or on all?

That’s a good question, thanks. This option should only apply when the dataclasses module generates the hash function automatically, not when the dataclass defines a custom __hash__ itself, and not when the default object.__hash__ implementation is used. In other words, it should apply only when frozen=True (with eq=True, which is the default), or when unsafe_hash=True. In all other cases, setting cache_hash=True should raise an error. This matches attrs' behavior.
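
To make the rule concrete, here is a hypothetical sketch of the proposed behavior (cache_hash does not exist today):

from dataclasses import dataclass

@dataclass(frozen=True, cache_hash=True)       # OK: __hash__ is generated
class A:
    x: int

@dataclass(unsafe_hash=True, cache_hash=True)  # OK: __hash__ is generated
class B:
    x: int

@dataclass(cache_hash=True)  # should raise: eq=True, frozen=False means
class C:                     # __hash__ is set to None, nothing to cache
    x: int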

To play devil’s advocate, why not just always cache the hash if the dataclass is frozen?

Quoting from the attrs thread that added this feature (Caching hashcodes · Issue #423 · python-attrs/attrs · GitHub):

I wouldn’t be opposed. However, caching the value carries a small cost of its own (extra storage plus the attribute check on every call), which not everyone may want to pay. Also, some frozen dataclasses may contain mutable objects in their fields whose hash values can change when those objects are mutated.
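
To illustrate the second point with a contrived example (MutableKey is a made-up class whose hash depends on mutable state):

from dataclasses import dataclass

class MutableKey:
    # Hashable, but the hash depends on mutable state: legal, if unwise.
    def __init__(self, items):
        self.items = list(items)
    def __hash__(self):
        return hash(tuple(self.items))
    def __eq__(self, other):
        return isinstance(other, MutableKey) and self.items == other.items

@dataclass(frozen=True)
class Holder:
    key: MutableKey

h = Holder(MutableKey([1, 2]))
print(hash(h))
h.key.items.append(3)  # the dataclass is frozen, but its field is not
print(hash(h))         # almost certainly differs; a cached hash would be stale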
