The `attrs` package provides a feature to cache the generated hash of its dataclass-like objects. This is, I believe, the only feature that `dataclasses` is still missing compared to `attrs`. I propose adding such a feature to the `dataclasses` module as well.
Even for modestly sized dataclass objects and a low number of calls to `hash()`, this can provide a meaningful speedup. Consider the following test code:
```python
from dataclasses import dataclass
from time import monotonic


# No hash caching
@dataclass(frozen=True)
class UncachedHash:
    f0: int
    f1: int
    f2: int
    f3: int
    f4: int


# With hash caching.
# If such a feature were added, it would look like:
# @dataclass(frozen=True, cache_hash=True)
@dataclass(frozen=True)
class CachedHash:
    f0: int
    f1: int
    f2: int
    f3: int
    f4: int

    def __hash__(self):
        if hasattr(self, "_hash"):
            return self._hash
        # Same computation as in dataclasses._hash_add
        self_tuple = (self.f0, self.f1, self.f2, self.f3, self.f4)
        object.__setattr__(self, "_hash", hash(self_tuple))
        return self._hash


for it in (1, 2, 5, 10, 100, 1000, 10000, 100000):
    a = UncachedHash(0, 1, 2, 3, 4)
    b = CachedHash(6, 7, 8, 9, 10)
    print(it, end="\t")
    t_uncached = 0.0
    t_cached = 0.0
    for _ in range(100):
        time_start = monotonic()
        for _ in range(it):
            hash(a)
        time_end = monotonic()
        t_uncached += time_end - time_start
    for _ in range(100):
        # Reset the cache so each repetition starts cold
        if hasattr(b, "_hash"):
            object.__delattr__(b, "_hash")
        time_start = monotonic()
        for _ in range(it):
            hash(b)
        time_end = monotonic()
        t_cached += time_end - time_start
    print(t_uncached, "\t", t_cached)
```
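As a quick sanity check (separate from the benchmark above): the generated `__hash__` of a frozen dataclass hashes the tuple of its fields, so a hand-cached hash computed from the same tuple agrees with it. A minimal illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: int
    y: int


# The generated __hash__ hashes the tuple of field values,
# which is exactly what the caching __hash__ above computes.
p = Point(1, 2)
assert hash(p) == hash((1, 2))
```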
On an Apple M1 Mac, this produces the following timings:
| Iterations | Time uncached [s] | Time cached [s] |
|---|---|---|
| 1 | 0.00004171 | 0.00005876 |
| 2 | 0.00006178 | 0.00008825 |
| 5 | 0.0001222 | 0.0001278 |
| 10 | 0.0002145 | 0.0001951 |
| 100 | 0.001908 | 0.001596 |
| 1000 | 0.01827 | 0.01402 |
| 10000 | 0.1611 | 0.1342 |
| 100000 | 1.633 | 1.339 |
That is, after roughly 5-10 calls to `hash()`, caching the hash becomes beneficial.
I am happy to draft a PR if this is of interest.
Of course, the cached hash must not be included when pickling, since the hash might change across Python invocations.
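One way this could be handled is to drop the cached value in `__getstate__` so it is recomputed after unpickling. This is only a sketch of the idea; the `_hash` attribute name and the manual `__hash__` are illustrative, not the proposed implementation:

```python
import pickle
from dataclasses import dataclass


@dataclass(frozen=True)
class Cached:
    x: int

    def __hash__(self):
        if "_hash" not in self.__dict__:
            object.__setattr__(self, "_hash", hash((self.x,)))
        return self._hash

    def __getstate__(self):
        # Drop the cached hash so it is recomputed after unpickling,
        # where it might not have the same value.
        state = dict(self.__dict__)
        state.pop("_hash", None)
        return state

    def __setstate__(self, state):
        # Bypass the frozen-instance __setattr__ when restoring state.
        for key, value in state.items():
            object.__setattr__(self, key, value)


c = Cached(42)
hash(c)  # populates the cache
c2 = pickle.loads(pickle.dumps(c))
assert "_hash" not in c2.__dict__  # the cache was not pickled
assert hash(c2) == hash(c)         # recomputed on demand
```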
This option should only apply when the `dataclasses` module adds an automatic hash function (as opposed to a custom `__hash__` defined in the dataclass itself, or the default `object.__hash__` implementation). In other words, only when `frozen=True` (and `eq=True`, which is the default), or when `unsafe_hash=True`. In all other cases, setting `cache_hash=True` should raise an error. This is the same behavior as in `attrs`.