While that would be a useful feature to have in Jupyter, it would still be useful outside of Jupyter as well. Reproducibility of scientific code is important, but not all scientific code is run in Jupyter, and in any case there are other reasons for wanting reproducibility.
Another common case is reproducing bugs. You need to be able to reproduce something deterministically before you can use e.g. git bisect:
In that issue what happened was that somewhere something iterated over a set, and hash randomisation made the iteration order non-deterministic. That should be fine, because the SymPy code in question should compute the same result regardless of the iteration order of whichever set was being iterated over. It was not fine, though, because apparently, depending on that iteration order, the optimiser might or might not be triggered, exposing a bug in the optimiser’s rewrite rules. Happily I could control the non-determinism in that case with PYTHONHASHSEED, but it would otherwise have been much more painful to debug.
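To make the mechanism concrete, here is a minimal sketch (not from that issue; the strings are made up) showing how hash randomisation changes set iteration order between interpreter runs, and how PYTHONHASHSEED pins it down:

```python
# Run a child interpreter several times and compare set iteration order.
# With hash randomisation on (the default), the order can differ between
# runs; with PYTHONHASHSEED fixed, it is stable.
import os
import subprocess
import sys

code = "print(list({'spam', 'ham', 'eggs', 'bacon'}))"

def run(env):
    result = subprocess.run([sys.executable, "-c", code],
                            env=env, capture_output=True, text=True)
    return result.stdout.strip()

randomised = {run(dict(os.environ, PYTHONHASHSEED="random")) for _ in range(10)}
pinned = {run(dict(os.environ, PYTHONHASHSEED="0")) for _ in range(10)}
print(len(randomised) > 1)  # usually True: iteration order varies per run
print(len(pinned) == 1)     # True: a fixed seed gives a stable order
```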
That’s precisely the OP’s use case, and I agree it would be good to have better reproducibility for that case.
Making the hash of None constant is a step in that direction, but it’s not clear to me (a) if it’s the right step, or (b) if it’s sufficient. And in the absence of clear answers to those questions, I’d rather look at the bigger picture before diving in and changing things.
You ask if it’s “sufficient” but then the question is: sufficient for what?
There are some cases where this would make the difference between something being reproducible and not being reproducible, and for those cases it is sufficient. Obviously there are other cases where this would not be enough, but nothing can ever guarantee complete reproducibility in all cases, so that’s too high a bar to set.
I tend to see the situation with reproducibility as one where every little helps. Or, to put it the other way round, it only takes one bad apple to ruin determinism, so why make something non-deterministic if there is no particular reason to do so?
In the case of SymPy, sets are used extensively and SymPy expressions have structural hash functions. Those structural hash functions would be deterministic if it weren’t for hash randomisation, but that is at least controllable. Mostly it is sets of SymPy expressions that are used, but there are also things like sets of integers and sometimes sets like {None, False, True}. As far as I know, the only object used in sets throughout the codebase whose hash is based on id is None. What this means is that it is possible to have a large codebase where hash(None) is potentially the only source of uncontrollable non-determinism.
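As a rough illustration of what “structural hash” means here, a toy sketch (not SymPy’s actual implementation): the hash is computed from the expression’s contents rather than its identity, so it is deterministic once string hash randomisation is controlled:

```python
class Expr:
    """Toy structural-hash expression: equal structure implies equal hash."""
    def __init__(self, head, *args):
        self.head = head
        self.args = args
    def __eq__(self, other):
        return (isinstance(other, Expr)
                and self.head == other.head
                and self.args == other.args)
    def __hash__(self):
        # computed from contents, not id(); deterministic under a fixed
        # PYTHONHASHSEED, provided no argument (e.g. None) hashes by id
        return hash((self.head, self.args))

assert Expr("Add", 1, 2) == Expr("Add", 1, 2)
assert hash(Expr("Add", 1, 2)) == hash(Expr("Add", 1, 2))
```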
This is great, thank you, but perhaps there is another point to consider here.
From what I see, at least among the researchers in my own org, it is relatively well known that (assuming you want deterministic runs in your compute workload; a sketch of both points follows the list):
1. identity hashing has to be avoided (in any language)
2. random seeds need to be fixed (in any language)
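A minimal sketch of those two points (the `Node` class and names are hypothetical):

```python
import random

random.seed(0)  # (2) fix the random seed so runs repeat exactly

class Node:
    """(1) avoid identity hashing: hash by value rather than by id()."""
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        return isinstance(other, Node) and self.name == other.name
    def __hash__(self):
        return hash(self.name)  # value-based, unlike default object.__hash__

# two distinct instances compare and hash equal, so set/dict behaviour
# does not depend on allocation addresses
assert Node("a") == Node("a") and hash(Node("a")) == hash(Node("a"))
```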
The issue with bytes/str hash randomization and the PYTHONHASHSEED fix is far less known than (1) and (2); I knew about it when I joined, and I remember at least some of those people being surprised when I told them about it.
Some of them had the misconception that it was the set implementation that was causing the non-determinism (as I imagine most people do when they see non-deterministic behavior from keys with optional fields; they won’t realize it’s None causing it).
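A sketch of how that misdiagnosis arises (made-up records):

```python
# Records keyed by (user, optional_score).  The strings hash
# deterministically under a fixed PYTHONHASHSEED, but hash(None) has
# historically been derived from the object's address, so the set's
# iteration order can still vary between runs on systems with address
# space randomisation -- and the set implementation gets the blame.
records = {("alice", None), ("bob", 42), ("carol", None)}
for user, score in records:
    print(user, score)  # order not guaranteed stable across runs
```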
At least one person there did huge refactors in which graphs (whose nodes were hashed as strings) were reconstructed in various places using dicts into which nodes were inserted in some manually dictated order, and similar complications that should never have existed.
You don’t hear from such people here; the majority of programmers and scientists aren’t going to get to the bottom of such problems and then fight the holy wars in open-source forums needed to make it less of a hazard for the next person.
I don’t know. But we’re not talking about “making something non-deterministic”, we’re talking about making it deterministic when it’s currently not guaranteed to be deterministic, even though it is in a lot of cases. The hash of None is always -9223363242512554292 on my 64-bit Windows installation of Python 3.11.0. As far as I know, the only time it’s not deterministic is on Unix systems with address space randomisation switched on. A non-trivial proportion of systems, yes. But far from all of them.
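That is easy to check empirically on a given machine; a small sketch (the outcome depends on platform, ASLR settings, and Python version):

```python
# Collect hash(None) from several fresh interpreter runs.
import subprocess
import sys

values = {subprocess.run([sys.executable, "-c", "print(hash(None))"],
                         capture_output=True, text=True).stdout.strip()
          for _ in range(5)}
print(values)  # a single element means hash(None) was stable here
```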
I really don’t care an awful lot here. There was a PR. It was rejected. A core dev needs to care enough to override the rejection. That core dev won’t be me. If this was part of a set of changes that “made deterministic reproduction of bugs significantly easier”, then maybe it would be me. I can’t give you a precise set of criteria for what would persuade me. All I can say is that changing the hash of None by itself isn’t enough (for me).