Constant hash for None

I cannot replicate that.

$ for a in `seq 0 6`; do python3.10 -c "print(hash(None))"; done
472820
472820
472820
472820
472820
472820
472820

Compare the same for strings:

$ for a in `seq 0 2`; do python3.10 -c "print(hash('a'))"; done
7858830418497448134
3024585495073917379
-7844136121483809857

Are you sure that the hash of None changes on every run? It certainly changes on different versions of Python on different machines, but I don’t see any sign of hash randomisation in the same interpreter on the same machine.

Some further thoughts.

It is an unsafe assumption that set iteration order will depend only on hash values.

At the moment, set order depends on hash values and the history of the set, so two identically equal sets with identical objects with identical hashes could display differently, according to the sets’ histories.

>>> a = set([-2**64 + 7, 0])
>>> b = set([0])
>>> a.add(-1); a.add(-2)
>>> b.add(-1); b.add(-2)
>>> a.remove(-2**64+7)
>>> a; b
{0, -2, -1}
{0, -1, -2}

The internal order of sets also depends on the set implementation, and that could change. Since sets are explicitly documented as being unordered, with no guaranteed iteration order, the implementation is subject to change at any time without warning, including in bug fix releases.

Because sets are documented as being unordered, in theory the iteration order could even change from one call to another: str(a) == str(a) is not guaranteed to be true, even if the set a doesn’t change from one call to the next. If it happens to be true right now, that’s an accident of the implementation, not a promise made by the language.

If you are relying on tricks or hacks to guarantee set order, you are on thin ice, and they could stop working at any time.

I don’t think there should be.

If such a thing was possible, without major performance costs, why would we make it optional? We should just make sets ordered all the time.

If making sets ordered carries a large performance cost, then we should still not make it optional. That can only cause confusion and bugs:

  • “Why does the code I copied from StackOverflow or this blog not work?” (Because you aren’t running with the ordered set switch.)
  • “Why does my code take 30 minutes to run instead of 30 seconds?” (Because you are running with the ordered set switch.)

We should instead provide a separate OrderedSet, just as we provided a separate OrderedDict for many years.

That will allow developers to separate the parts of their code that need sets with predictable iteration order from those that don’t.
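
For illustration only (this is not a real OrderedSet API): dict keys already give deterministic, insertion-ordered iteration, so a rough stand-in for the idea exists today:

items = [3, 1, 2, 1, 3]
ordered_unique = dict.fromkeys(items)   # dicts preserve insertion order (guaranteed since 3.7)
print(list(ordered_unique))             # [3, 1, 2]: duplicates dropped, order deterministic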

It does for me.

(venv) ... % for a in `seq 0 6`; do python3.10 -c "print(hash(None))"; done
268969196
269474028
269586668
273993964
268827884
273534188
274033900

The history of the set will be the same every run. It is computed from the same originating iterable. Inductively all histories will stay the same as long as nothing (like a non-deterministic hash) causes them to diverge.
Of course they could diverge because of other forms of non-determinism. But to beat non-determinism we must eliminate all sources of it. This is one of them.
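
As a sketch of what I mean (CPython today, within a single interpreter run):

data = ["x", "y", None, "z"]
print(list(set(data)) == list(set(data)))   # True: same inputs, same history, same hashes, same order

# Across separate runs the order can only change if some hash changes,
# e.g. the randomized str hashes shown above, or hash(None) on an ASLR build.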

The hash changes if Python is built with address space layout randomization, a common security hardening option.
On Fedora:

$ for a in `seq 0 6`; do python3 -c "print(hash(None))"; done
8783406253026
8747164058594
8748255495138
8728900617186
8765761078242
8754318099426
8732784542690
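
Those values track the address of the None object. A rough check, assuming a 64-bit CPython that still uses the default pointer hash for None (i.e. without the proposed change):

addr = id(None)                             # hash(None) is derived from this address here
rot = (addr >> 4) | ((addr & 0xF) << 60)    # the default object hash rotates the pointer by 4 bits
if rot >= 2**63:
    rot -= 2**64                            # interpret the result as a signed Py_hash_t
print(hash(None) == rot)                    # True on such builds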

Having the hash of None be a fixed value seems harmless enough. I see no point in the complexity of having an option here. We should either do it or not.

Although note that I said “seems harmless” above. People can create exploits out of the weirdest things, so someone (with the appropriate expertise) should review such a change for security risks.

While I see no problem with giving None a well-defined hash, this does not mean I support the idea of “eliminating all sources of non-determinism”. As noted, things like address space randomisation introduce deliberate non-determinism to address security risks, and that’s far more important in any practical sense than the (unspecified) use cases for “determinism between runs”.

I didn’t mean it’s always a desired goal to get rid of all non-determinism, but sometimes it is, and an unstable hash for None (as a side effect of ASLR) is making life harder in that regard.

For many (most) use cases, it simply doesn’t matter whether None has a deterministic hash value or not.

A stable hash for None is safe from the “complexity attack” issue that plagues strings, because it is a singleton.
But, I’m far from being a security expert, so let’s hear if there’s any other potential issue with it …
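
To spell out the singleton point: no matter how many times an attacker supplies None, a dict or set only ever holds one entry for it, so there is nothing to collide:

d = {}
for _ in range(10_000):
    d[None] = "payload"        # every insert hits the same single key
print(len(d))                  # 1: a singleton cannot flood a hash table with collisions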


Another thing to consider is that hash(True) = 1 and hash(False) = 0.
None being a unit type with hash(None) = 0xbadcab1e or such isn’t that different security-wise, is it?
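
(Those bool values are fixed today because bool is an int subclass, and small ints hash to themselves:)

assert hash(False) == hash(0) == 0   # already stable across runs
assert hash(True) == hash(1) == 1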

That sounds right. I don’t think this would need a review from a dedicated security expert.
(As for ASLR – not revealing the randomized address might be a slight win, actually.)

I posted the PR: “gh-99540: Constant hash for _PyNone_Type” (python/cpython pull request #99541 on GitHub).
It is now awaiting code review.

There were some test failures (on an Azure target once, and on Win32… ones that I couldn’t repro locally), but they seem flaky; after another rebase, everything passes.

I don’t have any objections to the change, but the pitch is based on the wrong assumption that set iteration order depends on the hash. That is a CPython implementation detail that you cannot rely on.

The pitch probably should be changed to the security wins that @encukou mentioned or something else.

You are stating the wrong assumption and then pointing out that it’s wrong.

The real assumption is that if we do set operations on the same inputs in the same order, AND the hashes are the same, we will get the same iteration order on the output.
That assumption is valid (at least in CPython).

I also don’t understand why my pitch being CPython-specific is a problem. It’s a code change to CPython. I am not amending the standard to require any form of iteration order stability on builtin sets. There is no effect on other implementations of Python.

It is not a big problem, but it feels weird to me that when people look at the issue later to see why this change was implemented, they will find that it was done to satisfy people who are doing the wrong thing by relying on implementation details.


Regarding this, an OrderedSet is a stronger requirement than what I want. It requires the set to iterate in a specific order, not just any order that is deterministic given an identical history of operations.

By the way, I already wrote such a set as an adapter over dict[T, None], and it had significant perf overhead. I’m sure I could do better with a pure builtin in C, and perhaps it’s worth developing it, but it’s still probably going to be a perf tradeoff to use it. Not to mention there will be an endless debate about whether this set is worth its weight in extra core code (why hasn’t it been added all those years…?)
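
Roughly this shape, for anyone curious (a from-memory sketch of the idea, not the actual code):

from collections.abc import MutableSet
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

class OrderedSet(MutableSet[T]):
    """Set with deterministic (insertion-order) iteration, backed by dict[T, None]."""

    def __init__(self, items: Iterable[T] = ()) -> None:
        self._data: dict[T, None] = dict.fromkeys(items)

    def __contains__(self, item: object) -> bool:
        return item in self._data

    def __iter__(self) -> Iterator[T]:
        return iter(self._data)            # dict preserves insertion order

    def __len__(self) -> int:
        return len(self._data)

    def add(self, item: T) -> None:
        self._data[item] = None

    def discard(self, item: T) -> None:
        self._data.pop(item, None)

Most of the overhead in such an adapter comes from the Python-level method call on every operation, which is why a pure builtin in C would likely fare better.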

Regarding the possibility of future impl changes: Sure. Worst case is that I am forced to migrate my entire codebase to use some form of OrderedSet all throughout.
For the time being, I have an easier solution, which has zero external cost.

I’ll also add that such OrderedSet migrations require automated code inspection/editing to enforce. In a large codebase at least, it’s all too easy for someone to add a seemingly harmless set comprehension somewhere and think nothing of it.
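
For concreteness, the sort of automated check I mean (an illustrative sketch, not a real tool) is something like:

import ast
import sys

def flag_builtin_sets(path: str) -> None:
    # Report set literals and set comprehensions; calls like set(...) would need
    # extra handling and are not covered by this sketch.
    tree = ast.parse(open(path, encoding="utf-8").read(), filename=path)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Set, ast.SetComp)):
            print(f"{path}:{node.lineno}: builtin set created; use OrderedSet instead")

if __name__ == "__main__":
    for source_file in sys.argv[1:]:
        flag_builtin_sets(source_file)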

IMO it was a good suggestion to explore, but ultimately it looks like it isn’t worth it. I’m sorry if it looks like we led you down the wrong path.

Assuming that a set with the same hashes and order of construction will iterate in the same order is relying on an implementation detail that might change in the future. It doesn’t need to be a change in CPython itself – some extra caching or serialization might re-create the object and ruin the determinism, because set order is not considered important.
Making the None hash constant would lead more people down this path. That’s the external cost of the change – definitely not zero.

The change wouldn’t be CPython-specific. Doing it in CPython would either mean people would start relying on it, which would force other implementations to follow suit – or people wouldn’t rely on it, which would mean the change wasn’t useful.

The security angle (not leaking the address of None) might hold some weight, but I don’t quite know enough about security to be sure it actually applies. And it wouldn’t save you from relying on fragile implementation details.


Totally agree with this. I believe in the ordered-set thread, people were asking for real life examples.

Regarding this: I don’t expect code itself to rely on it (that would be bad - and I honestly don’t see how or why anyone would do that in practice - can you think of a non-contrived example of that…? really asking).

The use case I described was to run a certain “compute only” CLI program on a certain input, with random seeds set to a constant, and expect identical behavior every run. That behavior would be relied upon to make such runs deterministic. If this breaks in the future, the small set of people who were even aware of this possibility (and used it to their advantage) now have their stable repros not so stable.
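
A minimal picture of the kind of instability involved (assuming a build where hash(None) varies between runs, e.g. under ASLR):

import random

random.seed(0)                     # the “random seeds set to a constant” part
order = list({None, 1, 2, 3})      # where None lands depends on hash(None) % table size
print(order)                       # can differ between runs even though the inputs never change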

Too bad for them, but they are no worse off than the baseline we started from, where the hash of None is always unstable… it just means they have to turn to one of the other, more expensive ways of solving the problem (Unit or OrderedSet migrations).

Why not just do this in the first place?

Because it’s more expensive? It requires enforcing a non-standard idiom over an entire codebase.

OrderedSet especially. Like I said, in a large codebase it’s way too easy for someone to introduce a set comprehension somewhere in the calculation and defeat the entire effort.

Besides, why do something the hard way first?

More expensive than getting a change made to CPython and then waiting to run your code only on Python 3.12+?
