hash(None) Mk.2

yonillasky · November 27, 2022, 5:24pm

I don’t mean to spam the forum, but I was asked to post about it here again.

[NOTE: I sent this exact message below to the python-dev mailing list]

A proposal to modify `None` so that it hashes to a constant

I wrote a doc stating my case here:

Briefly,

- The main motivation for it is to allow users to get a predictable result on a given input (for programs that are doing pure compute, in domains like operations research / compilation), any time they run their program. Having stable repro is important for debugging. Notebooks with statistical analysis are another similar case where this is needed: you might want other people to run your notebook and get the same result you did.
- The reason the hash non-determinism of None matters in practice is that it can infect commonly used mapping key types, such as frozen dataclasses containing Optional[T] fields.
- Non-determinism emerging from other value types like str can be disabled by the user using PYTHONHASHSEED, but there’s no such protection against None.

All it takes is for your program to compute a set somewhere with affected keys, and iterate on it - and determinism is lost.
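
A minimal sketch of how this can surface, assuming a hypothetical `Choice` key type and `pick_best` helper (neither is from the linked doc). Ties are broken by set iteration order, which depends on the hashes of the keys, and those hashes mix in `hash(None)` through the `Optional` field:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Choice:
    """Hypothetical key type; frozen + eq makes it hashable by its fields."""
    name: str
    parent: Optional[str]  # None here feeds hash(None) into hash(Choice)
    score: int


def pick_best(choices: set) -> Choice:
    # max() returns the first maximal element it meets, so a tie is
    # resolved by whatever order the set happens to iterate in.
    return max(choices, key=lambda c: c.score)


candidates = {
    Choice("a", None, 3),
    Choice("b", None, 3),  # ties with "a"
    Choice("c", "a", 1),
}
best = pick_best(candidates)
print(best.name)  # "a" or "b"; which one may differ between runs
```

Within a single run the answer is stable. Whether it is stable across runs depends on the interpreter version, the platform, and the hash seed.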

The need to modify None itself is caused by two factors:

- Optional being implemented effectively as T | None in Python, as a strongly established practice.
- The fact that __hash__ is an intrinsic property of a type in Python: the hashing function cannot be externally supplied to the builtin container types. So we have to modify the type of None itself, rather than write some alternative hasher that we could use if we care about deterministic behavior across runs.

This was debated at length over the forum and in discord.
I also posted a PR for it, and it was closed, see:

github.com/python/cpython

Constant hash value for None to aid reproducibility

opened 06:55PM - 16 Nov 22 UTC

closed 08:36PM - 16 Dec 22 UTC

yonillasky

type-feature

# Feature or enhancement

Fix `hash(None)` to a constant value.

# Pitch

…(Updated 2022.11.18)

- Under current behavior, the runtime leaks the ASLR offset, since the original address of the `None` singleton is fixed and `_Py_HashPointerRaw` is reversible. Admittedly, there are other similar objects, like `NotImplemented` or `Ellipsis`, that also have this problem and need to be similarly fixed.
- Because of ASLR, `hash(None)` changes every run; that consequently means the hash of many useful "key" types changes every run, particularly tuples, NamedTuples and frozen dataclasses that have `Optional` fields.
- The other source of hash value instability across runs in common "key" types, like str or Enum, can be fixed using the `PYTHONHASHSEED` environment var.
- Other singletons commonly used as (or as part of) mapping keys, `True` and `False`, already have fixed hash values.

CPython's builtin set classes, as do all other non-concurrent hash-tables, either open or closed, AFAIK, grant the user a certain stability property. Given a specific sequence of initialization and subsequent mutation (if any), and given specific inputs with certain hash values, if one were to "replay" it, the result set will be in the same observable state every time: not only will it have the same items (correctness), but they will also be retrieved from the set in the same order when iterated.

This property means that code that starts out with identical data, performs computations and makes decisions based on the results will behave identically between runs. For example, suppose that, based on some mathematical properties of the input, we have computed a set of N valid choices, they are given integer scores, and then we pick the first choice that has maximal score. If the set guarantees the property described above, we are also guaranteed that the exact same choice will be made every time this code runs, even in case of ties. This is very helpful for reproducibility, especially in complex algorithmic code that makes a lot of combinatorial decisions of that kind.

There is a counterargument that we should simply just offer `StableSet` and `StableFrozenSet` that guarantee a specific order, the same way that `dict` does. A few things to note about that:

- I've written such set classes as an adapter over `dict[T, None]`; there is a substantial perf overhead to that.
- Is it worth the extra "weight" in code inside the core? That's suspect - why hasn't it been added all those years?
- In a large codebase, it requires automated code inspection and editing tools to enforce this. It's all too easy, and natural, to add a seemingly harmless set comprehension somewhere and defeat the whole effort.
- The insertion-order-as-iteration-order guarantee is stronger than what we actually require in order to have the "reproducibility" property I've described, so we're paying extra for something we don't really need.

My PR makes a small change to CPython, in `objects.c`, that sets the `tp_hash` descriptor of `NoneType` to a function that simply returns a constant value.

Admittedly, determinism between runs isn't a concern that most users/programs care about. It is rather niche. However, I argue that still, there is no externalized cost to this change.

# Previous discussion

https://discuss.python.org/t/constant-hash-for-none/21110

### Linked PRs

* gh-99541
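
For readers unfamiliar with the `dict[T, None]` adapter mentioned in the issue, here is a rough sketch of what such a class could look like. This is an illustration only, not the author's implementation; the class body and method choices are assumptions. It relies on `dict` preserving insertion order and on the `MutableSet` ABC to supply the remaining set operations:

```python
from collections.abc import Iterable, Iterator, MutableSet
from typing import Optional


class StableSet(MutableSet):
    """Insertion-ordered set backed by a dict (hypothetical sketch)."""

    def __init__(self, items: Optional[Iterable] = None) -> None:
        # dict.fromkeys keeps the first occurrence of each key, in order.
        self._items = dict.fromkeys(() if items is None else items)

    def __contains__(self, item: object) -> bool:
        return item in self._items

    def __iter__(self) -> Iterator:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)

    def add(self, item) -> None:
        self._items[item] = None

    def discard(self, item) -> None:
        self._items.pop(item, None)

    def __repr__(self) -> str:
        return f"StableSet({list(self._items)!r})"


s = StableSet([3, 1, 2, 1])
s.add(None)
s.discard(1)
print(list(s))  # [3, 2, None], regardless of hash values
```

Iteration order here depends only on the history of insertions and removals, which is the stronger guarantee the issue says it does not strictly need.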

Asking for opinions, and to re-open the PR, provided there is enough support for such a change to take place.

pf_moore · November 27, 2022, 10:31pm

To be honest, at this point I think you’re wasting your time (and other people’s). I recommend you let this drop.

yonillasky · November 27, 2022, 10:33pm

That makes sense.
can I delete the topic?

Rosuav · November 27, 2022, 10:39pm

Only because there’s a general tendency to yell at ideas until they go away, which isn’t exactly a good policy, but it’s an effective way of wielding “status quo wins a stalemate” to win arguments. See the previous thread for details, or look at any of myriad other places where an idea was killed the same way.

yonillasky · November 27, 2022, 10:46pm

Yes, and given this is policy-driven, there is no debate to be had. He is correct: it’s just wasting everyone’s time.

Rosuav · November 27, 2022, 10:58pm

The policy is “status quo wins a stalemate”. Not “status quo wins any debate if we can just tire out everyone else”. But whatever, if you’re not going to push it, probably nobody will.

pf_moore · November 27, 2022, 10:58pm

Not only because of that. The issue and PR the OP raised were closed by a core developer with an explanation of their reasoning. No-one suggested there that the OP re-open the discussion here, so I’m not clear where they are referring to with “I was asked to post about it here again”.

Honestly, I’m mostly indifferent to this change myself; it seems small and relatively innocuous. But conversely, it doesn’t seem sufficiently important to warrant overturning another core dev’s decision.

Yes, ideas do sometimes get shouted down in favour of the status quo. I’m more worried when significant proposals with decent use cases get shouted down.

Rosuav · November 27, 2022, 11:01pm

Oh, sorry. I only barely skimmed the issue and assumed that someone HAD asked for it to be re-discussed here, else that wouldn’t have been said in the OP.

yonillasky · November 27, 2022, 11:02pm

There was an extended discussion of this on discord, someone there insisted that I open another thread.
No matter. We can let it die now.

Rosuav · November 27, 2022, 11:03pm

Exactly. The same “shout till people go away” strategy has been used on a number of proposals. It’s like posting an idea is treated as a gauntlet that has to be run - not of technical merit, but of pure endurance. An idea is worth implementing only if someone has the fortitude to keep driving it despite massive opposition.

pf_moore · November 27, 2022, 11:06pm

I agree that is bad. But it’s equally bad if people propose ideas with little or no justification, and no appreciation of the costs of their proposal. Making people “run the gauntlet” is a bad way of weeding out bad proposals, but we don’t seem very good at finding a better way (I’ve seen far too many “I think it would be neat if Python did such-and-such” proposals that wasted way too much of people’s time because they didn’t get shut down fast enough or effectively enough).

yonillasky · November 27, 2022, 11:08pm

Pardon me, but what IS the cost of implementing my proposal?
You keep saying “there is always a cost”. Seems almost like a mystical belief. Do you perhaps care to elaborate… since I see this discussion isn’t dying out?

Why not simply explain what the cost actually is, and then reject the proposal based on that

pf_moore · November 27, 2022, 11:13pm

It was discussed in the last thread, and mentioned in the reasons for closing the issue. I don’t plan on repeating it here, sorry.

yonillasky · November 27, 2022, 11:24pm

Ah yes, people will rely on it
No one knows how or why but they will rely on it

OK… the thread can die now
I will stop pestering you guys

steven.daprano · November 27, 2022, 3:29am

Some costs are one-off up-front costs. Some are on-going costs.

- In this specific example, address randomization is used for security reasons. Is it safe to change the hash of None to not rely on its address? Don’t know. A security expert will need to consider it.
- Somebody has to implement it.
- Somebody has to document it.
- Somebody has to write tests for it.
- Those tests have to run every time we run the test suite, and on the CI server.
- Now it is a feature of the language that every other Python interpreter has to implement.
- And that people have to learn.
- Every extra branch or line of code adds more places that bugs can occur.
- In most cases, new features add code to the interpreter, making it bigger. If that feature isn’t generally useful, it becomes just bloat.
- Once we’ve given None a constant hash, what about other singletons? In a month, or a year, somebody will be back with a feature request to make Ellipsis or NotImplemented behave like None.
- And every new feature comes with some risk: if we make a mistake, some unforeseen problem occurs because of this (I can’t see what that might be, but that’s the problem with unforeseen problems) we’re stuck with it for a long deprecation period before we can remove it.

These costs might be small. Okay, that’s great! That is a point in your proposal’s favour. But balanced against the (hypothetically) small costs is that the benefit is likewise small.

As I pointed out in the other thread you have at least three options for avoiding this issue, and even if we agree to the proposal you won’t get your ultimate aim – consistent set iteration order across separate runs.

You might get something which looks like consistent set order by accident, but it will be unsafe and could be broken at any time. In a year, you’ll be back complaining that despite None’s consistent hash, a bug-fix point release broke your set order consistency, and we’ll say “We told you so!” and then we’ll need to have another forty or fifty post thread about why sets are unordered and why it’s okay to change set iteration order in a bug fix release.

So we have to make a judgement. Small cost, versus small benefit. Which wins? If the cost is higher than the benefit, then this proposal makes Python worse rather than better. If you could find even one single core developer who is willing to champion a PEP, you might have a chance.

Or it might be that in a month or a year or five years, some core developer will be annoyed enough by None hashing under address randomization that he or she will just go ahead and implement it, vindicating your position, and you can come back and tell us “Told you so!”.

The process is not perfect, and it is often annoyingly conservative. I’ve seen many proposals get rejected for many years (e.g. the ternary if operator) until something changes and we suddenly accept it.

But that’s the thing: errors of omission (failed to add a good feature) are much less important than errors of commission (added a bad feature where the costs are higher than the benefit).

yonillasky · November 28, 2022, 7:01am

At least you are making some sort of argument that I can respond to. You really should have started with that.

If an operation returns a constant result (as can be observed from the source code, which is open), running it by definition confers no information to an attacker. I don’t need to be a security expert to know that. If anything, it is the default object hash being returned on statically allocated objects that’s the security risk, since it basically tells you where in memory the Python binary was loaded into
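
To illustrate the point about address-based hashes, here is a small sketch. It assumes CPython, where `id()` returns the object's address and the default pointer hash is the address rotated right by 4 bits; both are implementation details, and `address_from_pointer_hash` is a name made up for this example. A plain `object()` stands in for a statically allocated singleton:

```python
import struct

POINTER_BITS = struct.calcsize("P") * 8  # 64 on typical builds
MASK = (1 << POINTER_BITS) - 1


def address_from_pointer_hash(h: int) -> int:
    """Undo a rotate-right-by-4 pointer hash (assumed CPython scheme)."""
    y = h & MASK  # reinterpret the signed hash as an unsigned word
    # Rotating left by 4 reverses the rotate-right-by-4.
    return ((y << 4) | (y >> (POINTER_BITS - 4))) & MASK


probe = object()  # plain objects use the default, address-based hash
recovered = address_from_pointer_hash(hash(probe))
print(hex(recovered), hex(id(probe)))  # expected to match on CPython
```

If the two values match, anyone who can observe such a hash can recover the object's address, which is the leak described above.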

> Somebody has to implement it.

I did.

> Somebody has to document it.

Not too sure about that. I mean I wrote a line about it in the blurb. But since no one depends on this behavior for their code before or after the change, nothing bad would happen if we don’t tell them about it. It is also not a change to the requirements.

> Somebody has to write tests for it.

How? None is only equal to itself, right? So as far as requirements go, its hash can be any value that stays constant throughout the run. Pretty sure a literal int32_t constant does that. The only test we could write given the requirements is an assertion that hash(None) == hash(None), which is tautologically correct for a literal constant.

> Those tests have to run every time we run the test suite, and on the CI server.

I don’t think there is value in running code that checks tautologies in CI.
Do you have tests today that check hash(None) == hash(None)? No? But what if there’s a bug in Py_HashPointer? It actually does something less trivial than to return a constant.
If you don’t test it now, why test it after the change?
If you do test it now, what other tests are needed?

> Now it is a feature of the language that every other Python interpreter has to implement.

I disagree. There is no change to the requirements. They can implement it however they like.

> And that people have to learn.

No, people can stay oblivious to it.
People also have to learn that hash(None) is not a constant. Some are very surprised by it. I know I was, and I’ve talked with enough other people to tell you I wasn’t the only one that was surprised.

To a Python dev who knew about it for 10 years now I’m sure it seems like an obvious thing, though.

> Every extra branch or line of code adds more places that bugs can occur.

This is true for any change including mine. Hard to see where in a return constant function we could hide a bug, but in general you are right.

> In most cases, new features add code to the interpreter, making it bigger. If that feature isn’t generally useful, it becomes just bloat.

Something like 3 lines of bloat, but yes, the cost is not zero.

> Once we’ve given None a constant hash, what about other singletons? In a month, or a year, somebody will be back with a feature request to make Ellipsis or NotImplemented behave like None.

None is different from the others because of how Optional is defined.
Not that I even think it’s bad for other sentinel values to hash to constants. They probably should, it just doesn’t matter in practice.

> And every new feature comes with some risk: if we make a mistake, some unforeseen problem occurs because of this (I can’t see what that might be, but that’s the problem with unforeseen problems) we’re stuck with it for a long deprecation period before we can remove it.

There is no change to the requirements, therefore as a special case of that, no new feature.

But I also understand the general sentiment here, that’s why this thread should die. You will go on believing whatever it is you want to believe. This is a terrible change. These are not the droids you were looking for.

yonillasky · November 28, 2022, 7:21am

I understand that and am willing to take my chances. I explain why in the doc that I’ve linked to in the OP.
Pasting it below.

But if Python one day decides to make set iteration behavior non-deterministic, even given fixed operations history and fixed hash values, this change will become pointless

A few counter points,

That’s a hypothetical future scenario.
This line of reasoning only works if no mechanism remains in the language to prevent such non-determinism from applying (i.e. no facility like PYTHONHASHSEED to disable it). I strongly doubt such a sweeping change will be applied by force.
With the exception of super high performance, concurrent hash map implementations, I don’t know of any language where standard containers have this property. Even in languages that have such containers, there are alternatives that offer deterministic behavior.

Rosuav · November 28, 2022, 7:23am

This can ALREADY HAPPEN with integers. You’ve never explained why it’s such a horrific problem for None to have a constant hash, yet absolutely fine for integers to hash to themselves. Whether your argument is security, consistency, or anything else, start by figuring out that one.
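
For reference, a quick check of the integer behaviour being described, as it works in CPython (the variable name `P` is just for this example):

```python
import sys

P = sys.hash_info.modulus  # 2**61 - 1 on typical 64-bit builds

print(hash(42))                  # 42: small ints hash to themselves
print(hash(True), hash(False))   # 1 0: bools are ints
print(hash(P + 5))               # 5: larger ints are reduced modulo P
print(hash(-1))                  # -2: -1 is reserved as an error marker
```

None of these values depend on `PYTHONHASHSEED` or on where anything is loaded in memory.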

steven.daprano · November 27, 2022, 1:22pm

I started by pointing out that for most users of Python, None’s hash is already constant. You have never explained why you want address randomisation switched on, but without the consequences of address randomisation.

You have written hundreds of words complaining about an optional feature that nobody is forcing you to use.

- if you don’t want address randomisation, don’t use it;
- if you don’t want your hash to depend on None, write your hash function so it doesn’t depend on None.

If there is some reason why you need address randomisation and you need to mix the hash of None into your hash function, you should tell us.

The potential security risk is not whether or not the hash value is visible by reading the source code. The potential security risk is that in circumstances where users are expecting address randomisation, None will hash to a constant known value instead of an unpredictable value.

Is that a security risk? I don’t know, I never would have guessed in a million years that prior to Python 3.2, the hashing function used for strings was a serious security risk. I’m not an expert. Neither are you, so I’m not going to accept your assurances that this is fine.

Address randomisation is used for security reasons so if we change something relying on that address randomisation, it is prudent to get a security expert to check it out first.

No one depends on this feature? Not even you? Then why do you care so much about it?

Suppose we agreed to implement this feature in 3.12.0a2. Great, you’re happy, now you can use it. And then, because it’s not a documented language feature, it’s just an implementation detail that “no one depends on”, it will be fine to remove it again in 3.12.0b. Right?

Probably not.

And that’s why it needs to be documented and tested as a language feature, otherwise it’s just an implementation detail that any core developer can change at any time.

And if they do, you will be back here complaining that your code depends on that undocumented implementation detail and we broke your code by changing it.

Just like you have been complaining now.

That assertion is already true today, even with address randomisation.

The assertion that you actually need is that the hash value stays constant across multiple runs.
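
A rough sketch of what such a cross-run check could look like, using only the standard library. The helper names are made up for this example, and whether the check passes depends on the interpreter version and platform:

```python
import subprocess
import sys


def hash_of_none_in_fresh_interpreter() -> int:
    """Start a new interpreter and report its hash(None)."""
    result = subprocess.run(
        [sys.executable, "-c", "print(hash(None))"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


def none_hash_is_stable(runs: int = 3) -> bool:
    """True if every fresh interpreter reported the same value."""
    values = {hash_of_none_in_fresh_interpreter() for _ in range(runs)}
    return len(values) == 1


print(none_hash_is_stable())
```

Unlike `hash(None) == hash(None)` inside one process, this compares values from separate processes, so it can actually fail.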

Rosuav · November 28, 2022, 10:33am

Address randomization is on by default, isn’t it?

rosuav@sikorsky:~$ python3 -c 'print(hash(None))'
5889358057582
rosuav@sikorsky:~$ python3 -c 'print(hash(None))'
5887077081966
rosuav@sikorsky:~$ python3 -c 'print(hash(None))'
5907908226158
rosuav@sikorsky:~$ python3 -c 'print(hash(None))'
5929608775022
rosuav@sikorsky:~$ python3 -c 'print(hash(None))'
5911443765614

So disabling it would require building Python especially for the process, and that is easily able to introduce unintended errors.
