hash(None) Mk.2

I don’t mean to spam the forum, but I was asked to post about it here again.

NOTE: I sent this exact message below to the python-dev mailing list.

A proposal to modify None so that it hashes to a constant


I wrote a doc stating my case here:

Briefly,

  1. The main motivation is to let users get a predictable result for a given input every time they run their program (for programs doing pure compute, in domains like operations research or compilation). Stable reproduction matters for debugging. Notebooks with statistical analysis are a similar case: you might want other people to run your notebook and get the same result you did.

  2. The reason the hash non-determinism of None matters in practice is that it can infect commonly used mapping key types, such as frozen dataclasses containing Optional[T] fields.

  3. Non-determinism arising from other value types such as str can be disabled by the user via PYTHONHASHSEED, but there’s no such protection against None.

All it takes is for your program to build a set of affected keys somewhere and iterate over it; determinism is lost.
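
A minimal sketch of the failure mode (the class and field names are made up for illustration; this assumes CPython, where hash(None) is derived from None’s address):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class Job:                       # hypothetical key type
        priority: int
        deadline: Optional[int]      # a None here pulls hash(None) into the key's hash

    jobs = {Job(1, None), Job(2, 7), Job(3, None)}
    # The generated __hash__ mixes in hash(None), which comes from None's
    # address, so this iteration order can change from run to run, even with
    # PYTHONHASHSEED=0 (that setting only covers str, bytes and datetime).
    print([j.priority for j in jobs])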

The need to modify None itself comes from two factors:

  • Optional[T] being implemented effectively as T | None, which is strongly established practice in Python

  • The fact that __hash__ is an intrinsic property of a type in Python: a hashing function cannot be supplied externally to the builtin container types. So we have to modify None itself, rather than write some alternative hasher that we could opt into when we care about deterministic behavior across runs.
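
For illustration, the closest available substitute for an external hasher is a wrapper key type, which then has to be threaded through every call site (the class below is hypothetical, not something from the stdlib):

    class StableKey:
        """Hypothetical wrapper that swaps hash(None) for a fixed constant."""
        def __init__(self, value):
            self.value = value
        def __eq__(self, other):
            return isinstance(other, StableKey) and self.value == other.value
        def __hash__(self):
            return 0 if self.value is None else hash(self.value)

    # Every producer and consumer of the container now has to wrap and unwrap:
    seen = {StableKey(None), StableKey(42)}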

This was debated at length on the forum and on Discord.
I also posted a PR for it, and it was closed, see:

I’m asking for opinions, and for the PR to be re-opened, provided there is enough support for such a change.

To be honest, at this point I think you’re wasting your time (and other people’s). I recommend you let this drop.

3 Likes

That makes sense.
Can I delete the topic?

Only because there’s a general tendency to yell at ideas until they go away, which isn’t exactly a good policy, but it’s an effective way of wielding “status quo wins a stalemate” to win arguments. See the previous thread for details, or look at any of the myriad other places where an idea was killed the same way.

1 Like

Yes, and given this is policy-driven, there is no debate to be had and he is correct: it’s just wasting everyone’s time.

The policy is “status quo wins a stalemate”. Not “status quo wins any debate if we can just tire out everyone else”. But whatever, if you’re not going to push it, probably nobody will.

1 Like

Not only because of that. The issue and PR the OP raised were closed by a core developer with an explanation of their reasoning. No-one suggested there that the OP re-open the discussion here, so I’m not clear what they are referring to with “I was asked to post about it here again”.

Honestly, I’m mostly indifferent to this change myself; it seems small and relatively innocuous. But conversely, it doesn’t seem sufficiently important to warrant overturning another core dev’s decision.

Yes, ideas do sometimes get shouted down in favour of the status quo. I’m more worried when significant proposals with decent use cases get shouted down.

3 Likes

Oh, sorry. I only barely skimmed the issue and assumed that someone HAD asked for it to be re-discussed here, else that wouldn’t have been said in the OP.

There was an extended discussion of this on discord, someone there insisted that I open another thread.
No matter. We can let it die now.

Exactly. The same “shout till people go away” strategy has been used on a number of proposals. It’s like posting an idea is treated as a gauntlet that has to be run, not of technical merit but of pure endurance. An idea is worth implementing only if someone has the fortitude to keep driving it despite massive opposition.

1 Like

I agree that is bad. But it’s equally bad if people propose ideas with little or no justification, and no appreciation of the costs of their proposal. Making people “run the gauntlet” is a bad way of weeding out bad proposals, but we don’t seem very good at finding a better way (I’ve seen far too many “I think it would be neat if Python did such-and-such” proposals that wasted way too much of people’s time because they didn’t get shut down fast enough or effectively enough).

2 Likes

Pardon me, but what IS the cost of implementing my proposal?
You keep saying “there is always a cost”. It seems almost like a mystical belief. Do you perhaps care to elaborate… since I see this discussion isn’t dying out?

Why not simply explain what the cost actually is, and then reject the proposal based on that?

1 Like

It was discussed in the last thread, and mentioned in the reasons for closing the issue. I don’t plan on repeating it here, sorry.

Ah yes, people will rely on it
No one knows how or why but they will rely on it

OK… the thread can die now
I will stop pestering you guys

Some costs are one-off up-front costs. Some are on-going costs.

  • In this specific example, address randomization is used for security reasons. Is it safe to change the hash of None to not rely on its address? Don’t know. A security expert will need to consider it.

  • Somebody has to implement it.

  • Somebody has to document it.

  • Somebody has to write tests for it.

  • Those tests have to run every time we run the test suite, and on the CI server.

  • Now it is a feature of the language that every other Python interpreter has to implement.

  • And that people have to learn.

  • Every extra branch or line of code adds more places that bugs can occur.

  • In most cases, new features add code to the interpreter, making it bigger. If that feature isn’t generally useful, it becomes just bloat.

  • Once we’ve given None a constant hash, what about other singletons? In a month, or a year, somebody will be back with a feature request to make Ellipsis or NotImplemented behave like None.

  • And every new feature comes with some risk: if we make a mistake and some unforeseen problem occurs because of this (I can’t see what that might be, but that’s the problem with unforeseen problems), we’re stuck with it for a long deprecation period before we can remove it.

These costs might be small. Okay, that’s great! That is a point in your proposal’s favour. But balanced against the (hypothetically) small costs is that the benefit is likewise small.

As I pointed out in the other thread you have at least three options for avoiding this issue, and even if we agree to the proposal you won’t get your ultimate aim – consistent set iteration order across separate runs.

You might get something which looks like consistent set order by accident, but it will be unsafe and could be broken at any time. In a year, you’ll be back complaining that despite None’s consistent hash, a point-release bug fix broke your set order consistency, and we’ll say “We told you so!”, and then we’ll need to have another forty- or fifty-post thread about why sets are unordered and why it’s okay to change set iteration order in a bug fix release.

So we have to make a judgement. Small cost, versus small benefit. Which wins? If the cost is higher than the benefit, then this proposal makes Python worse rather than better. If you could find even one single core developer who is willing to champion a PEP, you might have a chance.

Or it might be that in a month or a year or five years, some core developer will be annoyed enough by None hashing under address randomization that he or she will just go ahead and implement it, vindicating your position, and you can come back and tell us “Told you so!”.

The process is not perfect, and it is often annoyingly conservative. I’ve seen many proposals get rejected for many years (e.g. the ternary if operator) until something changes and we suddenly accept it.

But that’s the thing: errors of omission (failing to add a good feature) are much less important than errors of commission (adding a bad feature whose costs are higher than its benefit).

8 Likes

At least you are making some sort of argument that I can respond to. You really should have started with that.

If an operation returns a constant result (as can be observed from the source code, which is open), running it by definition conveys no information to an attacker. I don’t need to be a security expert to know that. If anything, it is the default object hash being returned on statically allocated objects that’s the security risk, since it basically tells you where in memory the Python binary was loaded.
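
To make that last point concrete, here is a hedged sketch of what hash(None) gives away today, assuming CPython’s current default pointer hash (which rotates the address right by 4 bits; an implementation detail) and 64-bit pointers:

    BITS = 64
    MASK = (1 << BITS) - 1
    addr = id(None)                           # address of the None singleton on CPython
    rotated = ((addr >> 4) | (addr << (BITS - 4))) & MASK
    print(hex(addr))
    print(hex(hash(None) & MASK))             # same value as the next line...
    print(hex(rotated))                       # ...i.e. the hash publishes the address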

  • Somebody has to implement it.
    I did

  • Somebody has to document it.
    I’m not too sure about that. I did write a line about it in the blurb. But since no one’s code depends on this behavior before or after the change, nothing bad would happen if we didn’t tell them about it. It is also not a change to the requirements.

  • Somebody has to write tests for it.
    How? None is only equal to itself, right? So as far as requirements go, its hash can be any value that stays constant throughout the run. Pretty sure a literal int32_t constant does that. The only test we could write given the requirements is an assertion that hash(None) == hash(None), which is tautologically true for a literal constant.

  • Those tests have to run every time we run the test suite, and on the CI server.
    I don’t think there is value in running code that checks tautologies in CI.
    Do you have tests today that check hash(None) == hash(None)? No? But what if there’s a bug in Py_HashPointer? It actually does something less trivial than returning a constant.
    If you don’t test it now, why test it after the change?
    If you do test it now, what other tests are needed?

  • Now it is a feature of the language that every other Python interpreter has to implement.
    I disagree. There is no change to the requirements. They can implement it however they like.

  • And that people have to learn.
    No, people can stay oblivious to it.
    People also have to learn that hash(None) is not a constant. Some are very surprised by it. I know I was, and I’ve talked with enough other people to tell you I wasn’t the only one who was surprised.

To a Python dev who knew about it for 10 years now I’m sure it seems like an obvious thing, though.

  • Every extra branch or line of code adds more places that bugs can occur.
    This is true for any change, including mine. It’s hard to see where a bug could hide in a function that returns a constant, but in general you are right.

  • In most cases, new features add code to the interpreter, making it bigger. If that feature isn’t generally useful, it becomes just bloat.
    Something like three lines of bloat, but yes, the cost is not zero.

  • Once we’ve given None a constant hash, what about other singletons? In a month, or a year, somebody will be back with a feature request to make Ellipsis or NotImplemented behave like None.
    None is different from the others because of how Optional is defined.
    Not that I even think it’s bad for other sentinel values to hash to constants. They probably should; it just doesn’t matter in practice.

  • And every new feature comes with some risk: if we make a mistake and some unforeseen problem occurs because of this (I can’t see what that might be, but that’s the problem with unforeseen problems), we’re stuck with it for a long deprecation period before we can remove it.
    There is no change to the requirements, therefore as a special case of that, no new feature.

But I also understand the general sentiment here, that’s why this thread should die. You will go on believing whatever it is you want to believe. This is a terrible change. These are not the droids you were looking for.

1 Like

I understand that and am willing to take my chances. I explain why in the doc that I linked to in the OP.
Pasting it below.

But if Python one day decides to make set iteration behavior non-deterministic, even given fixed operations history and fixed hash values, this change will become pointless

A few counter points,

  • That’s a hypothetical future scenario.
  • This line of reasoning only works if no mechanism remains in the language to prevent such non-determinism from applying (i.e. no facility like PYTHONHASHSEED to disable it). I strongly doubt such a sweeping change will be applied by force.
  • With the exception of super-high-performance concurrent hash map implementations, I don’t know of any language where standard containers have this property. Even in languages that do have such containers, there are alternatives that offer deterministic behavior.

This can ALREADY HAPPEN with integers. You’ve never explained why it’s such a horrific problem for None to have a constant hash, yet absolutely fine for integers to hash to themselves. Whether your argument is security, consistency, or anything else, start by figuring out that one.
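
For reference, int hashes on CPython are already fixed and run-independent (the values noted below are for a 64-bit build, where int hashes are reduced modulo the Mersenne prime 2**61 - 1):

    print(hash(42))          # 42: small ints hash to themselves
    print(hash(-1))          # -2: -1 is reserved as the C-level error marker
    print(hash(2**61 - 1))   # 0: hashes are taken modulo 2**61 - 1 on 64-bit builds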

2 Likes

I started by pointing out that for most users of Python, None’s hash is already constant. You have never explained why you want address randomisation switched on, but without the consequences of address randomisation.

You have written hundreds of words complaining about an optional feature that nobody is forcing you to use.

  • if you don’t want address randomisation, don’t use it;

  • if you don’t want your hash to depend on None, write your hash function so it doesn’t depend on None.

If there is some reason why you need address randomisation and you need to mix the hash of None into your hash function, you should tell us.
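
A sketch of that second option, reusing the kind of frozen-dataclass key discussed earlier in the thread (the names and the sentinel value are purely illustrative; dataclass keeps an explicitly defined __hash__):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class Job:
        priority: int
        deadline: Optional[int]

        def __hash__(self):
            # Map None to a fixed sentinel so the key's hash never touches
            # None's address-derived hash; pick a sentinel that real
            # deadlines cannot take.
            deadline = -1 if self.deadline is None else self.deadline
            return hash((self.priority, deadline))

    # Equality still comes from the generated __eq__; only hashing changes.
    jobs = {Job(1, None), Job(2, 7)}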

The potential security risk is not whether or not the hash value is visible by reading the source code. The potential security risk is that in circumstances where users are expecting address randomisation, None will hash to a constant known value instead of an unpredictable value.

Is that a security risk? I don’t know, I never would have guessed in a million years that prior to Python 3.2, the hashing function used for strings was a serious security risk. I’m not an expert. Neither are you, so I’m not going to accept your assurances that this is fine.

Address randomisation is used for security reasons so if we change something relying on that address randomisation, it is prudent to get a security expert to check it out first.

No one depends on this feature? Not even you? Then why do you care so much about it?

Suppose we agreed to implement this feature in 3.12.0a2. Great, you’re happy, now you can use it. And then, because it’s not a documented language feature, just an implementation detail that “no one depends on”, it will be fine to remove it again in 3.12.0b. Right?

Probably not.

And that’s why it needs to be documented and tested as a language feature; otherwise it’s just an implementation detail that any core developer can change at any time.

And if they do, you will be back here complaining that your code depends on that undocumented implementation detail and we broke your code by changing it.

Just like you have been complaining now.

That assertion is already true today, even with address randomisation.

The assertion that you actually need is that the hash value stays constant across multiple runs.
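
For what it’s worth, a test of the property you actually want has to span interpreter runs. Here is a minimal sketch using only the standard library (the test name is made up), and it fails on current CPython precisely because of address randomisation:

    import subprocess
    import sys

    def test_hash_none_is_stable_across_runs():
        cmd = [sys.executable, "-c", "print(hash(None))"]
        runs = [subprocess.run(cmd, capture_output=True, text=True).stdout
                for _ in range(2)]
        # Only passes if hash(None) no longer depends on where the
        # interpreter happens to be loaded in memory.
        assert runs[0] == runs[1]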

1 Like

Address randomization is on by default, isn’t it?

rosuav@sikorsky:~$ python3 -c 'print(hash(None))'
5889358057582
rosuav@sikorsky:~$ python3 -c 'print(hash(None))'
5887077081966
rosuav@sikorsky:~$ python3 -c 'print(hash(None))'
5907908226158
rosuav@sikorsky:~$ python3 -c 'print(hash(None))'
5929608775022
rosuav@sikorsky:~$ python3 -c 'print(hash(None))'
5911443765614

So disabling it would require building Python specially for the purpose, and that could easily introduce unintended errors of its own.