Attention from a security expert needed please

steven.daprano · November 30, 2022, 7:35am

There are a few long running threads occurring here (Discuss), on the bug tracker, and the python-dev mailing list, about replacing the hash of None with a constant rather than the current algorithm. I’m not going to link to them, as they are long and tedious and are generating more heat than light. (One of them has already been locked by the Discuss mods.)

I would like to focus on one, very narrow, aspect of that discussion, namely the security implications.

Some Python builds use Address Space Layout Randomisation (ASLR), which consequently gives None a different address each time it the interpreter runs, and hence a different hash.

Without necessarily weighing in on any other aspect of the proposal to give None a constant hash, my question for security guys is: given that ASLR is used for security purposes, are there any security concerns with giving None a constant hash?

I can’t see any, but what do I know? I didn’t see the security issues behind string hashing or int to decimal string conversions either

Thanks in advance.

gpshead · November 30, 2022, 8:28am

Your question doesn’t make sense to me as it has nothing to do with ASLR.

So no there are no security concerns with the hash value of None. Either as object identity or as a constant.

vstinner · November 30, 2022, 10:19am

I don’t understand the relationship between None hash value and security. What is the attack vector and security model of the None hash value?

The hash DoS vulnerability was about HTTP parsers using untrusted strings as keys in a dictionary and there was a worst case complexity which can be used to create a denial of service.

By the way, integers hashes are not randomized.

steven.daprano · November 29, 2022, 1:33pm

Sorry Gregory, I thought I had explained it in enough detail. The value of hash(None) depends on whether or not the interpreter is built with ASLR.

Without ASLR, hashing None returns the same value on every run:

$ for a in `seq 0 5`; do python3.10 -c "print(hash(None))"; done
472820
472820
472820
472820
472820
472820

With ASLR, hashing None returns a different value on every run:

$ for a in `seq 0 5`; do python3.7 -c "print(hash(None))"; done
8729577290887
8781152788871
8787017625991
8749338663047
8736773962375
8779554604423

Are there security implications of having hash(None) always return a known constant? I would think not, but then a few months ago I never would have guessed that there were security implications of str(some_int).

If you still stand by your answer of no security consequences, that is one fewer argument against the proposal to have hash(None) return a constant.

Thanks for your time.

steven.daprano · November 29, 2022, 1:45pm

Sorry Victor if I wasn’t clear enough, I’m not saying that there are security issues to do with hashing None, I’m asking if there might be.

To make it clear, the relationship is this:

Address randomisation is used for security hardening, but it also leads to the hash of None changing, which broke somebody’s code. They were relying on hash(None) always having the same value. Since that was broken by a security feature, the question posed on the Discuss thread was whether there is some need for hash(None) to change, or whether it is just an accident of the address randomisation.

Thanks for your response.

vstinner · November 30, 2022, 11:04am

I gave a link to the only known vulnerability related to hash(object). It has been fixed by randomizing hash(str) and hash(bytes).

gpshead · December 1, 2022, 12:23am

nit: That code was relying on something being constant across runs that was never guaranteed to be. So it was technically broken by an incorrect assumption. Not by a security feature.

Can Python’s default object hash based on the memory address lead to leaks of the process ASLR details? Yes.

I expect there to be many points where ASLR details are revealed in typical applications, certainly so in CPython. In security terms when crafting exploits, I’ve heard these points called an oracle. Thus I do not consider it to be a big deal.

If someone wants to minimize these I suggest doing it from a holistic point of view to try and minimize an entire class of information leaking oracles, ideally ones we have evidence of actually being used in CPython application attacks for that purpose, rather than picking one singleton instance, pun intended, to do it on. Doing so is likely an unwinnable whack-a-mole game, is that the best place to spend effort?

Explicit ordering philosophy digression:

This kind of assumption being baked into code because it was observed to be true in a development environment and thus assumed to be guaranteed - a logical fallacy, that most of us are guilty of - is convenient and natural. Humans aren’t known for logic.

It is an example of why I was generally against adopting iteration order being insertion order in dicts as a feature: It violates “explicit is better than implicit”. We later decided that it is a guaranteed behavior to be relied upon in 3.6+ (documentation wise: 3.7+). I’d personally have preferred an explicit request for insertion order iteration or, better, an explicit request for iteration to be insertion order upon dict creation. Water under the bridge, the opportunity to do that has passed, it’d be unnecessarily disruptive to change.

Anywhere we don’t guarantee a specific behavior related to ordering or hashing we IMNSHO should ideally insert at least a little randomness by default to avoid any code from depending on the implementation details that observed behavior in one environment or implementation is more than chance. (ex: wrongly assuming constant hashes, wrongly assuming that ordering in unordered containers is canonical based upon either the contents or the sequence of operations leading up to the iteration, etc.)

Even silly seeming little tweaks to iteration like randomizing the starting point and flipping a coin to pick forward vs backwards iteration are enough to prevent most people from writing code assuming order when none is guaranteed without meaningfully impacting iteration performance.

I don’t expect everyone to share this view. It is from learned experience managing a huge codebase for over a decade. Our hash randomization by default cleanup from the 2.7 era at work identified and fixed real bugs in code. There isn’t anything preventing those types of bugs from returning now.

Rosuav · December 1, 2022, 2:32am

I’m glad nobody was advocating this for dictionaries, since Python explicitly guarantees that this won’t happen for them - a guarantee that significantly predates the actual ordering guarantees that we have now. The oldest docs I have available are for 2.7, but this was here much earlier than that:

https://docs.python.org/2.7/library/stdtypes.html#dict.items

If items(), keys(), values(), iteritems(), iterkeys(), and itervalues() are called with no intervening modifications to the dictionary, the lists will directly correspond.

This precludes any sort of randomization on iteration, so at best, all you’d be able to do is something like “every time the dictionary is mutated, introduce some randomness”. And if you were to introduce something like that, I can pretty much guarantee you that there’d be people calling for a way to set the seed (just like with string hash arandomization) in order to retain reproducibility.

pitrou · December 1, 2022, 10:55am

I’d turn the question on its head: is there any need for hash(None) not to change?

pf_moore · December 1, 2022, 11:04am

Only the generic reasons:

Using something other than the default (object address) requires some explicit code to be added.
Someone has to choose a hash value (which is arguably trivial, but maybe we should avoid 0 or 1, so maybe not totally trivial).

The PR doing this was rejected, so @rhettinger felt there were enough reasons to say no. As far as I can see, no other core dev feels strongly enough to argue. I certainly don’t.

pitrou · December 1, 2022, 11:16am

You’re answering the question “why did it not change before?”; not the question I asked: “does it need not to change?”.

pf_moore · December 1, 2022, 11:42am

No, it doesn’t need not to change (as long as no-one has a security concern, which was the point of this thread, and which I can’t answer other than to say “I’m not an expert but it looks OK to me”).

steven.daprano · December 1, 2022, 4:10am

Now that the security issue seems to have been settled, probably not.

Are you volunteering to do the work (and possibly get into an argument with Raymond and Brett who have come out in opposition to it)?

We know that the driving motivation for this is one person who wants to be able to rely on that hash being constant, we’ll need to document this as a feature, not just an accidental implementation detail.

Otherwise somebody might come along and say “is there any reason not to change this to a different value?”

ronaldoussoren · December 2, 2022, 8:05am

I guess this is off-top for this thread, but even with a change to the hash of None the stability of set iteration that the OP asks for would be limited because the hash function used for strings is not documented as stable, and more importantly is documented as build-time changeable option (PEP 456). That hash function has changed before and can change in the future if someone finds a better hash function.

Rosuav · December 2, 2022, 8:20am

See other thread. Stability across versions is not a requirement here and the possibility for the hash function to change does not affect determinism.

steve.dower · December 2, 2022, 12:58pm

Some of us did, for exactly the reasons Greg calls out. Iteration randomisation per-process was the proposal, mainly to ensure that test suites didn’t take an accidental dependency on insertion ordering that would prevent us from adding optimisations in the future (e.g. FrozenMap couldn’t have been frozendict because - among other things - it couldn’t cheaply preserve insertion order, which we’d guaranteed for dict and assumed would be implied for specialised dicts).

The proposed way to get reproducibility was to use OrderedDict, which would’ve been dict with a “do not randomise” flag, at least until we found a way to make dict even more efficient by changing the ordering.

But it was certainly advocated at the time.

Rosuav · December 2, 2022, 1:11pm

Per-process is at least sane, if fairly unhelpful. But the proposal to have the iteration order change each iteration is fundamentally broken.

gpshead · December 3, 2022, 10:31pm

False. Even with our age old parallel iteration ordering guarantees between our iteration APIs when the dict has not been modified, it could still have been done while maintaining that invariant: Gate the selection of a new iteration seed on dict modification.

Since 3.7 we decided that’d never happen as our new dict implementation in 3.6 offered order as a convenient side effect.

Always random iteration could still be implemented for set and would reduce bugs in peoples code.

Even while agreeing with Guido that it’d be nice if set had the same insertion order based iteration as dict if it didn’t cost us performance or memory. Until such a thing exists, preventing non-explicit order reliance bugs should be considered a desirable feature.

Other languages have done it. Go did it in 1.0.

“When iterating over a map with a range loop, the iteration order is not specified and is not guaranteed to be the same from one iteration to the next. Since the release of Go 1.0, the runtime has randomized map iteration order.” - Golang

We’ve been migrating not explicitly ordered C++ containers to work this way at work as well, at least during testing. It exposes hidden undeclared order dependence bugs.

Rosuav · December 3, 2022, 10:35pm

The proposal was that every call to iter() would rerandomize. You keep claiming that my statement is false by proving that a DIFFERENT randomization process would be perfectly acceptable. I stand by my ORIGINAL STATEMENT that the proposal would violate invariants, and yet people keep telling me that I’m flat-out wrong.

Sigh. This is getting tiresome.

gpshead · December 3, 2022, 11:15pm

the proposal

We’re just talking past each other. I never actually made a proposal of a fleshed out design. So what people in this thread are referring to as “the proposal” isn’t something concrete that we’re even all in agreement on.

Why do I feel respond at all then? I’m attempting to be constructive to show how it can be done. I disagree with wording such as “fundamentally flawed” in this context because without a concrete design being discussed, it isn’t fair. The entire concept of intentionally unstable iteration order used to help prevent undeclared order dependence bugs in code is widely deployed and proven, not fundamentally flawed. Which is my way of interpreting how such words, left to themselves, would be read by some set of others.