Using pointer to return result of long_hash

In some cases, like hash(-1), the hash function returns -2 instead of the correct result -1, because in CPython the value -1 means an exception. This problem has already been mentioned in Issue 127814, but that issue was closed for compatibility reasons. I have now fixed it in my local environment by passing the hash result through a pointer parameter instead of the return value, and it runs perfectly. Can I push my code to the CPython repository? Should I edit the relevant notes and documentation? Which documentation should I edit?
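For readers who want to reproduce the behavior under discussion, a quick check on CPython shows the clamping:

```python
# CPython clamps a computed hash of -1 to -2, because -1 is the
# C-level error sentinel for tp_hash.
print(hash(-1))              # -2 on CPython
print(hash(-2))              # -2
print(hash(-1) == hash(-2))  # True on CPython
```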

Wouldn’t this still introduce backward incompatibility, as stated in the linked issue? (I’m assuming you’re returning a pointer to be able to have hash(-1) == -1.)

Why do you care what hash(-1) returns?

1 Like

The “correct result” according to the documentation is hash(-1) == -2.

  • If x = m / n is a nonnegative rational number and n is not divisible by P, define hash(x) as m * invmod(n, P) % P, where invmod(n, P) gives the inverse of n modulo P.

…
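The documented rule can be checked directly in CPython, using the three-argument pow(n, -1, P) as invmod (this requires Python 3.8+; the values here are a spot check, not CPython source):

```python
import sys
from fractions import Fraction

P = sys.hash_info.modulus  # 2**61 - 1 on 64-bit builds

# The documented rule: hash(m/n) == m * invmod(n, P) % P
# for a nonnegative rational m/n with n not divisible by P.
m, n = 3, 7
assert hash(Fraction(m, n)) == m * pow(n, -1, P) % P
```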

What hash actually does should be regarded as an implementation detail in any case. What inevitably-brittle code do you wish to write that uses hash for your own purposes, and that for some reason mustn’t use a simple wrapper function that checks if x < 0 and returns whatever you want?

How did you “fix” that?

It doesn’t make any sense to discuss unknown code. “Talk is cheap. Show me the code.”

What will hash(-1) be in your version?

@JamesParrott @skirpichev Thanks for your replies; next I will explain my thoughts.

What actually is the problem?
Not only -1: any object whose hash value is -1 has that value turned into -2 by the hash function. hash(-1) is just an example.
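The claim about arbitrary objects is easy to verify with a toy class (an illustration, not CPython code; the class name is made up):

```python
class ReturnsMinusOne:
    """A class whose __hash__ deliberately returns -1."""
    def __hash__(self):
        return -1

# CPython replaces the -1 with -2, because at the C level a return
# value of -1 from tp_hash means "an exception occurred".
print(hash(ReturnsMinusOne()))  # -2
```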

Why do I care about it?
hash(-1) == hash(-2) is an anomaly. Python uses -1 as a function’s return value when an error occurs, but sometimes the hash value of an object is -1, so the object’s hash function has to return -2 in order for the error mechanism to work. Although this “solution” has been documented and -2 has been accepted as the “correct result”, this issue still leads to some problems and confusion, so I want to fix it.

What’s my solution?
My solution is to use a pointer parameter to pass the real hash value and use the return value to report errors.
For example, this is the definition of long_hash, which calculates the hash value of a long object:

static Py_hash_t
long_hash(PyObject *obj)
{
    PyLongObject *v = (PyLongObject *)obj;
    Py_uhash_t x;
    Py_ssize_t i;
    int sign;

    if (_PyLong_IsCompact(v)) {
        x = (Py_uhash_t)_PyLong_CompactValue(v);
        if (x == (Py_uhash_t)-1) {
            x = (Py_uhash_t)-2;
        }
        return x;
    }
    i = _PyLong_DigitCount(v);
    sign = _PyLong_NonCompactSign(v);
    x = 0;
    while (--i >= 0) {
        x = ((x << PyLong_SHIFT) & _PyHASH_MODULUS) |
            (x >> (_PyHASH_BITS - PyLong_SHIFT));
        x += v->long_value.ob_digit[i];
        if (x >= _PyHASH_MODULUS)
            x -= _PyHASH_MODULUS;
    }
    x = x * sign;
    if (x == (Py_uhash_t)-1)
        x = (Py_uhash_t)-2;
    return (Py_hash_t)x;
}

Notice that when the real hash value x is -1, it is changed into -2. I changed this code to something like this:

static int
long_hash(PyObject *obj, Py_hash_t *result)
{
    PyLongObject *v = (PyLongObject *)obj;
    Py_uhash_t x;
    Py_ssize_t i;
    int sign;

    if (_PyLong_IsCompact(v)) {
        x = (Py_uhash_t)_PyLong_CompactValue(v);
        *result = x;
        return 0;
    }
    i = _PyLong_DigitCount(v);
    sign = _PyLong_NonCompactSign(v);
    x = 0;
    while (--i >= 0) {
        x = ((x << PyLong_SHIFT) & _PyHASH_MODULUS) |
            (x >> (_PyHASH_BITS - PyLong_SHIFT));
        x += v->long_value.ob_digit[i];
        if (x >= _PyHASH_MODULUS)
            x -= _PyHASH_MODULUS;
    }
    x = x * sign;
    *result = x;
    return 0;
}

It passes the result through the pointer parameter instead of the return value.
The function that receives the result is edited like this:

static PyObject *
builtin_hash(PyObject *module, PyObject *obj)
{
    Py_hash_t x;

    if (PyObject_Hash(obj, &x) == -1)
        return NULL;
    return PyLong_FromSsize_t(x);
}

So hash will return the real hash value instead of a tampered one.
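To make the calling-convention change concrete, here is a hypothetical Python model (the function names and the list-as-pointer are made up for illustration; the actual change is in C):

```python
import sys

P = sys.hash_info.modulus  # 2**61 - 1 on 64-bit builds

def compute(n):
    # The published integer-hash algorithm: abs(n) % P, with n's sign attached.
    h = abs(n) % P
    return -h if n < 0 else h

def hash_current(n):
    # Today's convention: -1 is reserved as the C-level error sentinel,
    # so a computed -1 is clamped to -2.
    h = compute(n)
    return -2 if h == -1 else h

def hash_proposed(n, out):
    # The proposal: the return value only carries an error status (0 = ok),
    # while the real hash travels through a "pointer" parameter.
    out[0] = compute(n)
    return 0

out = [None]
status = hash_proposed(-1, out)
print(hash_current(-1), status, out[0])  # -2 0 -1
```

Under the proposal, the status code and the hash value travel on separate channels, so no hash value needs to be reserved for error signalling.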

After this change, what will hash return?
It won’t return “whatever I want”; it will just return the object’s real hash value calculated by Python’s hash algorithm. I just pass this result a different way, without any modifications. For example, hash(-1) will return -1, which is the real value calculated by the hash algorithm, whereas the original code returns -2, modified to accommodate the error-handling mechanism.

If you have a better solution, please share your idea!

I won’t “return” a pointer; instead I pass the value through a pointer parameter, for better compatibility.

I’m unclear on what problems this leads to. Yes, people are surprised by it, but what actual problem does it cause? The hash() function isn’t required to produce unique values, and cannot in any case, because its input space is far larger than its output space:

>>> hash(1)
1
>>> hash(2**61)
1
2 Likes

FWIW, hash(-1) was how I learned this, because I had written code that assumed the outputs were unique, and someone showed me how it failed on -1.

I think having an easy collision case to point at is very valuable for pedagogical purposes.

3 Likes

If you have a better solution, please share your idea!

If you demonstrate your “solution” solves an actual problem, not just an artificial non-problem, I’ll gladly give it more consideration. You still haven’t given an example of good code that depends on the values returned by hash, or explained why those of you who care can’t simply write a wrapper function. That would be a good start. Not to mention that exposing pointers can lead to all kinds of security issues. It’s much easier simply not to do so, and not have to think about that risk any further.

hash is not intended to be a cryptographically secure hash function (see hashlib for those). It’s pretty much just an implementation detail for set and dict.
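For code that needs stable, meaningful hash values, hashlib is the right tool; a quick contrast:

```python
import hashlib

# hashlib digests are a stable function of the input bytes, across runs,
# releases and platforms; hash() is not (think string hash randomization).
digest = hashlib.sha256(b"hello").hexdigest()
print(digest)
```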

What actually is the problem?
Not only -1: any object whose hash value is -1 has that value turned into -2 by the hash function. hash(-1) is just an example.

To put it bluntly, so what?

hash(-1) == hash(-2) is an anomaly.

It’s merely a curiosity, with educational value, as Ned points out. It resulted from a design decision to avoid the error return value (another implementation detail).

this issue still leads to some problems and confusion,

What problems? Is there a bug or a performance penalty? If so, please provide a minimal reproducible example or benchmarks, respectively.

And confusion among whom? Among anyone other than those who are already confused enough to think hash must return anything meaningful?

it will just return the object’s real hash value

This is meaningless. Hash values are arbitrary, not “real”. Hash values for hash tables (such as dict and set) just need to produce a wide enough spread across the buckets to achieve O(1) lookup on average.
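As an illustration (a simplification; CPython’s dict actually uses open addressing with a mask and perturbation), a hash table only uses the hash modulo the number of buckets, so the absolute value of a hash is irrelevant:

```python
# Toy model: only hash(key) modulo the bucket count matters.
def bucket_index(key, nbuckets=8):
    return hash(key) % nbuckets

# -1 and -2 share a hash and therefore a bucket; lookups still work,
# because the table compares keys with == after finding the bucket.
print(bucket_index(-1), bucket_index(-2))
```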

1 Like

This “problem” looks rather artificial to me. Hash collisions by themselves aren’t a problem, as people have already told you.

Please describe these “some problems”. The rest doesn’t make sense without this, as you suggest a huge API break. You should have good arguments to show that the cure is really better than the disease.

Well, you break the tp_hash API and PyObject_Hash. Finally, you break the algorithm for hashing integers, and projects that reproduce it for their own integer types (like gmpy2).

I’m not sure that there is anything to be solved. The “problem” might be obvious to you, but clearly not to many people in this thread.

What you propose is a major breaking change. long_hash is an implementation of the tp_hash protocol; it should have a signature compatible with Py_hash_t (*)(PyObject *). Changing the tp_hash protocol in such a way will break any extension class that implements hashing. PyObject_Hash() is part of the public C API and ABI. Changing it will break virtually every extension and every program that embeds Python. This will not happen, not until we abandon the existing C API and a large part of the Python ecosystem.

There is a more sane way to make the transition. We could introduce a tp_hash_new slot and a new PyObject_HashNew() API, have the CPython core maintain synchronization between tp_hash_new and tp_hash, and use tp_hash_new first with tp_hash as a fallback. This would add code to handle special cases in some already complicated places, and it would be left there for decades (tp_getattr is still here). It would still not be completely backward compatible, because third-party code can rely on details of the current behavior. All this hassle to solve a minor problem that may not exist, with no guarantee of a solution.

Personally, if I were doing it from scratch, I would make zero the special value, not -1, because many formulas that mix hashes degrade at zero (h1 + h2, h1 ^ h2, (h << n) | (h >> m), etc.). It would also mean always using an unsigned type for the hash. But what’s done is done.
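The degradation at zero is easy to see (a quick illustration, not CPython code):

```python
# If one operand's hash is 0, the common mixing formulas collapse:
h1 = 0
h2 = 0xDEADBEEF

print(h1 ^ h2 == h2)   # True: xor with 0 just passes h2 through
print(h1 + h2 == h2)   # True: addition with 0 likewise

# Any bit rotation of 0 is still 0 (64-bit rotate-left by 7 shown):
rotated = ((h1 << 7) | (h1 >> 57)) & ((1 << 64) - 1)
print(rotated == 0)    # True
```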

5 Likes

I completely agree with the rest of the post.

A question from a newbie: isn’t zero a more common hash value? For example, it’s the hash of False.

About the unsigned type: negative integers have a negative hash. If they had a positive hash, wouldn’t their hashes collide with those of positive integers? I think it could also place a restriction on hash algorithms.

1 Like

Returning -hash(-x+1) for negative x seems like a much less severe change.

My question above is still unanswered:

No reason has been given to change anything, so there is no need to discuss what the changes would be.

2 Likes

And hash(x) == 0 for any x such that x == 0. It doesn’t matter whether type(x) is int, bool, float, complex, fractions.Fraction, or decimal.Decimal. If x == y, then hash(x) == hash(y) regardless of the numeric types involved. This isn’t obvious, or “free”. A lot of work was done to make it true.
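The invariant is easy to spot-check:

```python
from fractions import Fraction
from decimal import Decimal

# x == y implies hash(x) == hash(y), across all built-in numeric types.
values = [True, 1, 1.0, 1 + 0j, Fraction(1), Decimal(1)]
assert len({hash(v) for v in values}) == 1

# And zero hashes to 0 in every numeric type:
print(hash(0) == hash(0.0) == hash(Fraction(0)) == hash(Decimal(0)) == 0)
```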

True! They will not. But I’ve never seen code that relies on that, and I hope I never will: hash values are mostly accidents of implementation, and not guaranteed across implementations, or releases, or in some cases not even across two runs of the same program using a single release (think “hash randomization” for strings).

3 Likes

Oh, I’m not scared about the randomness of hash values :slight_smile: My doubt was about uniformity, but I was not taking into consideration that int is unbounded in Python, so there are nearly infinite values a hash function could return, even if you limit it to “only” positive integers. Am I right?

No. Hash function return values need to fit in 64 bits (32 bits on 32-bit systems), since they might be passed through C code as non-PyObjects. But you are correct that Python ints are unbounded, so some int values have to share hash values.

But uniformity is not broken by a single value being shifted around. O(1) “mistakes” don’t matter relative to O(2^32) or O(2^64).

3 Likes

@MegaIng is right. While nothing about this is guaranteed across releases, for ints i the hash value is actually

abs(i) % sys.hash_info.modulus

with the sign of i attached.

>>> import sys
>>> sys.hash_info.modulus
2305843009213693951
>>> _ == 2**61 - 1
True
>>> hash(sys.hash_info.modulus)
0
>>> hash(-sys.hash_info.modulus) # so there are negative ints with 0 hash
0
>>> hash(-sys.hash_info.modulus-1)
-2
>>> hash(-sys.hash_info.modulus-2)
-2

The modulus is both a prime and a solid string of 1 bits. The latter allows fast computation of the mod without using division (akin to “casting out 9s” in decimal via addition). “Prime” is valuable for more than one reason. The subtle one is that it allows computing a hash for rational values that is the same regardless of the base their numerator and denominator are expressed in, and without dividing by their gcd first.
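The “no division” trick can be sketched in Python (a model of the idea, not CPython’s actual C code):

```python
import sys

P = sys.hash_info.modulus        # 2**61 - 1 on 64-bit CPython
assert P == (1 << 61) - 1

def mod_mersenne61(x):
    """Reduce a nonnegative x modulo 2**61 - 1 with shifts and adds only.

    Because 2**61 is congruent to 1 (mod P), the high bits can be folded
    into the low bits, much like casting out nines in decimal.
    """
    while x >> 61:                   # while x >= 2**61
        x = (x & P) + (x >> 61)      # fold high bits down
    if x >= P:                       # x may equal P exactly
        x -= P
    return x

print(mod_mersenne61(2**61))         # 1, since 2**61 = P + 1
```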

>>> import fractions, decimal
>>> hash(fractions.Fraction(15, 1000))
1694794611772065054
>>> hash(decimal.Decimal("0.015"))
1694794611772065054
>>> fractions.Fraction(15, 1000)
Fraction(3, 200)
>>> m = sys.hash_info.modulus
>>> 3 * pow(200, -1, m) % m
1694794611772065054
>>> 15 * pow(1000, -1, m) % m
1694794611772065054

This breaks down if the denominator is a multiple of the modulus, because there is no multiplicative inverse then. That’s a special case.

1 Like