Int/str conversions broken in latest Python bugfix releases

It looks like pickle is happy to dump an integer but fails to read it back again:

>>> import pickle
>>> s = pickle.dumps(10)
>>> pickle.loads(s)
10
>>> s = pickle.dumps(10**10000)
>>> len(s)
4170
>>> pickle.loads(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Exceeds the limit (4300) for integer string conversion
2 Likes

pickle happily loads large ints; the exception is coming from the REPL's implicit repr() of the result.

>>> import pickle
>>> s = pickle.dumps(10**10000)
>>> i = pickle.loads(s)
>>> hex(i)
'0x9b84ea28556bf2...'
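To make the distinction concrete, here's a sketch (using the sys.set_int_max_str_digits API added alongside the limit; the hasattr guard is only for unpatched interpreters):

```python
import pickle
import sys

# The limit applies to int <-> decimal-string conversion, not to pickling:
# the round trip succeeds even though repr() of the result would fail.
big = 10**10000
restored = pickle.loads(pickle.dumps(big))
assert restored == big

# Raising the limit then makes str()/repr() of the value legal again.
if hasattr(sys, "set_int_max_str_digits"):  # absent on unpatched interpreters
    sys.set_int_max_str_digits(20_000)
print(len(str(restored)))  # prints 10001
```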
4 Likes

I've been thinking about this for things like Spyder, Jupyter, etc. as well. One problem with that approach, though, is that code which seems fine in one context then fails in another.

3 Likes

Ah, good point.

Regarding pickle: if you explicitly use the ancient low pickle protocols 0 & 1, they encode the int via str/repr and thus fail:

>>> pickle.dumps(10**5000, 1)
ValueError: Exceeds the limit (4300) for integer string conversion

Those are very old protocols; I don't know why they would ever be used today.
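For contrast, a quick check that the binary protocols (2 and up), which store the int's bytes directly, are unaffected:

```python
import pickle

big = 10**5000
for proto in range(2, pickle.HIGHEST_PROTOCOL + 1):
    # Binary protocols never take the decimal-string path, so the digit
    # limit is never consulted.
    assert pickle.loads(pickle.dumps(big, proto)) == big
```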

1 Like

That was my initial thinking, but it may be more practical to acknowledge that when a user is sitting in front of a terminal entering code interactively, a higher limit makes sense as a way to reduce friction.

For notebooks, as those are higher level and have more control over their environment, one could even imagine automating the choice up front and recording it via an explicit sys.set_int_max_str_digits() call in an initial cell of the notebook. Or automagically suggesting the insertion of an appropriate one upon encountering a ValueError from an attempted int repr?
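As a sketch of the "automagic" idea (tolerant_repr is a hypothetical name, not an existing API), a front-end could catch the error and retry:

```python
import sys

def tolerant_repr(value, bump=1_000_000):
    """Hypothetical front-end helper: if repr() trips the int-to-str digit
    limit, raise the limit and retry. Note: leaves the global limit raised."""
    try:
        return repr(value)
    except ValueError as exc:
        if "integer string conversion" not in str(exc):
            raise  # some unrelated ValueError
        sys.set_int_max_str_digits(max(sys.get_int_max_str_digits(), bump))
        return repr(value)
```

A REPL could install something like this via sys.displayhook; a notebook kernel could do the equivalent in its own result formatter.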

2 Likes

This is algorithm 1.26 from the reference cited above:

from functools import lru_cache

@lru_cache
def integer_log(y: int, x: int) -> int:
    """Return floor(log(y, x)) for integers y >= 1 and x >= 2."""
    r = e = 0
    while y >= x:
        d = x
        m = 1
        while y >= d:
            y, rem = divmod(y, d)
            r = r or rem  # exactness flag (any nonzero remainder); unused here
            e += m
            if y > d:
                d *= d
                m *= 2
    return e

# Algorithm 1.26:

def format_int(A: int, B: int = 10) -> str:
    """Format integer A as a string in base B"""
    if A.bit_length() < 1000:
        # Here we use str for the base case but really this should be something
        # that works for arbitrary bases using the standard O(n^2) algorithm.
        # The threshold should be tuned.
        return str(A)
    else:
        # find k so that B**(2*k-2) <= A < B**(2*k)
        e = integer_log(A, B)
        if e % 2 == 1:
            k = (e + 1) // 2
        else:
            k = e // 2 + 1
        #assert B**(2*k - 2) <= A < B**(2*k)

        Bk = B**k
        Q, R = divmod(A, Bk)
        r = format_int(R, B)
        q = format_int(Q, B)
        pad = '0'*(k - len(r))
        return ''.join([q, pad, r])

Unfortunately it doesn't seem to perform as well as algorithm 1.25 and seems to have quadratic run time. For large inputs half of the time is in divmod and half in integer_log. Improving integer_log might not be difficult but improving divmod seems harder. I'm wondering if divmod has the complexity I thought it had…

In fact the time in integer_log is just spent in its own call to divmod so everything comes down to that.

Better algorithms are key; they don't have to be implemented in C. I have a PR in progress, using algorithms written by some experts. With it:

./python -m timeit "i = int('1' * 10**7)"
1 loop, best of 5: 20.4 sec per loop
3 Likes

Please keep hyperbole out of this discussion. Python has not become unusable. The maintainers clearly care about the success and safety of the language.

It's possible to participate in this discussion without adding negativity or stress. Please show the same understanding and patience that the maintainers are showing in their responses.

11 Likes

Having a context that can locally override the max_str_digits setting would be nice. There's a draft PR, though it still needs a bunch of polish. With it, you can do:

with _pylong.localcontext() as ctx:
    ctx.max_str_digits = 0
    n = (1 << 100_000) - 1
    s = str(n)

Likely _pylong wouldn't be a public API. Also, you probably want an easy way to go to unlimited mode, e.g.

with unlimited_str_digits():
   ...
4 Likes

To me, the thing that's really frustrating about this change is that its breakage could seemingly have been avoided entirely by just making the function faster. Moving this from an n^2 algorithm to an n*log(n)^2 one would be enough to make it a completely infeasible DoS vector. Given that, and the relatively low severity of the attack (no known targets in the two years it has been known about), it seems very odd that a decision was made to urgently make breaking changes in a patch release rather than just fixing the algorithm.

10 Likes

Would you be willing to be personally responsible for any issues arising from a bug in the implementation of this algorithm, especially across all current usage of Python?

What if certain numbers cause incorrect branch prediction on the CPU, leading to runaway slowdown or hardware crashes? What if certain compilers make incorrect assumptions, causing race conditions on memory writes, leading to incorrect results?

There is a huge benefit to using the existing implementation: beyond the explicit testing, millions of users have implicitly tested many scenarios.

This consideration applies less for unreleased versions of Python.

Is the intent that if these algorithms are adopted the recent change (to several Python versions) would be reverted?

1 Like

That's not the goal. The quadratic-time behavior of decimal string <-> int conversions has been an ongoing background complaint for decades. With an odd mix of complainants: newbies, and number theory researchers :stuck_out_tongue_winking_eye:.

Follow the links starting with what Neil gave, and you'll find issue reports going back to long before DoS complaints became a thing, and a trail of abandoned code.

The number-theory researchers should install gmpy2. Python will never compete with that on speed. Follow all the code involved, and GMP may well have more lines of C code (plus platform-specific assembler) supporting "state-of-the-art fast" conversions than Python's whole code base. We don't have a small army of expert numeric volunteers willing to take on such a maintenance burden, but GMP does.

What's different now, and what Neil is pursuing, is leaving asymptotically faster algorithms in Python, called by the C code when arguments get "too big". That enormously lowers the bar for design, correct coding (including avoiding all of the near-infinite possibilities for refcounting, memory-management, and error-checking mistakes in C), readability, and maintenance. The tradeoff: except in rare cases, these will certainly be slower than they could be in C. Don't care. If we can reduce a thing from O(n**2) to (say) O(n**1.6), that's far more valuable than cutting a constant factor of (say) 2.

Whether it will be enough to revert the DoS mitigation changes, I have no idea. How fast is fast enough? There's no limit on the size of Python ints or strings, except for the amount of VM available to store them, and even a linear-time algorithm may be called a "DoS vulnerability" if the inputs are fat enough. The algorithms in question will never be as fast as linear-time, not even if GMP is used.

8 Likes

I've been looking at writing algorithm 1.25 from above in C for CPython. It's actually not difficult to write that algorithm in C with the PyLong API. The difficult thing (at least from my perspective) is that the existing code is somewhat spaghetti-like and needs a refactor:

It's taking me some time to figure out exactly what each macro etc. in the existing code does, so I can be sure what the implications of any potential change are.

I like the idea of making it possible to fix things like this in Python, but with the limit set at 4300, the main target for improvement is the small-ish to intermediate range of integer sizes, and at that level I expect a C implementation is still going to be significantly faster than a Python one.

4 Likes

Why? It looks like the Python code Neil has now (str_to_int() in gh-90716) is already faster, though likely with the same asymptotic behavior (basically the same algorithm, but top-down recursive instead of bottom-up). Although that's limited to decimal string inputs, because nobody in real life gives a rip about strings in bases other than 10 and powers of 2 (and the latter are already fast).

That limit doesn't make sense to me, and I haven't been able to find the reasoning behind it. Here with the currently released 3.10.6 on my box:

$ py -m timeit -s "s = '9' * 4300; i = int(s)" "int(s)"
5000 loops, best of 5: 84.8 usec per loop

$ py -m timeit -s "s = '9' * 4300; i = int(s)" "str(i)"
1000 loops, best of 5: 271 usec per loop

How can an operation that can be performed well over a thousand times per second be considered a vector for "a DoS attack"?

But it's also so small that asymptotic behavior becomes increasingly irrelevant. In general, the fancier the algorithm, the higher the overheads to keep track of it all. That's why, e.g., nobody on Earth uses FFT-based integer multiplication until operands get "very" large.

For smaller inputs, I expect the code in Neil's PR gets more attractive, because it doesn't incur the overheads of building temporary lists. It builds a dict to memoize the powers actually needed, but that's O(log log n) space (tiny).
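For a feel of the shape of such code, here's a minimal divide-and-conquer int-to-decimal sketch with a memoized power table (illustrative only, not the code from Neil's PR; the base-case threshold in real code would be tuned):

```python
def int_to_decimal(n: int) -> str:
    """Recursive halving: split n at 10**half, caching the powers of 10
    (one entry per distinct split width, so only a handful in total)."""
    if n < 0:
        return "-" + int_to_decimal(-n)
    pows = {}

    def rec(m: int, width: int) -> str:
        # Render m (known to be < 10**width) as exactly `width` digits.
        if width <= 9:
            return f"{m:0{width}d}"
        half = (width + 1) // 2
        if half not in pows:
            pows[half] = 10 ** half
        hi, lo = divmod(m, pows[half])
        return rec(hi, width - half) + rec(lo, half)

    # bit_length // 3 + 2 digits always suffice, since log10(2) < 1/3.
    return rec(n, n.bit_length() // 3 + 2).lstrip("0") or "0"
```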

7 Likes

BTW, there's "a reason" for that: because it's restricted to base 10, it can compute base ** i as (5*2) ** i = (5**i) << i, and shifts are very much faster than multiplying bigger integers. There's no end to pursuing speed in these areas, so no matter what you do it's going to expand, and expand, and expand. Another reason to keep it, if at all possible, in Python instead of in C.
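The identity in one line, for anyone who wants to check it:

```python
# 10**i == (5**i) << i: the shift by i bits replaces a big-int multiply
# by 2**i, and shifting is much cheaper.
i = 10_000
assert (5 ** i) << i == 10 ** i
```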

That was the point of choosing such a limit.

Picture a CPU core handling N things per second, which can include a few megabytes of data that could be full of a thousand integer strings (e.g. JSON). It's still possible to get something to spin an entire CPU core for a second in an application like that, even with the 4300-digit limit in place. What isn't possible anymore is getting it to spend unbounded time by sending the same amount of data. With the mitigation in place, an attack now needs to scale the data volume linearly with the desired CPU wastage, just like other existing boring avenues of resource waste.

2 Likes

Thanks @oscarbenjamin for raising this issue on Discuss.

RedHat is a global company that was valued at USD $34 billion (with a B) when IBM purchased it three years ago. Let's not blame overworked volunteers here.

If this were a high priority, or even a medium priority, for RedHat, the CVE details would have been populated less tardily.

This vulnerability is in a class of attack that has been public knowledge for over a decade: pass a huge chunk of data to the application and try to DoS it or overflow a buffer. The specific vulnerability (quadratic behaviour of str ↔ int conversions) has been public for at least four years. RedHat reserved the CVE 30 months ago, and the Python security team has known about it for two years.

I don't buy the need to push out a fix to this without public discussion. This class of vulnerability was already public, and has been for years.

Even if it turns out that the chosen solution is the best, or only, solution, the secrecy and "trust us, we know best" from the security team was IMO not justified.

I feel that string hashing attacks are much more serious. Applications can, if they choose, perform length checks on data much more easily than they can recognise hash collisions. And yet we discussed hash randomization publicly in 2011 and 2012 before pushing it out. If we could discuss that publicly, why couldnā€™t we discuss this?

It seems that the threat here was in no way severe and urgent enough to justify the lack of community discussion.

22 Likes