Int/str conversions broken in latest Python bugfix releases

I think that the security risk here is probably a bit overblown, but leaving that aside there is a real process problem here: security apparently trumps compatibility to the point that even providing a reasonable alternative for a long-standing, very much core feature can be neglected. The compatibility policy (PEP 387) says:

  • Similarly a feature cannot be removed without notice between any two consecutive releases.

Note that that refers to minor releases, but here we are talking about patch releases. I realise that the policy also has an exception for things that are “insecure”, if granted by the SC, but I would expect the consideration of compatibility options for downstream to weigh much more heavily when all the other principles of the compatibility policy and process are being broken.

Presumably any codebase that would be happy with the change just pushed out would also be happy to just s/int/safe_int/. In fact they would be happier, because then they would know that they weren’t depending on some global flag for security. Note that the global flag isn’t even thread-safe, so it could be switched out by another thread.

On the other hand, a codebase that actually wants to use potentially large integers can’t do s/int/large_int/ because the large_int function hasn’t been provided. Some codebases, or combinations of codebases (i.e. many libraries as part of an application), will want to do both. If you want to parse large integers in some places and untrusted input in others, it isn’t possible to simply call safe_int here and large_int there, because the separate functions are not provided. The behaviour is only controllable by a global flag that isn’t even thread-safe:

import sys

def large_int(s: str) -> int:
    # This function is not thread-safe:
    # it could fail to parse a large integer if another thread
    # resets the flag mid-call, and it could undermine the
    # security of other threads while the limit is disabled.
    old_val = sys.get_int_max_str_digits()
    try:
        sys.set_int_max_str_digits(0)  # 0 disables the limit
        return int(s)
    finally:
        sys.set_int_max_str_digits(old_val)
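
To see concretely why this workaround is dangerous, here is a hypothetical sketch of the race (the names and sizes are made up, and whether any hostile parse actually slips through on a given run depends on thread scheduling):

import threading

hostile = '9' * 100_000   # "untrusted" input, far above the default limit

def untrusted_parser(slipped: list) -> None:
    # This code believes the global limit protects it.
    for _ in range(10_000):
        try:
            int(hostile)
            slipped.append(1)  # the limit was disabled when we ran
        except ValueError:
            pass

def trusted_parser() -> None:
    # Meanwhile another thread legitimately parses big integers,
    # disabling the global limit for the duration of each call.
    for _ in range(100):
        large_int('1' * 10_000)

slipped = []
t1 = threading.Thread(target=untrusted_parser, args=(slipped,))
t2 = threading.Thread(target=trusted_parser)
t1.start(); t2.start()
t1.join(); t2.join()
print(len(slipped), 'hostile parses slipped through the limit')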

By contrast, look how easy it is to write the safe_int function in any version of Python:

def safe_int(s: str) -> int:
    # Reject absurdly long inputs before paying the quadratic str -> int cost.
    if len(s) > 5400:
        raise ValueError(f'integer string too long ({len(s)} digits)')
    return int(s)
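
With both functions available, the choice would be local to each call site instead of living in process-wide state; a hypothetical sketch:

trusted = '1' * 100_000    # e.g. a big number from a trusted data file
untrusted = '9' * 100_000  # e.g. a field from a network request

n = large_int(trusted)      # large inputs deliberately allowed here
try:
    m = safe_int(untrusted)  # bounded here, independent of any global flag
except ValueError:
    m = None                 # reject, log, return an error, etc.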

Suppose I provide an implementation of a change-of-base algorithm that is sub-quadratic and has reasonable performance for the largest strings and integers up to some very large size. Would the security problem (at least at the core level) then be considered fixed in a better way? And what happens to this set/get API that has been introduced and can presumably never be removed again? Will the flag have to be respected forever because some code will come to depend on it as the way of limiting inputs, rather than actually writing secure parsing code?

For comparison, here is the performance of GMP on a single core of an ageing CPU (Intel(R) Core(TM) i5-3427U CPU @ 1.80GHz):

In [1]: import gmpy2

In [2]: s = '1' * 10**7

In [3]: %time i = gmpy2.mpz(s)
CPU times: user 928 ms, sys: 16 ms, total: 944 ms
Wall time: 942 ms

In [4]: s = '1' * 10**8

In [5]: %time i = gmpy2.mpz(s)
CPU times: user 14.9 s, sys: 88 ms, total: 15 s
Wall time: 14.9 s

In [6]: i.bit_length()
Out[6]: 332192807

Here I really have to get up to gigabyte-sized strings before I start to see run times of minutes:

In [10]: s = '1' * 10**9

In [11]: %time i = gmpy2.mpz(s)
CPU times: user 3min 43s, sys: 2.85 s, total: 3min 46s
Wall time: 3min 45s

Any larger than this and my poor little computer will just run out of RAM.

Part of the difference with GMP here is due to asymptotically faster arithmetic, but the biggest difference is just using a sub-quadratic algorithm for base conversion. A reference for such an algorithm was given in the issue: Richard P. Brent and Paul Zimmermann, Modern Computer Arithmetic. The algorithms given there for base conversion are algorithms 1.25 and 1.26 in section 1.7 (Base conversion). Here is a pure Python implementation of algorithm 1.25, which is not exactly complicated:

def parse_int(S: str, B: int = 10) -> int:
    """Parse string S as an integer in base B (2 <= B <= 36)."""
    m = len(S)
    # little-endian list of digit values (int(c, B) handles digits above 9)
    l = [int(c, B) for c in reversed(S)]
    b, k = B, m
    while k > 1:
        # combine adjacent pairs of base-b digits into base-b**2 digits
        last = [l[-1]] if k % 2 == 1 else []
        l = [lo + b*hi for lo, hi in zip(l[::2], l[1::2])]
        l.extend(last)
        b, k = b**2, (k + 1) // 2
    [l0] = l
    return l0
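
A quick sanity check against the built-in parser (the input is kept below the new default limit, so plain int still accepts it):

import random

s = ''.join(random.choices('0123456789', k=1000))
assert parse_int(s) == int(s)
assert parse_int('deadbeef', 16) == int('deadbeef', 16)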

This algorithm has complexity M(n/4)*log(n), where M(m) is the complexity of multiplying two m-bit integers. It’s not clear how much faster this would be for large inputs if implemented in C, since the runtime will be dominated by the large integer operations anyway. On this machine parse_int is faster than int for strings greater than around 1 megabyte, e.g. it can do a 10 megabyte string in 1 minute:

In [10]: %time i = parse_int('1' * 10**6)
CPU times: user 1.33 s, sys: 36 ms, total: 1.36 s
Wall time: 1.36 s

In [11]: %time i = parse_int('1' * 10**7)
CPU times: user 58.5 s, sys: 104 ms, total: 58.6 s
Wall time: 58.6 s

Here is int by comparison:

In [13]: %time i = int('1' * 10**6)
CPU times: user 7.24 s, sys: 20 ms, total: 7.26 s
Wall time: 7.26 s

In [14]: %time i = int('1' * 10**7) # killed after 5 minutes
Terminated

This parse_int function is clearly a lot slower than GMP, but asymptotically it will still be a lot faster than int. It’s possible that it could be improved with micro-optimisations, that a better algorithm could be used, or that implementing it in C would make it some amount faster.
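
As one example of the kind of micro-optimisation available in pure Python: the doubling loop above starts from single decimal digits, but it could start from limbs of several digits so that there are fewer rounds. A hypothetical decimal-only sketch (the name and chunk size are made up):

def parse_int_chunked(S: str, chunk: int = 9) -> int:
    """Like parse_int but starting from base 10**chunk limbs (decimal only)."""
    first = len(S) % chunk
    # Split from the left so only the most significant piece can be short.
    pieces = ([S[:first]] if first else []) + \
             [S[i:i + chunk] for i in range(first, len(S), chunk)]
    l = [int(p) for p in reversed(pieces)]  # little-endian limbs, base 10**chunk
    b, k = 10**chunk, len(l)
    while k > 1:
        last = [l[-1]] if k % 2 == 1 else []
        l = [lo + b*hi for lo, hi in zip(l[::2], l[1::2])]
        l.extend(last)
        b, k = b**2, (k + 1) // 2
    return l[0]

assert parse_int_chunked('1' * 1000) == int('1' * 1000)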
