Int/str conversions broken in latest Python bugfix releases

Do they have this “vulnerability”?

If you do write up a retrospective, maybe we should go all the way back to Python 2.2 or 2.3 when we integrated int and long. At the time that seemed like a great idea, and speaking personally, not having to worry about large ints overflowing has been really convenient.

But convenience is often the enemy of correctness, and I wonder whether this was a mistake. The integration of int/long enables this vulnerability. You can’t fit a 1000-digit (Python) long int in a C 64-bit int.

Although I guess JSON parsers have the opposite problem to deal with in languages with fixed-width ints. What if you pass a 21-digit numeric string to a parser expecting to convert it to a 64-bit int?

Guessing further, I suppose that’s no different or worse than expecting a numeric string but receiving non-numeric characters.

This leads to a thought… perhaps another approach might be to abuse __future__ (past?):


from __future__ import safe_integers

And now ints in that module are limited to 64-bit, without affecting other parts of the code.

Web-facing code could add that one line to each of their modules. Or flip it: make safe 64-bit ints the default, and allow individual modules to opt out with from __future__ import big_integers.

Or maybe monkey-patching builtins?


builtins.int = safe_int

which would still allow code to explicitly use bigint when needed.
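
For concreteness, here is a minimal sketch of what I have in mind; nothing like this exists in the stdlib, and the name safe_int, the 64-bit bound, and the patching are all made up for illustration:

import builtins

_INT64_MIN, _INT64_MAX = -2**63, 2**63 - 1

class safe_int(int):
    """Hypothetical bounded int: behaves like int but rejects values
    outside the signed 64-bit range at construction time."""
    def __new__(cls, *args, **kwargs):
        result = super().__new__(cls, *args, **kwargs)
        if not _INT64_MIN <= result <= _INT64_MAX:
            raise OverflowError("safe_int is limited to the 64-bit range")
        return result

bigint = builtins.int    # keep a handle on the real type for code that needs it
builtins.int = safe_int  # the monkey-patch from above
# (Literals and arithmetic results are untouched, which is one reason
# this is probably impractical.)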

I dunno, these ideas are probably impractical, and maybe the solution the team came up with is the least worst. With no visibility into what happened, we have to trust you, but with the lack of visibility, it’s kind of hard to do that :frowning:

Especially after the last five or six years. Trust in authority is running pretty low world-wide right now, and some of that is going to rub off.

This is where I cannot help but feel disappointed: the security team has known about this vuln for two years. We’ve only known about it for a few weeks. Who knows what solutions we might have come up with that would be better for everyone?

14 Likes

I’m getting the impression that there is no maximum anyone will commit to - whatever you pick, you can hypothesize a number of determined attackers sufficient to slow down a server farm, and regardless too of how fast base conversions may become (as I’ve mentioned before, there are no linear-time algorithms known, not even “in theory”).

4300 appears to be just “as small as possible without breaking a number of very popular Python ecosystem unit tests - if it weren’t for numpy, it would have been even smaller”.

In the past, I’ve most often been tasked to chip in on “DoS attacks” involving stdlib regexps. Typical: a naively coded stdlib regexp that takes literally hours to fail to match a short (fewer than 70 characters) user-supplied string. Even I :wink: consider that to be a real problem.

That’s fine. It’s always been possible, so far, to rewrite them to reliably fail quickly in non-matching cases.
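
A toy illustration of the kind of pattern I mean (not an actual stdlib regexp): nested quantifiers backtrack exponentially on a short non-matching string, while an equivalent rewrite fails immediately:

import re
import time

naive = re.compile(r"^(a+)+$")   # nested quantifiers: catastrophic backtracking on failure
fixed = re.compile(r"^a+$")      # matches the same strings, fails fast

s = "a" * 25 + "b"               # 26 characters, matches neither pattern

t = time.perf_counter()
naive.match(s)                   # takes seconds, roughly doubling per extra 'a'
print("naive:", time.perf_counter() - t)

t = time.perf_counter()
fixed.match(s)                   # microseconds
print("rewritten:", time.perf_counter() - t)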

But nobody I know ever gave a thought to the other direction: what if there’s a stdlib regexp that takes, say, several milliseconds to match a specific relatively short user-supplied string? Is that the basis for a “DoS attack” too?

If not, why not? That’s a substantially longer time than it takes to convert a 20-thousand digit string to int today, yet we put a limit of 4300 digits on that operation. The time to convert 20,000 digits is roughly (20000/4300)**2 ~= 21.6x longer than the time limit we (indirectly) imposed today on str->int conversion. Yet nobody would care if it were a fast-but-not-instantaneous regexp match instead?

If that becomes called a “DoS vulnerability” too, we’re hosed.
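
For anyone who wants to sanity-check the arithmetic above, here is a rough timing sketch; it assumes an interpreter new enough to have sys.set_int_max_str_digits (3.11+, or the patched bugfix releases):

import sys
import time

sys.set_int_max_str_digits(0)    # 0 disables the limit for this process

def str_to_int_time(ndigits):
    s = "9" * ndigits
    t = time.perf_counter()
    int(s)
    return time.perf_counter() - t

t_limit = str_to_int_time(4_300)
t_big = str_to_int_time(20_000)
print(t_big / t_limit)           # roughly (20000/4300)**2 ~= 21.6 if conversion is quadratic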

BTW, I claim some credit for founding the “regexp DoS” industry. Back when researchers were all agog about dict key-collision vulnerabilities, I annoyed one by asking why they didn’t do something useful instead, like probing software for bad-behaving regexp parsing of user inputs. The rest is history :wink:.

18 Likes

Wow, that’s extraordinarily arbitrary! Is there proof of this?

5 Likes

Chesterton’s fence analogy applies when removing restrictions, not adding them. We’re not talking about removing a restriction without understanding why it is there. We’re talking about adding a new fence.

So enlighten us. Why is a patch to the JSON parser not sufficient mitigation at least for a first attempt, to buy time while we discuss more fundamental changes?

If you are thinking that there may be some uses of JSON that require the ability to read thousand digit ints, you’re surely right – but they’ve been broken by this patch too.

We aren’t obliged (nor is it possible) to fix everything all at once. Suppose we had decided that the most critical potential or actual attack is via JSON. We can patch that quickly, making it the default, and eliminate 90% of the attack surface without penalizing those who aren’t exposed to this vulnerability and have nothing to gain from the fix.

JSON users who only use data from trusted sources (e.g. not the public internet) could opt out.

And that would allow time to give other stakeholders (like SymPy) a say in any additional changes.
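
For what it’s worth, applications can already approximate that opt-in/opt-out at the JSON level today, using the parse_int hook that json.loads accepts; the 1000-digit cutoff here is just an example value:

import json

def bounded_int(literal, max_digits=1000):
    # parse_int receives the raw text of each JSON integer literal.
    if len(literal) > max_digits:
        raise ValueError(f"integer literal longer than {max_digits} digits")
    return int(literal)

data = json.loads('{"n": 12345}', parse_int=bounded_int)   # fine
# json.loads('{"n": ' + "9" * 5000 + '}', parse_int=bounded_int) would raise ValueError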

What other attack vectors are there?

As I see it, there is no way for an attacker to present a Python application with a huge int (say, a thousand digits) without passing it in as a string. There’s no way to pass an int directly into a web application except via text, or maybe bytes, right?

So unless the attacker manages to pass in a thousand-character string and have that converted to an int, there is no way for them to apply the “int->str is quadratic” part of the attack.

If my reasoning is correct, we don’t need to care about int->str, since we can block the attack at the str->int surface. At least that should be sufficient for a first patch, to buy time to discuss the issue. Stop attackers from passing in huge ints, and you stop them from being able to use huge ints to DoS the app.

I may be wrong, but perhaps all we really needed as an initial emergency patch is to restrict int(digits) to, say, 1000 digits by default, which would buy us time to discuss more fundamental fixes, and not worry about all the int->str conversions.

We could have a separate function that we could use explicitly:


# Libraries that need to parse huge strings into int can do this.
try:
    from unrestricted import parsebigint
except ImportError:
    parsebigint = int

num = parsebigint(digits)

9 Likes

How does Ruby mitigate this attack?

Python 3.10.7 exists in the world. The fence is there.

JSON is very popular, but what about other formats? E.g. the tomli parser allows long integers too. It’s less likely to be parsing malicious input, but it should probably be fixed just like json. Then, think about where else ints get parsed from text into number objects. E.g. the Content-Length header in HTTP. For that case, you are likely saved by other things, e.g. the max size of the HTTP request. I haven’t thought hard about this and these are just examples that come to mind right now. There are many things besides the JSON parser that could be affected. Grab some modules from PyPI and search for calls to int() maybe. How many of those might take data from untrusted places? How many need to parse integers longer than 4000 digits?
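
A quick-and-dirty way to do that kind of survey is with the ast module; the directory name below is just a placeholder:

import ast
import pathlib

for path in pathlib.Path("some_package").rglob("*.py"):
    tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
    for node in ast.walk(tree):
        # Flag every bare call to the builtin name int() for manual review.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "int"):
            print(f"{path}:{node.lineno}: call to int()")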

Personally, I’m not very concerned about DoS attacks. For any non-trivial piece of software it seems to me that it is not hard to provoke them. Mitigation would be better done at high level, e.g. memory and CPU time limits for processes or containers. Or some pre-checks on untrusted input data. Most software doesn’t run with those kinds of limits in place though.
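
On POSIX systems, the sort of process-level mitigation I mean can be as simple as a CPU-time cap via the resource module; the 30-second figure is only an example:

import resource

# Lower the soft CPU-time limit for this process; the kernel sends SIGXCPU
# once the process has consumed that much CPU time.
soft, hard = resource.getrlimit(resource.RLIMIT_CPU)
resource.setrlimit(resource.RLIMIT_CPU, (30, hard))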

Chesterton’s fence doesn’t mean you can’t argue that the fence is not needed. However, you should first try to understand why the people who erected it did so.

I’m hoping with a bit more optimization and better algorithms, we can remove the limit or set it higher by default. At minimum, we should make it easier yet to remove (e.g. with a context manager) so that libraries like sympy could avoid running into it.
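
Something along these lines, assuming the sys.get_int_max_str_digits / sys.set_int_max_str_digits functions that ship with the patched releases (note the setting is interpreter-wide, so this sketch is not thread-safe):

import sys
from contextlib import contextmanager

@contextmanager
def unlimited_int_digits():
    old = sys.get_int_max_str_digits()
    sys.set_int_max_str_digits(0)        # 0 disables the limit
    try:
        yield
    finally:
        sys.set_int_max_str_digits(old)

with unlimited_int_digits():
    big = int("9" * 10_000)              # would raise ValueError outside the block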

7 Likes

Most of these cases are ones where you are parsing a number of items (e.g. into an int64) or a double. As such, it seems like the correct solution is for Python to have an int64 function and a float function, and tell libraries to use those where appropriate, rather than breaking int in a patch release.
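
A sketch of what such a function might look like; the name int64 and its exact behaviour are assumptions here, not an existing stdlib API:

_INT64_MIN, _INT64_MAX = -2**63, 2**63 - 1

def int64(text):
    # Assumes a decimal string; reject absurdly long inputs before converting.
    if len(text.strip().lstrip("+-")) > 19:      # 2**63 - 1 has 19 digits
        raise ValueError("too many digits for a 64-bit integer")
    n = int(text)
    if not _INT64_MIN <= n <= _INT64_MAX:
        raise OverflowError("value out of 64-bit range")
    return n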

5 Likes

We did.

Upon investigation, we determined that the proper place was not the one place where it was reported multiple times (json). The problem is more fundamental: we found uses of int->str and str->int all over, in all sorts of libraries both in the stdlib and externally, that may be processing untrusted data without any guarantee of constrained input length.

>>> "proper place" != "where it doesn't inconvenience me"
True
3 Likes
>>> "proper place" != "where it doesn't inconvenience me"
True

Agreed, that’s why it’s such a shame that it was fixed in the “convenient” place of breaking the int function. Just because there are a lot of libraries that are parsing ints that they intend to be small doesn’t mean that the int function has a bug. It means that any code that expects a small integer and calls int has a bug. It’s not convenient, but that is the place where user intention doesn’t match the code they wrote. The solution is a short_int function for these libraries to call. It won’t be convenient, but it is (IMO) the proper place for the fix.

12 Likes

How about an online calculator that shows exact integers, and you enter “10^100000000”?

I don’t know whether/how it does. But it seems pretty fast:

require 'json'
puts RUBY_VERSION

s = '1' * 100_000_000
3.times do
  t = Time.now
  JSON.parse(s)
  p Time.now - t
end

Output on TIO.run:

2.5.5
17.198645811
17.403922713
17.078832864
2 Likes

Disclaimer: I’ve only been skimming this discussion, but this thought jumped out at me.

This actually suggests to me a possible compromise that could have legs: reintroduce the old long name for int-with-no-limit. Everything can go on using int as it currently does, but those places that really do need to not be limited can use long to signal that fact.

8 Likes

For the record, timing Julia here on my computer (Ryzen 3600 CPU), which just calls out to GMP:

julia> s="1"^100_000_000;
julia> @time parse(BigInt, s);
  7.546348 seconds (137.31 k allocations: 2.772 GiB, 0.29% gc time)

so my guess is this is just Ruby calling out to GMP on a slower CPU.

4 Likes

Seems likely. Adding “puts Integer::GMP_VERSION” to Stefan’s code, on the server he linked to, displays “GMP 6.1.2”.

Such a calculator would almost certainly be programmed to use the decimal module :smiley:.

3 Likes

I don’t know enough about the vulnerability to test it. Maybe my computer’s too fast to see it, but x = json.loads("9" * 50000) is practically instant. Do we have actual demonstration code to play with?

Cranking the length up suggests that MAYBE there’s a problem around the million-digit mark. Here’s Python and Pike on the same CPU:

>>> start = time.time(); x = json.loads("9" * 1_000_000); print(time.time() - start); x = json.dumps(x); print(time.time() - start)
3.963656425476074
14.79038691520691
> gauge {x = Standards.JSON.decode("9" * 1000000);};
(10) Result: 0.05868104
> gauge {x = Standards.JSON.encode(x);};
(11) Result: 0.106752528

With Pike, it starts to drag when I get to ten million or a hundred million, and yes, it’s superlinear time complexity (half a second for 10M, ten seconds for 100M). But I’m not sure how to validly test for this as a vulnerability. If a remote attacker is sending me 10MB of JSON data to parse, I’m going to be feeling the squeeze from having to transfer that data around long before the CPU cost is a problem. Maybe the one-million figure from Python could be a problem, but 4300 digits is hardly the limiting factor.

6 Likes

I feel like there are two problems being conflated here.

  1. Solution to the security issue having undesirable effects.
  2. Process used to determine what that solution would look like being too opaque.

It may be helpful for the discourse to separate these. On the other hand, one plays into the other.

Maybe, if folks could see some record of the discussion that led to this solution, they would be more inclined to trust it? I suppose there is an archive somewhere.

3 Likes

I like the change because it makes Python safer. I have only one question: where does the value of 4300 maximum digits come from? Why 4300 and not 4301?

Just for awareness, this has also broken a number of problem sets for students. For example, “What is the leading / trailing / 500th digit of 4^(5^6)? How many times does each digit appear?”. Students used to be able to solve this by doing something like:

n = 4 ** (5 ** 6)
s = str(n)
print(s[0], s[-1], s[-500])

from collections import Counter
print(Counter(s))

Of course, the second line now raises a ValueError since our cloud learning environments have all updated to the latest patch release.
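
For reference, str(n) trips the limit because 4 ** (5 ** 6) has far more than 4300 digits:

import math

# Number of decimal digits in 4 ** (5 ** 6): floor(5**6 * log10(4)) + 1
print(math.floor(5**6 * math.log10(4)) + 1)   # 9408, well past the 4300 limit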

11 Likes