Int/str conversions broken in latest Python bugfix releases

steve.dower · September 14, 2022, 1:53pm

I think you’re missing the point. We test for how much code is going to break on regular inputs, not malicious inputs. Code that was not being attacked and did not break continues to not break when it’s not being attacked.

Code that is being attacked will now break, whereas before it would simply eat up all your CPU time.

Code that previously worked and now breaks even though it’s not under attack is what we want to avoid. And we had the opportunity to build Python with various limits and run a huge range of code (public and private) through its paces to see if it was impacted.

Virtually nobody was testing for malicious inputs before, and because the mitigation is global, they don’t have to. If we pushed every library to implement their own mitigations, then everyone would be adding tests now to make sure that their own protections work, and users would have basically no idea whether the code they were using was protected or not.

h-vetinari · September 14, 2022, 1:59pm

Thanks for your responses!

Let me say first off that I have huge appreciation for the work of the security team. It’s a task with insane amount of responsibility/pressure, and not a lot of reward besides people complaining one way or another. All that said, your disagreement is unsurprising, because you’re so very close to the fire (which, OTOH makes you uniquely more qualified to comment than 99.9% of others here).

Let me rephrase my point thusly:

the impact of a given security-relevant topic is perceived greatly amplified within security circles, which – not least due to disclosure concerns – tends to operate in a sort of bubble where this perception can further feed back on itself.
people outside the security sphere have a different set of priorities (partially because the blood, sweat & tears of the security folks allows them to remain ignorant of problems), where not every security issue makes the world tumble from its axis and accepting breakage for a somewhat theoretical security concern is a tough pill to swallow^[1].

I don’t propose to have an answer that’s obviously better than the others in balancing these very divergent needs, but I wanted to point out that the SC is not bound by “we cannot backport a feature to stable” – it’s a choice they have (and one that should have been discussed more publicly IMO), all the more so if they can at the same time decide to backport the security mitigation (and breakage that implies) to all stable releases.

With the caveat that this comes from the luxury of being only in the peanut gallery, I still think the reaction was overblown, not transparent enough, and breaking lots of people’s codes, projects, teaching examples, etc. was not worth it, or would have deserved way more public discourse, as well as advance communication.

all the more so since DoS attacks aren’t nearly as critical as divulging sensitive information or remote privilege escalation / code execution. ↩︎

swift · September 14, 2022, 3:14pm

I wonder if decimal is affected by this change? And if not, could one not simply use decimal initially but return an int?

davidism · September 14, 2022, 3:19pm

Other security issues with YAML aren’t relevant to the discussion about int/str conversion. It was used as an example of what is affected, not as an example of something that is completely secure in other ways.

tim.one · September 14, 2022, 5:23pm

No, it’s not. decimal_str <-> decimal.Decimal conversion are linear-time (in both directions).

You could do that - but it wouldn’t help . It’s not the “string” part that’s the real problem, but instead converting between any representations whatsoever based on a power-of-2 base and a power-of-10 base. Your last int(decimal.Decimal) part becomes the quadratic-time part:

$ py -m timeit -s "import decimal; s = '9' * 10_000" "decimal.Decimal(s)"
1000 loops, best of 5: 283 usec per loop

# make input 10x longer, conversion takes about 10x longer (linear time)
$ py -m timeit -s "import decimal; s = '9' * 100_000" "decimal.Decimal(s)"
100 loops, best of 5: 2.82 msec per loop

# but if we go on to convert to int, about 100x longer (quadratic time)
$ py -m timeit -s "import decimal; s = '9' * 10_000" "int(decimal.Decimal(s))"
50 loops, best of 5: 8.84 msec per loop

$ py -m timeit -s "import decimal; s = '9' * 100_000" "int(decimal.Decimal(s))"
1 loop, best of 5: 854 msec per loop

If there’s an app out there that converts strings to decimal.Decimal first, and then to int, the current mitigation gimmick can’t stop a “DoS attack” based on that.

By far the most efficient way to convert a decimal string to a CPython int today is to install the gmpy2 extension (Python bindings for GMP), and use the latter to convert to GMP’s power-of-2 based internal representation first.

$ py -m timeit -s "import gmpy2; s = '9' * 10_000" "int(gmpy2.mpz(s))"
5000 loops, best of 5: 77.9 usec per loop

# Way sub-quadratic, but also super-linear.
# No linear time algorithm is known (or expected).
$ py -m timeit -s "import gmpy2; s = '9' * 100_000" "int(gmpy2.mpz(s))"
200 loops, best of 5: 1.84 msec per loop

lgelbmann · September 14, 2022, 5:23pm

Is this class of attacks (targeting int → str conversions) common enough and severe enough to justify a breaking change in such a core part of the Python API?

Was it also considered that this change enables a new class of attacks? For example, if the very large value is being logged on other users’ requests with logging.info(f"Value: {value}"), the application is now unusable instead of just slow.

int → str conversions are not mentioned in CVE-2020-10735 at all. I feel it’s unusual that a security fix goes this far beyond the scope of the associated CVE, so I would greatly appreciate if you could go into more detail about the reasoning behind this change. int → str conversions are mentioned on the PR, so I trust there was plenty of discussion around this. I do recognize that this was a hard situation with no easy answers for the Python Security Response Team, and I may be missing something here.

davidism · September 14, 2022, 5:33pm

justify a breaking change in such a core part of the Python API?

It sounds like the SC and security team did in fact consider quite a lot before doing so, as explained multiple times in this thread. More and more people asking variations on this same question aren’t suddenly going to get a different answer.

Keep the discussion focused on improving int/str conversion, not on questioning the motives of the maintainers.

if the very large value is being logged on requests, the application is now unusable instead of just slow.

Web applications do not crash on unhandled exceptions, they catch them and return 500 Internal Server Error pages. A 500 error is not a DoS attack.

lgelbmann · September 14, 2022, 5:36pm

That’s why I specified “other users’ requests”. If a large value provided by user A is being logged on a request by user B, user B gets a 500 error.

davidism · September 14, 2022, 5:38pm

A 500 error, wherever it’s raised, is not a DoS attack and is not relevant to the issue of int/str conversion speed being discussed here.

Rosuav · September 14, 2022, 5:40pm

If one user can cause another user’s requests to fail, isn’t that an attack? It denies an unrelated user the ability to use the service.

davidism · September 14, 2022, 5:46pm

I’d be really hard pressed to come up with an actual example of one user’s untrusted int affecting another user. The theoretical logging issue described just isn’t really how things work.

Regardless, that sort of “denial” is not really what “denial of service attack” is about. That’s something that can also easily be addressed by not doing that sort of deferred logging, while int/str conversion is pervasive throughout web libraries.

fungi · September 14, 2022, 5:49pm

As for trying to avoid the problem by using GMP, that comes with its
own risks. This was brought up last week on the oss-security mailing
list as a reachable crash condition from the Python process
(originally reported as impacting the sagemath project):

github.com/aleaxit/gmpy

Overflows causing `gmp: overflow in mpz type Aborted` instead of Python `OverflowError`

opened 08:52AM - 29 Oct 20 UTC

johanbluecreek

I found that I can cause overflows in the following way ``` $ ipython3 In […1]: from gmpy2 import mpz In [2]: mpz(8800501504522337) << 16907148584713995 gmp: overflow in mpz type Aborted $ ``` Working in Python, I would expect the behavior to be to raise a Python `OverflowError` that I can try/except for, instead everything is just aborted and I'm thrown out to the shell: ``` $ ipython3 In [1]: from gmpy2 import mpz In [2]: try: ...: mpz(8800501504522337) << 16907148584713995 ...: except: ...: print('anything') ...: gmp: overflow in mpz type Aborted $ ``` If I'm wrong on the expected behavior or if there is a way of handling this in Python, such comments would be appreciated too. Just for context (if anyone finds this problem in a similar way to how I did): I'm playing around with using SymPy for Genetic Programming, and sometimes I generate large expressions, that is where I noticed the above. If someone ever gets a similar problem, a work-around is to set the `MPMATH_NOGMPY` environment variable to have `mpmath` avoid `gmpy` that will make `mpmath` raise `MemoryError` instead that one can handle in Python: ``` $ MPMATH_NOGMPY=1 ipython3 In [1]: from sympy import exp In [2]: try: ...: exp(exp(exp(37.0))) ...: except MemoryError: ...: print('anything') ...: anything In [3]: ``` because otherwise the problem above arises: ``` $ ipython3 In [1]: from sympy import exp In [2]: try: ...: exp(exp(exp(37.0))) ...: except: ...: print('anything') ...: gmp: overflow in mpz type Aborted $ ```

https://trac.sagemath.org/ticket/34492

So far the only feedback is that it’s an architectural problem with
the library, which seems (to me anyway) unlikely to change any time
in the near future since it’s already been open for a couple of
years.

Rosuav · September 14, 2022, 5:50pm

Okay then. Please explain to me what a denial of service attack IS about, and what we’re actually trying to protect against here. Wikipedia’s definition talks about making something “unavailable to its intended users”, and my understanding is that causing someone else’s pages to fail to render counts as making the service unavailable, so I’m curious what you define it as.

As to it being hard to come up with an example, that’s true of a lot of vulnerabilities… right up until there’s a known attack.

davidism · September 14, 2022, 5:51pm

In this case, and most uses of the term “DoS”, it’s referring to the denial to all users through the exhaustion of resources. In this case, tying up the CPU time spent converting input or rendering output, so that CPU time isn’t available to handle other requests.

As to it being hard to come up with an example, that’s true of a lot of vulnerabilities… right up until there’s a known attack.

I was referring to the theoretical logging issue they proposed, which does not appear to be a real issue. Coming up with theoretical examples of other attacks is not the topic of this thread, and if they believe there’s a security issue they should responsibly report it to the security list.

tim.one · September 14, 2022, 5:59pm

Nobody installs GMP for “security reasons” . Its overriding goal is to be as fast as possible, part of which is not even bothering to try to continue if a memory-allocation function fails. It’s been like that forever, and essentially remains so even if you supply your own memory-allocation functions (GMP will ignore any “error-value” return they may deliver).

More here.

tim.one · September 14, 2022, 7:11pm

People have been extraordinarily reluctant to quantify this in any way. So here’s a guess at what they have in mind: you have a server that does limit the amount of data it will accept from a client. Say the limit is B bytes. Clients can use those bytes in any way to construct decimal strings that an app will convert to ints. If there’s a limit L on the length of a decimal string int() will accept, and converting a string of that length takes S seconds, then they can construct B/L strings with a total conversion time of B/L*S seconds.

If the limit were instead boosted to r*L, for some r > 1, then the conversion time per string of that larger length would be r**2 * S (conversion time is quadratic in the string length), and total conversion time would increase to

B/(r*L) * r**2 * S =
B/L * r * S

which is r times longer.

Presumably the server also imposes a limit on total seconds used substantially higher than B/L*S. So boosting the int() limit by a factor of r can allow an app, with the same amount of input data, to consume as much as r times as many seconds before hitting the time limit.

It’s not a direct effect on any other process running, but is instead, as the CVE says, more a matter of overall “system availability”. The more cycles a process can burn before hitting some enforced limit, there are that many fewer cycles available for other processes to use.

davidism · September 14, 2022, 7:18pm

We’re on the same page. I was replying to the theoretical “logging one user’s input during another user’s request” tangent that came up. Your explanation is what I was also saying in my last post as well.

gpshead · September 14, 2022, 8:50pm

As a general note: The tone being taken by some in this now-long thread is causing many whom you might wish to be engaged to tune out, unsubscribe and/or mute it. People are re-stating things that have already been covered one or more times up-thread and re-asking the same questions. I encourage everyone to spend time reading and digesting the history here. Repetition will not change our answers.

The most constructive thing to do at this point is work on improvements so that:

3.12+ doesn’t run into such issues so easily
(ex: algorithmic improvements so we could raise the default limit?)
Users hitting the problem can easily discover how to work around and if that is appropriate for them.
(ex: exception text & documentation improvements - improving the classroom and REPL experience)

iritkatriel · September 14, 2022, 9:07pm

Another useful contribution would be a balanced and concise summary of this discussion, explaining the issues and the various opinions. That would help engage people who don’t want to catch up on a 120-message conversation.

Would anyone who has been following be up for writing that? Or perhaps a joint writeup by two or more people with opposing views?

nas · September 14, 2022, 9:59pm

Yeah, a short summary would be useful. Maybe a FAQ:

why was it required to be changed in a bugfix release?
why not instead of changing int(), add limited_int()?
why choose 4300 as default limit?
that limit seems too low, why not something higher?
why global interpreter setting rather than something else?
why not keyword parameter of int?
can’t we just fix the algorithms to be sub-quadratic?
can’t we just fix the software calling int() and formatting long integers?

I’m not sure I know all the answers (or questions) but I could make a try at writing a FAQ.

Edit, additional questions:

why was the non-public process used to develop the fix, rather than an open one?
why was the issue considered urgent now since it was originally reported years ago?