4,300 happens to be small enough that int(s) completes in under 100 usec on modern hardware, and large enough to pass the unit tests of popular projects that were in our internal test systems. My initial value was 2,000, which is easier to remember and more than large enough for a 4k RSA key in decimal notation. We later bumped it up to 4,300 because the test suites of some popular Python projects like NumPy were failing with 2,000.
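(On a fixed release you can inspect the shipped default directly; sys.int_info grew a default_max_str_digits field as part of the fix, and sys.get_int_max_str_digits() reports the limit currently in effect:)

import sys

print(sys.int_info.default_max_str_digits)  # 4300 on fixed releases
print(sys.get_int_max_str_digits())         # the per-interpreter limit in effect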
Neither 10 nor 100000000 is big enough to be affected by this fix. They are not even close to the number of digits where this fix kicks in, but computing 10**100000000 alone takes over five minutes on my PC, so that's already a DoS.
If you think that's a problem, what if they enter "4^4^4^4"? The answer has approximately 8.1e153 digits.
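(For reference: ^ is right-associative, so that's 4 ** (4 ** (4 ** 4)), and the digit count can be estimated with floating-point logs instead of building the number:)

import math

e = 4 ** (4 ** 4)           # the exponent: 4**256, roughly 1.34e154
digits = e * math.log10(4)  # decimal digits of 4**e, as a float estimate
print(f"{digits:.2e}")      # about 8.1e153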
On-line calculators are faced with challenges beyond merely str/int conversion, and need their own solutions.
I have probably typed variations of the following code hundreds of times, never even considering that this might cause a DoS attack:
try:
    x = int(request.args["foo"])
except (ValueError, KeyError):
    raise BadRequest(...)
I very much welcome this fix.
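(Before the fix, the defensive version of that snippet had to bound the input length by hand; a minimal sketch, reusing the hypothetical request/BadRequest names from above:)

MAX_INT_DIGITS = 100  # application-specific; pick whatever your inputs actually need

try:
    s = request.args["foo"]
    if len(s) > MAX_INT_DIGITS:
        raise BadRequest(...)
    x = int(s)
except (ValueError, KeyError):
    raise BadRequest(...)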
On my computer, it takes about a million digits for that to become measurably slow (although I'd accept the argument that 100K digits would be slower on a busier or lower-end system). Do you have any sort of request size limit?
TBH I think the limit is extremely aggressively chosen, and could be a lot more generous without causing major problems.
Many server machines are actually running less powerful hardware than your dev machine, and are always handling more requests in parallel. A long enough int to stretch a ~10ms request into a ~100ms request can reduce throughput for all users to 10% [1], possibly more if you're also logging details about failed requests (which presumably will be the result of such a request).
This may be true, but as we've seen from the examples above, the people who are unhappy about the limit won't be happy until it's completely gone. So it may as well be low enough that they'll discover and disable it for their apps, and those of us who mostly live in the "ints never exceed 64 bits" world dominated by other languages don't have any worries at all about users exceeding reasonable limits.
[1] Edit: I made the "10%" up. Let's just go with "drastically", because I can't be bothered doing the actual throughput calculation. Certainly one request taking 10x longer on the CPU is going to use up resources that are needed to process other requests, and cause congestion on the server.
That's true, but how long IS long enough to stretch a request until it becomes a problem?
For the record, I haven't lived in the "ints never exceed 64 bits" world ... ever. I grew up with ints being 16 bits, and then met REXX, where numbers have whatever precision you choose (kinda like the Decimal module in Python), before ever getting to 32-bit ints. So when Python showed me arbitrary-precision ints, I was right at home. You're absolutely right; any limit will cause problems for at least some apps. But for a change that's being rolled out in point releases, minimizing the damage should surely be a priority.
Maybe "10^1000000" then: computation takes me 0.4 seconds and conversion to str takes me 15 seconds.
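(In the same timeit style as the other measurements in this thread, the two costs can be separated; absolute numbers will vary by machine, and on a patched interpreter the limit has to be lifted with -X int_max_str_digits=0 for the str() call to run at all:)

$ py -X int_max_str_digits=0 -m timeit -n 1 -r 1 "x = 10**1000000"
$ py -X int_max_str_digits=0 -m timeit -n 1 -r 1 -s "x = 10**1000000" "str(x)"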
I didn't think anything about it other than to show a counterexample to the "there is no way for them to apply the 'int->str is quadratic' part of the attack" claim that I replied to.
I haven't seen a DoS-mitigation rationale quite like the current one - like you, I was looking for some reason to believe that an individual conversion near the limit was waaay slow, but that's not the issue here. On the currently released CPython, on my pretty capable desktop box:
$ py -m timeit -s "s = '9' * 50000" "int(s)"
20 loops, best of 5: 10.6 msec per loop
So even now it appears instantaneous to a human eye, with an input string more than 10x longer than the new limit.
But they don't care about that. Instead there's some argument about "well, what if they have megabytes worth of strings of this size, and go through all of them as fast as possible? No matter the speed of an individual conversion, the lower the limit on an individual string, the smaller the aggregate time they can consume for going through all of a fixed overall total size of input strings, because conversion per string takes time worse than linear in the length of the string".
If this is to be consistently applied, then essentially any operation a server farm can be talked into doing, that can consume a few milliseconds given an input of no more than several thousand bytes, is equally a "DoS vulnerability".
For example, this takes more than twice as long as the conversion above even though the input string is 10 times shorter:
$ py -m timeit -s "import re; s = 'x' * 5000" "re.search('x*y$', s)"
10 loops, best of 5: 24 msec per loop
How can that not be a DoS vulnerability if converting a merely 5,000-digit string to int (which goes a couple hundred times faster) is?
I understand that users generally don't get to supply the regexps an app uses for parsing. But they do supply the strings, and genuine expertise in writing regexps for an re engine like Python's is rare. I've generally been the one to fix stdlib regexps that display exponential-time behavior in bad cases, but best I know nobody has ever even tried to probe for quadratic-time cases. Like converting a 50,000 digit string to int, one instance goes so fast nobody cares. Well, until now.
None of these operate in isolation, though. If an attacker is forced to send 100x the requests, that will be detected (and possibly handled) by other parts of the system, whereas one request that's 100x the size will likely pass through as "normal" traffic.
Individual requests are also the granularity. With 100 x 10ms requests, other users get an opportunity every 10ms to get a request in. With 1 x 1000ms request, users have to wait over a second. So at least for this kind of DoS attack, we can make a big difference to those who are doing a fixed number of parses per request (e.g. Sebastian's example above).
Those who allow an unlimited amount of data to be received and processed synchronously are in just as much trouble as they were before.
Steve, I should have spelled this out: the regexp example I gave takes time quadratic in the length of the user-supplied string. The same O() behavior as int(user_supplied_str), although the regexp spelling has a much higher constant factor. There is no DoS argument you can make about one that can't be made about the other, because they're both ways of applying standard Python operations to user-supplied strings, with the same asymptotic behavior in the length of the strings.
# try at 5,000
$ py -m timeit -s "s = '9' * 5000" "int(s)"
2000 loops, best of 5: 112 usec per loop
$ py -m timeit -s "import re; s = '9' * 5000" "re.search('9*8$', s)"
10 loops, best of 5: 24 msec per loop
# try at 50,000 - about 10**2 = 100 times slower
$ py -m timeit -s "s = '9' * 50000" "int(s)"
20 loops, best of 5: 10.1 msec per loop
$ py -m timeit -s "import re; s = '9' * 50000" "re.search('9*8$', s)"
1 loop, best of 5: 2.39 sec per loop
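(If you want to reproduce the scaling without timeit, a small self-contained script works too; on patched interpreters the limit has to be lifted first so the int() call can run. Absolute times are machine-dependent; the shape, roughly 100x per 10x of input, is the point:)

import re
import sys
import time

# Lift the limit where it exists (3.11+ and patched point releases);
# 0 means "unlimited" for this interpreter.
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)

for n in (5_000, 50_000):
    s = "9" * n
    t0 = time.perf_counter()
    int(s)                  # quadratic in len(s), small constant factor
    t1 = time.perf_counter()
    re.search("9*8$", s)    # also quadratic, much larger constant factor
    t2 = time.perf_counter()
    print(f"n={n}: int {t1 - t0:.4f}s, regex {t2 - t1:.4f}s")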
A difference is that a hostile user has to find a vulnerable regexp in the stdlib, and contrive a bad-case input for it. Which I expect they'll do, now that I've mentioned it.
I've opened a PR to refactor the PyLong_FromString function here:
What is the argument for limiting str(x) for integers? I understand the argument that int(s) can be used for a DoS attack, but that does not explain why str(x) should be limited. If a program doesn't allow big integers in its input, then it generally won't try to output big integers either, unless it actually intends to print a big int.
Another problem that arises from limiting str(x) is that this kind of program
x = int(input())
print(x + 1)
will now get a very unexpected error on the print line when given the input '9' * 4300. So the limit on str(x) essentially introduces overflow errors to Python. This wouldn't have been the case if the limit was only introduced for int(s).
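(For what it's worth, the limit is configurable, so a program that genuinely wants big-int round-tripping can opt out, via sys.set_int_max_str_digits(), the -X int_max_str_digits option, or the PYTHONINTMAXSTRDIGITS environment variable:)

import sys

x = int("9" * 4300)            # exactly at the default limit: still allowed
try:
    print(x + 1)               # 4301 digits: the str() inside print() raises
except ValueError as exc:
    print(exc)                 # "Exceeds the limit (4300 digits) ..."

sys.set_int_max_str_digits(0)  # 0 removes the limit for this process
print(len(str(x + 1)))         # now fine: 4301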
The CVE fix does not disable big integer support. It only prevents inefficient conversion from decimal text representation to integer and vice versa. The "Built-in Types" page of the Python 3.12.1 documentation has a complete list of the APIs that are limited. Octal, hexadecimal, and int from/to bytes conversions are not limited because they have linear complexity.
A possible attack vector is YAML, as YAML supports hexadecimal integer notation. An attacker could send you a malicious YAML file with a very large hexadecimal number. If your application uses e.g. logging to log the value, then your application would be affected.
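(A hedged sketch of that vector, assuming PyYAML, whose YAML 1.1 resolver turns 0x... literals into ints. On a patched interpreter the %s formatting fails with ValueError, which logging surfaces as a logging error; on an unpatched one it just burns quadratic CPU:)

import logging
import yaml  # PyYAML, assumed installed

doc = "value: 0x" + "f" * 20_000  # hex parsing is linear, so this is cheap
data = yaml.safe_load(doc)        # succeeds: builds a ~24,000-digit int

# The decimal conversion only happens when the value is formatted,
# which is where the limit (or the quadratic cost) actually bites.
logging.warning("got %s", data["value"])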
If you need to deal with very, very large numbers, then use hexadecimal notation. It's much faster and unlimited. A decimal string with 1,000,000 digits takes about 4 seconds to convert on my system. A hexadecimal number string with 1,000,000 digits takes about 1.3 milliseconds to convert.
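(A sketch of that workaround: round-trip big values through hex, which stays linear and is exempt from the limit:)

big = 1 << 3_400_000  # about a million decimal digits, cheap to build

h = format(big, "x")  # int -> hex string: linear, never limited
again = int(h, 16)    # hex string -> int: linear, never limited
assert again == big   # only the decimal paths are limited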
As someone who's still actively working on "Trojan Source" mitigations, I disagree. The main difference there is that the researchers were unwilling to wait for mitigations before making big announcements and trying to claim credit for it, but even despite this many of the biggest editors and renderers (the tools that need fixes, not the languages) had basic mitigations in place. Once Unicode publishes more concrete guidance on how code renderers should handle Unicode, they'll be able to move toward more targeted mitigations, but for now they're using a similarly heavy hammer.
This doesn't address the ability of libraries to use it and support their users who don't want to upgrade. You can't have an optional keyword argument that is ignored in older versions, and so every library that wants to use it ends up with a version-selected helper method anyway, likely until the end of time.
It also doesn't address the fact that one user's DoS is another user's intentional calculation, but that is the user's decision, not the library's. Libraries that know users will commonly end up converting very large integers to/from strings, to the point where they would consider adding a utility function to do it unsafely, can equally just inform their users that some/many/all will need to modify how their app is run.
But most importantly, nothing will solve all the issues brought up in this thread, other than going back in time to having an implicit limit on the size of int again. Just like one day when we get rid of the recursion limit, then likely decide to bring it back because it's useful for catching runaway bugs, this is likely something that we will come to appreciate as a reasonable check that prevents runaway work from being done.
Since my more exhaustive post was flagged[1], the only point I'd like to reiterate is that the SC eminently has the power to allow backporting a keyword to int() back to stable releases, which would solve this situation in a much less broken fashion.
[1] I'd be interested to hear how it can be considered off-topic or even discourteous... Relatedly, lack of transparency is not helped by hiding (even gentle) criticism.
Because the conversion takes a very long time, and if you have a way to induce someone else's app to generate a very large integer, then you may have a way to force it to freeze up while it converts that number to a string (e.g. it lies about the length of some content that an app tries to put into a Content-Length header).
I agree this is the less exploitable side, compared to parsing an arbitrarily long string, but it's still possible. It's also a bit weird for int(repr(x)) on an integer to fail on one side but not the other (though that is what used to happen in the past anyway, so maybe not so weird).
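(A minimal sketch of that shape of bug; claimed_length is a hypothetical stand-in for an attacker-influenced size. The point is that the int is cheap to create with arithmetic, but the cost, or now the ValueError, lands at the int-to-str step:)

claimed_length = 10 ** 10_000                 # cheap: arithmetic is not limited
header = f"Content-Length: {claimed_length}"  # decimal conversion happens here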
You just typed an entire screen's worth of numbers into a prompt; is it really that unexpected?
You were also about to get an entire screenâs worth of numbers in response. Is that really helpful?
The default length limit is pretty arbitrary, yes, but it's also well beyond the length of a number that humans can comprehend. They are valuable as intermediate results in a calculation, but we as people can't do anything useful with numbers that big. Rendering it as "<4300 digits>" is probably much more useful, even without the leading "9".
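(A sketch of that idea; summarize_int is a hypothetical helper, and the digit count is estimated from bit_length() so no limited conversion is ever attempted:)

import math

def summarize_int(x: int, threshold: int = 50) -> str:
    """Render huge ints as '<~N digits>' instead of converting in full."""
    if x == 0:
        return "0"
    # floor(bit_length * log10(2)) + 1 approximates the decimal digit
    # count (can be off by one right at a power-of-ten boundary).
    digits = int(abs(x).bit_length() * math.log10(2)) + 1
    if digits <= threshold:
        return str(x)  # small enough to show in full
    sign = "-" if x < 0 else ""
    return f"{sign}<~{digits} digits>"

print(summarize_int(12345))       # 12345
print(summarize_int(10 ** 5000))  # <~5001 digits>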
FWIW, I didn't flag it and don't agree with it being flagged.
Yes, and I'm sure they considered it and decided against it. Likely for the reasons I mentioned above, which they've used before for not backporting certain changes.
The SC is our new BDFL - they can, but don't have to, post justifications of their decisions. And the rest of us have agreed to accept them. Anyone who disagrees is more than welcome to run for the next SC to have more influence over it, but by definition, their decision is final.
@steve.dower I was talking about malicious user input. I wonder how much Python code will crash unexpectedly when given the input '9' * 4300.
The default length limit is pretty arbitrary,
What the exact limit is doesn't matter. This "overflow" problem I'm describing would happen even if the limit was 200 or 2000000.
From the testing we did to determine the initial default, not very much will crash unexpectedly. Thatâs how we landed on that number.
From the testing we did to determine the initial default, not very much will crash unexpectedly. Thatâs how we landed on that number.
I don't see how you could possibly run tests on this. What kind of tests existed that checked if maliciously giving '9' * 4300 unexpectedly crashes the program before this new limit was introduced?