Int/str conversions broken in latest Python bugfix releases

I think it would be reasonable for the json module to have a limit on the size of integers, with some option like loads(s, maxint=0) for compatibility. Anything greater than 2^53 won’t round-trip correctly through JS itself anyway, so you should really send it as a string rather than a number; it can then be a hex string (JSON doesn’t support hex notation for numbers). Alternatively, Python could match JS and turn any large integer into a float.
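For what it’s worth, something close to this can already be done today with the existing parse_int hook; here is a rough sketch (the 16-digit cutoff is an arbitrary choice for illustration, not a recommendation):

import json

def limited_int(digits, max_digits=16):
    # Reject integer literals whose textual length could make int() arbitrarily slow.
    if len(digits.lstrip("-")) > max_digits:
        raise ValueError(f"integer literal with {len(digits)} digits exceeds the limit")
    return int(digits)

print(json.loads('{"ok": 42}', parse_int=limited_int))             # {'ok': 42}

try:
    json.loads('{"huge": ' + "9" * 100 + '}', parse_int=limited_int)
except ValueError as exc:
    print(exc)                                                      # rejected before conversion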

You are misunderstanding, and therefore misrepresenting, the situation here.

In the past the Python Security Response Team has been using Red Hat Product Security (secalert) to handle CVEs for us. They generously provided the service for free and did most of the heavy lifting around CVE requests, assessment, and announcement for us. Writing a CVE is a non-trivial task and a considerable amount of work. For example, it took Greg and me several discussions and some research to come up with a good CVSS estimate. The service made our lives easier and reduced our workload.

For various reasons it took a couple of extra days until SecAlert updated the public CVE entry. Among other reasons, I didn’t give them enough of a head start to work on a public report, because I did not anticipate that the patch would be ready for 3.11.0rc1.

Independently of the delayed CVE release, the PSF and PSRT have been discussing becoming a CNA (CVE Numbering Authority) for a while. This would give us greater control over CVE assignment and updates. In the past, users have requested CVE numbers from MITRE for invalid bugs in an attempt to get a CVE with their name on it. Becoming a CNA also means more work for us.

The issue was reported independently by several people. If I recall correctly both Django and FastAPI community members flagged the problem as a critical issue and major threat.

2 Likes

There are some important differences between int(text) and str(integer).

  1. Security: int(text) is much more sensitive than str(integer), security-wise. This is because it’s common to have untrusted strings, while it’s relatively rare to have untrusted integers.

    The main avenue for an untrusted integer to appear is from an untrusted string, which is mostly prevented by putting a limit on int(text). It’s no doubt still possible to have integers whose size can be blown up by the user, for instance from an int(text, base=16) conversion or from calculations affected by user input. However, in common scenarios like working with JSON or a typical web UI, the int(text) conversion is an important threat while str(integer) is not.

    Note also that the CVE only mentions int(text) and not str(integer).

  2. API breakage: I expect that limiting str(integer) will cause more existing programs to break than limiting int(text). This is because existing code typically already handles a ValueError on int(text) conversions where text is untrusted.

    On the other hand, a ValueError on str(integer) is typically not handled. We should not take this lightly: if we assume integer is affected by user input, then this unhandled ValueError can make new attacks possible. I’m not sure whether raising a ValueError is better here security-wise. In contexts where DoS attacks are not a concern, e.g. desktop applications, the API breakage in str(integer) is more severe than in int(text).

    It’s also difficult to write correct library code now, because every time you log or print an integer of unknown size, you need to consider the possibility of it being too big.

Less importantly, but still relevant:

  • Integer literals are almost always trusted, so it’s worth considering lifting the restriction on these.
  • String-to-integer conversions with a base that is not 2, 4, 8, 10, 16, or 32 are very rare in a security-relevant context. I don’t know any data exchange formats that use them.

With all that said, I’d propose:

  1. str(integer) never raises a ValueError.
  2. Large decimal literals don’t cause a SyntaxError. Perhaps a SyntaxWarning can be used to let the user know that a hexadecimal literal would be faster.
  3. int(text) should raise a ValueError on large input by default. However, int(text, base=n) with n != 10 should not be limited.
  4. In fact, I’d propose that whenever a base is explicitly provided to int(), there should be no size limit. This is for two reasons: When the base is variable, the user is doing something mathematical that probably shouldn’t have a special case for base 10. More importantly, however, this provides an “escape hatch” for library authors who require the old behavior from int() with base 10.
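For reference, here is a small sketch of the behaviour being debated, as shipped in 3.11 and the patched bugfix releases (the 10,000-digit values are just illustrative):

import sys

sys.set_int_max_str_digits(4300)        # the shipped default

big = 10 ** 10_000                      # built arithmetically; no limit applies here

try:
    str(big)                            # str(integer) is limited today
except ValueError as exc:
    print("str:", exc)

try:
    int("9" * 10_001)                   # int(text) in base 10 is limited
except ValueError as exc:
    print("int:", exc)

# Power-of-two bases are exempt, so hex round-tripping still works:
assert int(hex(big), 16) == big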
22 Likes

A bit of timing data that might be useful when discussing this limit. If I create a JSON data file containing a dict with integers, either 10,000 integers with 4,000 digits each or 1,000 integers with 40,000 digits each, I get the time to load as follows:

Python 3.10.6:

python -m timeit -s "import json; s=open('int40_000.json').read()" "json.loads(s)"
1 loop, best of 5: 5.9 sec per loop

With my _pylong change:

1 loop, best of 5: 2.99 sec per loop

So not a massive improvement for this case; 40,000 digits is not really enough for the better algorithms to reduce the runtime much.
With smaller integers, 4000 digits:

python -m timeit -s "import json; s=open('int4000.json').read()" "json.loads(s)"
1 loop, best of 5: 660 msec per loop

The JSON file is about 40 MB.
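In case anyone wants to reproduce this, here is a sketch of how such a test file might be generated (the filenames and digit counts are taken from the description above; the rest is my guess at the setup):

import json
import random
import sys

if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)   # otherwise json.dump raises ValueError
                                    # while stringifying the 40,000-digit ints

def make_file(path, count, digits):
    data = {str(i): random.randrange(10 ** (digits - 1), 10 ** digits)
            for i in range(count)}
    with open(path, "w") as f:
        json.dump(data, f)

make_file("int4000.json", 10_000, 4_000)     # 10,000 integers, 4,000 digits each
make_file("int40_000.json", 1_000, 40_000)   # 1,000 integers, 40,000 digits each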

I hope everyone can be civil and realize the people implementing the PYTHONINTMAXSTRDIGITS change are acting as they feel is best for Python. For the vast majority of the Python code I run, 4300 is more than enough. OTOH, I will also certainly run into the limit at times. Depending on the person, I can see how that gets annoying. I’m going to consider setting PYTHONINTMAXSTRDIGITS=0 in my default environment.

3 Likes

Is it possible to quantify the reasoning behind the 4300 limit? It’s a peculiar number, which suggests either that it was the output of a computation - or was a SWAG that was made peculiar to make people think it was the output of a computation :wink:.

How fast is fast enough? For the 10-million character string "9" * 10_000_000, the asymptotically better str_to_int() in Neil’s PR today is better than 16x faster: the difference between roughly 400 and 24 seconds.

But that’s more than “a few megabytes”. How many megabytes is the implicit limit? On the 3-megabyte string "9" * 3_000_000, str_to_int() is better than 10x faster, about 36.5 seconds down to 3.5. Since we can already squash about 700 4300-character strings into 3 million bytes, presumably burning a second in all is not “a DoS vulnerability”. But is 3.5 seconds really that much worse?

We have a version of str->int that is asymptotically better still, but its overheads are so high that it’s still slower than Neil’s current str_to_int() on a 10-million character input. It’s twice as fast at 100 million characters. CPython’s current str() takes well over 10 hours to convert it.
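For anyone who wants to see the quadratic behaviour first-hand, here is a rough sketch timing the built-in conversion (not the PR’s str_to_int()); the limit has to be lifted first, and the 10-million digit case runs for several minutes on current CPython:

import sys
import time

if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)        # 0 disables the limit on patched releases

for n_digits in (3_000_000, 10_000_000): # the sizes discussed above
    s = "9" * n_digits
    t0 = time.perf_counter()
    int(s)                               # CPython's built-in quadratic str -> int
    print(f"{n_digits:>12,} digits: {time.perf_counter() - t0:.1f} s")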

3 Likes

The 4300 was chosen to not break a numpy unit test.

6 Likes

Hmm. As an engineering rule of thumb, if I observe a maximum quantity Q in a fair number of real-life traces, I write code to accommodate at least 10 * Q gracefully, given that real life often follows a long-tail distribution instead of a normal distribution. So I wonder how many people would have already “bumped into this” if the limit had been 43,000 instead. Then again, in this specific area there’s a very long tail.

6 Likes

CPython’s int divmod is quadratic-time, always, because CPython’s int division always is. The decimal module’s divmod can be very much faster on large inputs, because the speed of its fat-input division is inherited from its fat-input multiplication, and decimal implements two fat-input multiplication schemes that are better than quadratic-time.
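Here is a rough benchmark sketch of that difference (my own construction; the operand sizes are arbitrary, the timings depend heavily on the machine and the mpdecimal build, and the int case takes a while):

import decimal
import time

ctx = decimal.getcontext()
ctx.prec = decimal.MAX_PREC             # allow exact results of this size
ctx.Emax = decimal.MAX_EMAX

i_a = 3 ** 2_000_000                    # ~954,000 digits
i_b = 7 ** 600_000                      # ~507,000 digits
d_a = decimal.Decimal(3) ** 2_000_000   # the same values, built as Decimals
d_b = decimal.Decimal(7) ** 600_000

t0 = time.perf_counter(); divmod(i_a, i_b); t_int = time.perf_counter() - t0
t0 = time.perf_counter(); divmod(d_a, d_b); t_dec = time.perf_counter() - t0
print(f"int divmod:     {t_int:.2f} s")
print(f"decimal divmod: {t_dec:.2f} s")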

Why was the number not chosen based on the security concern? What is the maximal number of digits that does not allow a denial-of-service attack?

3 Likes

I’m only speculating because I wasn’t involved. One man’s security mitigation is another’s pain in the posterior. Basically anything done for security is going to inconvenience someone. There is a story about the original MIT time-sharing systems not having passwords on user accounts.

Imagine there are literally tens of thousands (if not millions) of online services running with Python in the background. If troublemakers realize they can feed those services specially crafted long integers that take orders of magnitude longer to process, that could cause a lot of disruption. I’m guessing that’s the thought process behind the current limit.

Could the limit have been 10x the current value? Perhaps, and then software that was easily exposed to attack or running on slow CPUs could set it lower, e.g. with a system-wide env var. However, how many places would actually do that? Anyone who has worked with large companies trying to get software upgrades done can tell you it is extremely painful and slow.

2 Likes

I’m sure that reasoning about security issues is hard, and taking responsibility for implementing a fix that you can reasonably claim has solved a security issue is also hard. I think that there are problems with the process here, though. I don’t think that this issue was so urgent that it needed to be discussed in secret, and, more relevantly now, I don’t think that a fix agreed in secret should be exempt from being debated and reconsidered in the public domain (rather than presented as a fait accompli).

Ultimately I think here that the fix is just in the wrong place. Lots of things can be slow but that doesn’t mean that we disable those features of the language. Security issues need to be considered in the right place.

So far the only examples given for a potential security vulnerability are related to json. That makes sense because parsing is a key area for security concerns. I don’t know a lot about json but from a cursory look it seems that it isn’t designed to represent large integers. So why does Python’s json module try to support large integers?

How do other json parsers compare here? Are there any other implementations of json parsers that support arbitrarily large integers?

12 Likes

I think “Chesterton’s fence” applies here. I understand you (and your users) are likely disproportionally affected by this change. So, it is understandable you are unhappy with the change. However, if you think the json module can be fixed to mitigate the issue, you haven’t yet understood the purpose of the fence.

3 Likes

The hash randomization issue was already public and associated with Python at the time.

We’re not claiming this int thing is in any way exciting, new, or novel; it was all about context, and the decision to keep this one private was made primarily because it hadn’t publicly been associated with the Python ecosystem.

Thanks for your pile of links; those are useful context if I ever write up a retrospective on this.

Make your point explicitly rather than vaguely suggesting that I haven’t understood something.

Why not fix security vulnerabilities in the proper place? What exactly is wrong with trying to identify where the unnecessary problems actually are (as I did in my previous comment about json)? Feel free to point to any other implementation of json that has this vulnerability.

I could list lots of things in Python that could be unexpectedly slow. If the only way to make Python “secure” is to globally prevent any of those things from happening then a secure version of Python would probably be useless.

9 Likes

Ruby’s does.

Define “support” :wink:. JSON was definitely not designed by people with numeric experience. The spec is nearly useless. It doesn’t define an integer type, or a floating-point type, just a “number” type, and guarantees nothing about portability - beyond merely noting that the IEEE-754 double floating-point format is widely supported, and so

Note that when such software is used, numbers that are integers and are in the range [-(2**53)+1, (2**53)-1] are interoperable in the sense that implementations will agree exactly on their numeric values.

But that’s not required. Instead:

This specification allows implementations to set limits on the range and precision of numbers accepted.

without guaranteeing any minimums on those limits.

I almost never use JSON myself, but my understanding is that most implementations map JSON “numbers” to IEEE-754 double-precision floats (same as Python’s float type on all major platforms today). The JSON “number” syntax has no way of spelling infinities or NaNs, though (and the spec itself points that out).

4 Likes

So does Pike’s. The JSON standard never says to truncate numbers to double-precision.

On the other hand, JSON is based on Javascript where the only number type is a double. I think it would be very reasonable for a JSON parser to turn any number greater than 1.7976931348623157e308 to Inf.
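For what it’s worth, Python’s json module can already be made to behave that way via its parse_int hook; here is a tiny sketch (my own illustration, not something the module does by default):

import json

# Map every JSON number, including integer literals, to a double, the way most
# JS-based parsers do. Values beyond DBL_MAX come back as inf rather than erroring.
doc = '{"big": ' + "9" * 400 + ', "small": 42}'
data = json.loads(doc, parse_int=float)
print(data["big"])     # inf
print(data["small"])   # 42.0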

That’s probably because most implementations don’t have arbitrary-precision numeric types. Bigint types are far from universal, and very few languages have built-in support for arbitrary-precision non-integers, so I can only think of a handful of languages that would even be capable of mapping JSON numbers to a consistent built-in type without loss. Of them, Python (currently) and Pike both do so correctly for integers, but round non-integers to double-precision. PostgreSQL might have support for larger numbers but I haven’t dug into its JSON support enough.
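As an aside (my own sketch, not a claim about what other parsers do): Python’s json module can also be told to keep non-integer numbers exact, by handing the unparsed digits to decimal.Decimal instead of float:

import json
from decimal import Decimal

doc = '{"x": 3.14159265358979323846264338327950288}'
print(json.loads(doc)["x"])                        # 3.141592653589793, rounded to a double
print(json.loads(doc, parse_float=Decimal)["x"])   # every digit preserved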

Does the spec ever say “must be restricted to what JavaScript can interpret”? From my understanding, the JSON spec merely defines a grammar, and assumes that values will be mapped to whatever the host language can support.

3 Likes