API discussion for improving the CVE-2020-10735 mitigations

h-vetinari · October 22, 2022, 12:17pm

Now that the dust on https://discuss.python.org/t/int-str-conversions-broken-in-latest-python-bugfix-releases has settled, I wanted to open the discussion on future APIs that can improve some of the problems left by the mitigations (e.g. dependence on global state).

Since that topic is locked I cannot announce it there, hence the new topic.

I started off with a concrete proposal in Add `limit=` keyword to `int()` and `str()` functions to avoid contention on global `PYTHONINTMAXSTRDIGITS` · Issue #98547 · python/cpython · GitHub, because I think call-site-level explicitness (defaulting to PYTHONINTMAXSTRDIGITS if not given) is the only solution that satisfies all of:

safe by default
avoid guessing whether the programmer had considered that this particular call might blow up
allow an explicit override with no knock-on effects
ergonomic, non-intrusive API

Of course this doesn’t rule out further things like context managers etc., so if people want to bend that discussion into different directions, I don’t mind. I just wanted to get the ball rolling.

oscarbenjamin · October 24, 2022, 11:28am

I think it would be better to have an explicit function that parses strings to int and accepts arbitrarily large decimal inputs by default with no keyword parameter needed. Then that function is a dropin replacement for code that was previously using int to parse decimal integer strings.

For str/repr the situation is more complicated because they are often used indirectly:

class MyThing:
    def __repr__(self):
        return f'MyThing({self.large_integer})'

If you propose to add a keyword parameter to str then is that supposed to be passed to __str__? How does the parameter reach where it needs to be in a recursive call? (This is precisely why using global state for things like this is a bad idea.)

h-vetinari · October 25, 2022, 3:07am

int_unlimited = lambda x: int(x, max_digits=0)

Naming discussion aside, the points about recursion and default string conversions are good ones, but AFAICT that ship has sailed with the existence of PYTHONINTMAXSTRDIGITS. It could be solved through context managers, but never python-globally^[1].

And, that’s sad if your use-case is affected (unless eventually the proliferation of max_digits-safe calls allows lifting the default limit to infinity again), but I don’t see why the lack of a perfect solution should hinder the adoption of a clear (IMO) improvement over the status quo (reliance on global state).

PS. I’m now using max_digits= instead of limit= in the examples since that kwarg-name was suggested as better in the issue.

though it needs to be said, that it would be possible to use the max_digits-imbued functions – at the cost of extra syntax – even in those cases ↩︎

oscarbenjamin · October 25, 2022, 10:12am

There is a significant overhead in using a lambda function:

In [44]: s = '1'

In [45]: f = lambda s: int(s)

In [46]: %timeit int(s)
170 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [47]: %timeit f(s)
229 ns ± 2.69 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

I expect that overhead would increase with the use of keyword arguments as well.

To preempt the obvious question I’ll ask and answer it myself:

Why am I timing with small strings while asking for a function that can correctly parse large strings?

The answer is that in context small strings are still the 99% case but the code should be expected to handle large inputs as well. Some considered that the previous behaviour represented a silent bug because it would suddenly become slow for a rare large input. However in the usecase where large inputs are potentially expected this is reversed: the limit-by-default represents a bug because it means that the default behaviour is to fail unexpectedly on some perfectly valid inputs.

steven.daprano · October 25, 2022, 10:55am

If this int_unlimited is a builtin, it wouldn’t have the overhead of calling the int function from Python code. We could make it as fast as the original int function was.

I’m not sure that we need a seperate function for this.

For applications that deal only with small ints with not too many digits, they can just use int;
For those that need to deal with big ints with many digits, the overhead of the string conversion will surely dwarf the overhead of providing a keyword parameter. (I guess, I haven’t measured it.)

oscarbenjamin · October 25, 2022, 11:29am

I thought I was overdoing it by asking and answering my own strawman question but clearly not since you have missed precisely the point it was supposed to make :). What is needed is a single function that simultaneously has these two properties:

It should be fast especially for small inputs.
It should still work correctly for large inputs.

It isn’t possible to simply separate the use cases where the inputs are small and large: the same code needs to run in both cases! Almost all the time the inputs will be small but the code should still work correctly when the inputs are large.

h-vetinari · October 25, 2022, 12:11pm

Sure, there are no free lunches to be had anywhere around here. The mitigations for CVE-2020-10735 went much further than performance degradation though, they actively broke working code in a patch release.

In an ideal world, we wouldn’t lose any performance, but given that this limit is now in place and +/- immovable, why reject a solution that unbreaks several legitimate use-cases even if it has a certain performance penalty?

Not that I’d have anything against another built-in in principle, but that comes with its own set of associated costs, and from my POV those costs probably outweigh the performance degradation you point out. Opinions may differ of course, but to me, performance is an odd thing to focus on in this topic, when we have IMO bigger problems (broken – as opposed to “merely” underperforming – downstream code & suboptimal mitigations that depend on global state).

oscarbenjamin · October 25, 2022, 3:59pm

What I still don’t understand is the reluctance to provide a function that can work as a drop in replacement for the previous behaviour. That is the simplest and most obvious way to mitigate the compatibility problems and restore any lost functionality. It also provides the simplest path for downstream code to adapt. A function that defaults to always working but has an optional max_digits parameter could be used like:

try:
    from wherever import large_int_from_string
except ImportError:
    large_int_from_string = int

val = large_int_from_string(s) # not using the parameter

nas · October 25, 2022, 5:48pm

Changing int() (with keyword or variant function) handles the str-int case but doesn’t handle the int-str case. There is no way to add a kwarg or variant function for that path. E.g. "%d" % n.

steve.dower · October 25, 2022, 7:18pm

As Neil said, it doesn’t handle anywhere near all the cases. So it becomes a significant implementation/documentation/testing/maintenance effort for something that isn’t actually useful. (The lambda overhead isn’t really relevant either - we’re talking here about explicitly spending multiple seconds converting strings - a couple of microseconds for each one is irrelevant.)

The fix we really want is a context-local override of the limit.

That can then be used with a context manager (which we can also provide) to handle all cases within a contained scope, rather than forcing you to trace through all of your code to find the exact call to int (if it even exists) so you can modify or patch it. Context-local (i.e. contextvars) means it won’t affect any other running code, and there’s very little perf overhead because the conversion won’t even look up the variable unless it exceeds the minimum threshold (as it does today).

oscarbenjamin · October 25, 2022, 9:49pm

Since you repeat the same very basic misunderstanding let me rephrase once more: the previous functionality of int(str) was used to parse both small and large strings with the same call to the same function. The fact that int still works for small strings does not mean that all we need now is a separate API whose only purpose is for large strings and whose performance for small strings is irrelevant.

The fact that someone wants to call int on a string and not have it fail for large inputs does not mean that the inputs are always expected to be large. In fact precisely the opposite is usually true: the inputs are almost always small. Conversely, expecting the inputs to be usually small does not mean that it is okay to fail on large inputs.

I am not describing a niche use case here but rather:

Stating the functionality that existed for a long time and was recently removed.
Explaining precisely how that functionality was used.

You might not have thought about this explicitly yourself but when using int(str) I’m fairly sure that you will have at least implicitly understood these two basic points:

In the place where you call int(str) the string will almost always be small and your expectation is to get a small integer.
The function still works if the input string is large though and will then produce a large integer.

In some other languages point 2 doesn’t hold and so every time you go to parse a string into an int you have to consider if it will overflow on valid inputs. The fact that you haven’t needed to worry about that in Python is precisely because of point 2. After the recent changes point 2 no longer applies to int(str). There is no reason that there cannot be another function that does satisfy point 2 though.

h-vetinari · October 25, 2022, 10:32pm

I think a context manager would ultimately be necessary as well (solving essentially all the concerns for code that is willing to take the extra ceremony), though I feel it’s more or less orthogonal to the keyword for int, which provides a less complete solution, but is much easier to use.

Personally I’d like to see both.

steve.dower · October 25, 2022, 10:58pm

That was my very next sentence

The point is, a context manager alone isn’t enough, because the current limit is global. Once the limit is no longer global, the context manager is a convenience but not essential because it’s possible for someone to write their own. But right now, it’s impossible until we fix the scope of the current limit.

h-vetinari · October 25, 2022, 11:09pm

No worries, that was just me quoting badly on the phone. I equated context-local override with a context manager anyway.

I’m not sure I understand - a context manager can override the global limit within its scope, yes? I happen to agree that a context manager isn’t (good) enough, but within its scope, things should work identically as previously (AFAIU).

steve.dower · October 25, 2022, 11:43pm

Other threads or async tasks will be affected if you change the current limit - same as if you change the current working directory or sys.stdout. It’s a process-wide setting (technically runtime wide, but good luck having multiple runtimes in the same process ).

For example, if you have code that removes the limit, does a conversion, then restores the limit, and it happens to run on two threads at nearly the same time, you might restore the limit before doing the conversion instead of after it. Similarly, if you are using async and have an await within the context manager, another task might restore the limit while you still think it’s been removed. A context-local limit avoids this.

What the context manager does is make it so you are guaranteed to call the code that restores it. It’s essentially the same as putting a try/finally around the code block to ensure that you always run the cleanup code. But if the setting itself is global, the effect of changing it is still global.

h-vetinari · October 26, 2022, 12:05am

Thanks for the explanation. For one, I wouldn’t use the same global variable in the ctx directly, but an ephemeral replacement that supplants the global variable in the scope of the ctx.

[snip]

… and now I realise what you meant by context-local override. Yes we need that (I had just assumed that to be possible from the get go).

Btw. the async problem exists just as well without a context manager - it’s just an inherent problem of global state in the context (ha!) of multithreading, and one more reason why this should be tackled with alacrity.

steven.daprano · October 26, 2022, 12:08am

Sure, and until you demonstrate otherwise, that function remains int. Either by setting sys.set_int_max_str_digits(0) once at the start of your application, or with some future keyword argument like int(string, maxdigits=0), or with a context manager.

The simplest, low-impact solution to this issue for application writers will surely be to set the environment variable PYTHONINTMAXSTRDIGITS=0 or to call sys.set_int_max_str_digits(0) once at the start of their application.

I suppose that there will still be some overhead: internally, the int() function will contain a test that it didn’t have before, but I doubt that it will cause a significant performance regression. In any case, the onus is on somebody to demonstrate that performance regression before we consider acting on it.

Before deciding that we need a new builtin that duplicates the old int (i.e. one which performs string to int conversions but with no length check) it will be necessary to demonstrate that new int with length check has such a severe performance regression that we cannot live with it.

As far as library authors go, things are a little murkier.

Regardless of how we spell the old int() with no length check (setting the global variable, using a context manager, using a new parameter to int(), or using a hypothetical new builtin), libraries should not do that as it may reintroduce the DOS vulnerability into applications that have not taken steps to mitigate it with their own length checks.

Since Python doesn’t have any concept of tainted and untainted strings, libraries in general must assume that all strings are potentially from an untrusted source, and always leave it up to the caller (the application) to set the string max digits.

That means that even if we introduce a new builtin that ignores the length check, libraries should never use it.

Instead, the pattern should be:

the application sets the global, or uses a context manager:

with int_check(maxdigits=0):
```
  result = library.function(string)  # calls int(string)
```
the application explicitly passes on a parameter to the library to use:

obj = library.SomeClass(arg, maxdigits=0)

result = obj.method(string) # calls int(string, maxdigits=self.maxdigits)

By the way, it is now nearly seven weeks since this vulnerability was made public. There are surely hundreds, or thousands, of web servers around the world still running older, unpatched versions of Python. Do we have any examples of concrete exploits in the wild for this threat yet? A threat so urgent that the security team sat on it for over two years before rushing out a breaking change without community consultation or discussion.

There’s large, and then there’s LARGE. Outside of integer maths applications like Sage nearly all applications will consider a 100 digit number to be inconceivably “large”, and a thousand digit number to be invariably some sort of data corruption, let alone the default setting of 4300. The total number of subatomic particles in the entire universe can be written out using just 80 digits. If we include photons (light) we just need 90 digits.

It is only maths geeks like you, and me, that worry about being able to handle million digit numbers.

h-vetinari · October 26, 2022, 12:49am

That paragraph is a bit incongruous in the context of the rest of your answer. Personally, the question you pose is part of the reason why I was very unhappy with how this whole situation was handled^[1].

But much as I disliked the way it came about, I’ve come to view the conversion limit as an immutable given now (it’s “politically” impossible to reverse course anyway), and decided to move forward as best as possible within that new set of constraints.

made worse by the lack of a nuanced, broadly communicated rationale from the SC, and then declaring the discussion about the governance aspects off-topic. ↩︎

Rosuav · October 26, 2022, 1:08am

Steven D'Aprano:

What is needed is a single function that simultaneously has these two properties:

It should be fast especially for small inputs.

It should still work correctly for large inputs.

Sure, and until you demonstrate otherwise, that function remains int. Either by setting sys.set_int_max_str_digits(0) once at the start of your application, or with some future keyword argument like int(string, maxdigits=0), or with a context manager.

The simplest, low-impact solution to this issue for application writers will surely be to set the environment variable PYTHONINTMAXSTRDIGITS=0 or to call sys.set_int_max_str_digits(0) once at the start of their application.

I suppose that there will still be some overhead: internally, the int() function will contain a test that it didn’t have before, but I doubt that it will cause a significant performance regression. In any case, the onus is on somebody to demonstrate that performance regression before we consider acting on it.

Before deciding that we need a new builtin that duplicates the old int (i.e. one which performs string to int conversions but with no length check) it will be necessary to demonstrate that new int with length check has such a severe performance regression that we cannot live with it.

Question: What does “work correctly” mean? You’re claiming that int() would check a global configuration parameter and then either be limited or unlimited - so that means you’re of the opinion that raising an exception on large inputs is working correctly? Genuinely confused here, as I had been of the understanding that “work correctly” meant “give the correct result, as previous versions have done, and without O(n²) complexity, but potentitally at the cost of small-parse performance”.

What exactly are you arguing for here?

steve.dower · October 26, 2022, 3:46pm

Criticism is fine, this just isn’t the thread for it. Use the old thread, or communicate it directly to the Steering Council, as it’s their mandate that we’re doing the work beneath.