Imposing a hard limit on Python integers

storchaka · August 29, 2024, 12:58pm

Currently, the upper limit of Python integers is determined by the size of the block of memory that can be theoretically and practically allocated. On 32-bit platforms it is \frac{15}{2}(2^{31}-1) < 2^{34} bits (2 bytes for every 15 bits). On 64-bit platforms the theoretical limit is \frac{15}{2}(2^{63}-1) < 2^{66} bits. This requires insane amount of memory. This is much more than any existing computer have or can have in observable future. In both Intel64 and AMD64, the 52-bit physical address limit is defined in the architecture specifications. 64-bit computing - Wikipedia

I propose to impose hard limit of 2^{63}-1 or 2^{64}-2 bits for Python integers. With such limit (if forbid creation larger integers even if the size in bytes can fit in 64 bits) the number of bits can always be expressed as 64-bit integer (signed or unsigned), and some complex code in longobject.c and mathmodule.c could be simplified. Currently it is written to support integers larger than such limit, although this cannot be tested (and likely could not be tested in this century).

This can be implementation limit, not language limit.

I proposed this idea in the discussion about gh-121485: Always use 64-bit integers for integers bits count by serhiy-storchaka · Pull Request #121486 · python/cpython · GitHub, and @mdickinson liked it. But this is a fundamental change and should be discussed among all core developers.

oscarbenjamin · August 29, 2024, 1:28pm

Is it a fundamental change?

If it is possible to change this now then it would be similarly possible to change it again in future if it somehow becomes desirable to exceed this limit. That is a purely theoretical concern for the foreseeable future though anyway as you say since at the least it would seem to require supporting some future architecture that does not even exist yet.

A more realistic concern: are there any compatibility implications for downstream extension modules resulting from this change?

I presume that this really only actually changes anything in a 32-bit build because otherwise size_t would already be 64 bits anyway right? So basically this is just making the type bigger on 32 bit system so that the code can be simpler?

Does there exist a supported platform where size_t is more than 64 bits?

notatallshaw · August 29, 2024, 1:49pm

Why is this in “Core Development” and not “Ideas”? It would be a big breaking change for many users (e.g. anyone who ever uses scripts to write Number Theory or Cryptography) and would presumably require a PEP?

I’ll note that limiting integers to be capped by default to printing 4,300 digits (approx 14,284 bits) without prior announcement caused significant backlash: Int/str conversions broken in latest Python bugfix releases.

Or am I misunderstanding, and this is an internal implementation detail that affects C code and not Python code?

Edit: Apologies, reading again, this is almost certainly “CPython implelentation integer” and not “Python Integers”. I hope you appreciate why I found the termanology confusing.

malemburg · August 29, 2024, 1:57pm

Hmm, that would make working with Googolplex numbers hard in Python

If my math is correct, they need around 2^{334} bits to store in memory. But hey, it’s going to take a while before computers can manage such numbers down to the last bit with full accuracy, if ever.

Seriously, I think size_t is a good measure for what computers can realistically handle in the foreseeable future. That value should be 64 for most of today’s machines and we’re not even close yet to having that much RAM available.

barry-scott · August 29, 2024, 2:59pm

There are only 10**80 atoms in the observable universe.
Use that as your limit on memory size.

pitrou · August 29, 2024, 4:41pm

Definitely agreed with this proposal. It’s extremely unlikely that it would bother anyone.

pitrou · August 29, 2024, 4:43pm

I think you misread, this is capping the size of integers to 2**63 - 1 bits, not their value. Their maximum value would therefore be 2**(2**63 - 1).

gpshead · August 29, 2024, 10:19pm

This seems uncontroversial to me. We can never represent exact values such as int so large that the RAM to represent it can’t possibly exist.

encukou · August 30, 2024, 6:53am

The C API is kind of between that, as some other implementations re-implement or emulate it.
I expect the public version of _PyLong_NumBits will return int64_t, rather than size_t. Which is fine: in a future where it’s needed, we can introduce a new function and let the old one fail.

storchaka · August 30, 2024, 7:17am

No, it only affects 64-bit platforms. I just made bit counts 64-bit on all platforms, this solves issues on 32-bit platforms (it required only 0.5 GB to create an integer whose size in bits exceeds size_t).

What I propose here, is to change the limit on 64-bit platforms. Currently it is about 2^{63} bytes or 2^{66} bits. I propose to reduce it to about to 2^{63} or 2^{64} bits (2^{60} or 2^{61} bytes). Both limits are insanely large and exceed the maximum size of addressable memory on most of current platforms.

Even if they exist, Python is not a right tool to work with exbibyte-size integers.
Even adding two of such integers will take yeasr.

Well, it seems that there are no objections. I am not still sure about using signed or unsigned 64-bit integers, but this detail will be discussed on GitHub and in the C API Workgroup.

avi.gross · August 30, 2024, 3:51pm

\usepackage{mathtools}

The proposal is not outlandish if we look at the realities of implementing anything so large that might not be possible far enough into the future that python as a language will have been superseded.

But do people sometimes, perhaps inadvertently, use larger numbers? Sure.

Consider the typical formular when looking at permutations and combinations where if you want N objects arranged K at a time you blithely use the factorial function as in:

Screenshot 2024-08-30 114625

If some person just wonder how many ways the 10 to the 80th calculated particles in the universe can be arranged in a straight line, they run out of memory and computer time very quickly.

Of course, in some future, it may be possible for very networked computers to collectively share storage for larger numbers and do operations on them, or perhaps you could even somehow use quantum computers to more compactly store larger items.

As noted, a special-purpose extended type could be added in such eventualities.

kumaraditya303 · August 30, 2024, 4:16pm

Note that with Intel level 5 paging it can use 57-bit virtual address space.

I agree that we should limit this based on how much RAM can represent nowadays.

franklinvp · August 30, 2024, 4:48pm

Just a clarification, such that judgement of the idea is not done in terms of the following incorrect statement:

Every individual number, no matter how large, can be represented exactly and only one bit is needed for that. There is no such thing as

int so large that the RAM to represent it can’t possibly exist

What requires space is a structure that can represent (has different states for) every one of a very large collection of numbers.

It is not inconceivable to have int represent every single integer up to some magnitude while after that switch to an encoding that is only able to represent some, but not all integers beyond that point. The same happens everywhere. We do not have representations for all integers. There are recipes (like positional system) that would do so, but all of them under the assumption “provided enough space” that eventually fails. But even past that failure, switching to different representations (2^{22222222}, 2^{2^{2^{2}}}, 2\uparrow\uparrow\uparrow\uparrow\uparrow\uparrow\uparrow2, \operatorname{TREE}(\operatorname{TREE}(2))) one can hit larger values, although not all, and do some operations with them too, but not all.

I cannot judge if the same is expected/desirable in the near future of Python but all of these are phenomena that currently exist in Python:

Changing internal representation after some value. int does that.
Not all real numbers having representatives: int and float do that.
Not all operations are valid: 1 // 0 and str of large int, for example.

No opinion either way. I only wanted that choices are made based on correct premises.

oscarbenjamin · August 30, 2024, 5:47pm

I don’t think anyone is objecting but I do have a question that went unanswered:

It is not clear to me from the PR diff whether it is all just changes to internal types that don’t touch any public (or non-public) API that downstream code might be using. It is not unheard of to access the internals of PyLong because it is slow (fix here) to convert large integers otherwise.

jeff5 · August 30, 2024, 7:20pm

All the arguments do seem to be informed by implementation details, so I think it must be an implementation limit.

Obviously asking for the return of sys.maxint would be a mistake here . Do we need something like int.maxbits or sys.maxintbits?

dg-pb · August 30, 2024, 7:23pm

Maybe sys.int_info.maxbits?

gpshead · August 30, 2024, 8:27pm

Why bother exposing the “limit” anywhere? Nobody can use int values even approaching that large anyways. The limit is effectively already in place, it’s just that you currently potentially wind up getting a MemoryError or other not quite as explanatory exception.

tjreedy · August 30, 2024, 9:35pm

‘Integer’ is a mathematical abstraction denoting a particular doubly infinite ordered set. Unless ‘Python integer’ redundantly means either ‘integer’ or ‘Python int’', I think it should mean “Python object with an integral value”. And that should not have a hard limit, and indeed such could hardly be imposed on user classes.

Python ints are an implementation that can represent all integers between lower and upper bounds. The practical limits depend on the machine and currently available limits. The theoretical limits for the implementation as is on 64 bit machines is the address space size. The proposal is to divide by about 8 so the bit limit, rather than byte limit, will fit in an int64, so as to simplify internal C int code. So this would be an implementation limit.

Since this will have no effect on practical limits for decades, at best. I think go ahead. If any extensions dig deep enough into the implementation to be affected, I think they should adjust. But best to get this into .a1 to find out as soon as possible.

Unless there is already a sys.int_bit_limits or equivalent, there would need to be a good reason to expose one now.

dg-pb · August 30, 2024, 9:46pm

It is useful information that contributes to description of int and there is a natural place for it. All <number>_info define bounds if they exist, in both Python and numpy (possibly many others too).

How often do other values in sys.int_info get used in code? This would probably be one of the more useful ones in there.

Also, it would be a small safeguard against the unknown and unforeseen at minimal cost.

skirpichev · August 31, 2024, 3:45am

This looks more suitable for specialized computer algebra system and far beyond scope of a general purpose language like Python. Even algorithms used for integer arithmetic here are school-level (with few exceptions like Karatsuba multiplication). Hardly this will be changed, unless CPython switch to external bignum library.

I think that this use case isn’t affected by OP PR.