Faster large integer multiplication

enok · January 23, 2022, 7:03pm

I’m considering looking at improving the multiplication of Pythons built-in integers. There are faster methods than Karatsuba which is currently used in Python to multiply large integers. Also perhaps a larger digit size would be beneficial on modern processors. Today only 15- and 30-bit digits are supported.

Multiplying two 10^7 bit integers takes a few seconds on my laptop (Python 3.9). One realistic goal could be to achieve 10^8 bit multiplication on the same time without any assembler code. A faster multiplication would automatically give improved division performance as well.

Or is there any argument against such an effort? Would it have any chance of being accepted?

barry · January 23, 2022, 7:17pm

I would defer to @tim.one

tim.one · January 23, 2022, 7:24pm

I added Karatsuba to Python two decades ago, and I’ve said I regret doing so. I would oppose removing it, but it “shouldn’t have been” added to begin with.

There’s basically no end to this, GMP will always run faster in real life, with a group of numeric experts thoroughly dedicated to keeping it at the top of the heap. It’s an enormous amount of difficult code, and just doesn’t belong in core CPython. You can count the number of numeric experts contributing to CPython on a fraction of one hand’s fingers .

Python users can get the benefit of their extreme efforts merely by installing the gmpy2 extension.

BTW, the decimal module implements multiplication algorithms fancier than Karatsuba, but that’s effectively leveraging an external library too.

enok · January 23, 2022, 7:45pm

Karatsuba for multiplication is a no-brainer IMO. It’s not much code and a great value. Beating GMP is not achievable. But since Python has a large integer type it seems like a value for the users to put some effort into reasonably efficient basic arithmetics on that type? What is the cost? Increased compile time?

Using gmpy2 requires conversion between Python int and gmpy2.mpz type. Not to mention the need to download and possibly compile a pretty big package.

And isn’t “decimal” about arbitrary-precision floating-point arithmetic?

steven.daprano · January 23, 2022, 8:42pm

Hi Tim,

Speaking on behalf of the many Python users who (for various reasons)
have not or can not installed GMP, but nevertheless appreciate Python
bignums and relatively fast multiplication, please don’t regret adding
Karatsuba.

We are grateful for your work and have never been disappointed that
multiplication in pure Python is too fast or wished it were slower.

I don’t think that maintaining the Karatsuba code has been a constant
drain on your time. Am I wrong?

I remember (only just!) the discussion on comp.lang.python (I think it
was?) when you added Karatsuba. In the decades since, I think I can
count the number of proposals to improve the algorithm on the fingers of
one finger wink so that argues against your statement that “there is
no end to this”. I think that one update to the multiplication algorithm
every two decades hardly counts as churn.

And finally, with respect, you’re not as young as you were twenty years
ago. If there are only two numeric experts working on CPython (you and
Mark Dickson, apologies if I have missed anyone else) perhaps we should
not be discouraging a newcomer who, just maybe, could carry the
baton when you or Mark retire.

guido · January 23, 2022, 8:48pm

The cost is the effort to review the code before it can go on and the effort to maintain it forever after. Please don’t do this.

steven.daprano · January 23, 2022, 8:47pm

Oskar,

You may like to look at this thread:

https://mail.python.org/archives/list/python-dev@python.org/message/ZICIMX5VFCX4IOFH5NUPVHCUJCQ4Q7QM/

tim.one · January 23, 2022, 8:53pm

Karatsuba is a large amount of code compared to grade school multiplication. Every line is another potential source of errors, and an eternal maintenance burden. You’ll write your fancy new code once and then go away, leaving it to others to clean up problems forever after .

Compiling packages is just business as usual for Linux-heads. On Windows there are compiled GMP libraries ready to download.

Who is the real target audience? Popular Python packages like mpmath already exploit GMP if it’s installed. People who need very large integers typically need many different kinds of operations on them to “run really fast”. It doesn’t end with multiplication.

Arbitrary precision floats are one thing decimal supports, but that includes arbitrary precision integers (although you need to set the maximum precision in advance). Indeed, when bigint input and/or output eed to be in base 10, decimal can be the way to go for integer arithmetic - CPython’s power-of-2 to/from power-of-10 I/O base conversion algorithms take quadratic time. In decimal, base-10 I/O is close to “free”.

For example, here I cut a user’s runtime from hours to under a minute just by switching to use decimal.

Using a larger base for CPython’s internal “digits” holds more attraction to me, but is also fraught with potential problems. As is, we rely on that a double-width digit product can be obtained in portable plain old C, and held in a native platform C type with at least one bit to spare (and while memory of details has faded, I seem to recall there’s at least one place in the code that relies on that there are at least two bits to spare),. CPython gets ported to a lot of platforms, and the amount of time porters to brand new platforms have had to spend getting Python’s bigints “to work” is approximately 0 seconds.

GMP is eager to use 64-bit “limbs” (digits) on modern boxes, but they also don’t hesitate to dip into assembler to get at the HW instructions that make doing so practical. C is a poor language for this purpose.

But that’s enough from me. In short, my position is that people who really need fast bigint arithmetic are - and always will be - better off using GMP, which is already available, while bigints are “fast enough” already in CPython for casual bigint users. If we had a dozen numeric experts regularly contributing who were keen to add & maintain fancier stuff, fine - but we don’t. In the old days I fixed every numeric problem that showed up, but now it’s much more Mark Dickinson.

Simple changes are welcome, but more than that requires weighing the likely tiny benefit for most users against the likelihood that the original author won’t maintain their work. Karatsuba relies only on very simple algebra, but the more-advanced methods require background most programmers just don’t have. The heavy lifting in decimal is done by libmpdec, an external project maintained by Stefan Krah.

tim.one · January 23, 2022, 9:21pm

No, it’s that there’s no end of ways multiplication could be made ever-faster by implementing algorithms more advanced than Karatsuba. Indeed, that’s what the OP here wants to do.

At the start, it required a long series of checkins before Karatsuba was ready for general use. You could only get at it then by setting the envar KARAT. The writeups you can find about “how to do it” are thoroughly inadequate except for its best case: multiplying two ints with exactly the same number of bits. What if it’s a million bits by 20 thousand? The naive approach in that case ends up being slower than grade school multiplication.

I had to stumble into a better approach. Which has, apparently, been rediscovered many times by now: view that as a 50x1 digit grade school multiplication instead, using Karatsuba to multiply 20000-bit “digits”.

So while Karatsuba is as simple an “advanced” method as there is, there are still subtleties, and they all require more & more code to resolve. It doesn’t “just work” (well, not always efficiently), and someone without serious domain knowledge will thrash at random trying to get it to work well in all cases.

Karatsuba is not “a no-brainer” unless someone who implemented it didn’t stick around long enough to deal with the bad cases bound to pop up .

oscarbenjamin · January 24, 2022, 1:02am

The difference between gmpy2’s mpz and Python’s int is not particularly noticeable until you get up to integers much larger than many Python users would ever want. Even then it’s not like gmp is orders of magnitude faster. The extra speed from extra algorithms is definitely a niche use case so it’s hard to justify since anyone who really needs it can just pip install gmpy2.

Also since no one mentioned it I thought I’d mention flint which adds a bit on top of gmp although it’s not so easy to pip install (conda install python_flint works I think). One of flint’s innovations is actually being faster for small integers at the same time as large integers although that might not be so noticeable when used from Python rather than C.

For the vast majority of Python users speeding up operations with small integers would be much more useful than large integers which are already blazing fast even if other libraries are a bit faster. Of course this really means reducing interpreter overhead rather than implementing fancy integer algorithms.

Other things look more like low-hanging fruit: another thing that gmpy2 has which Python’s stdlib does not is a builtin implementation of rational numbers. Using gmpy2’s mpq rational type instead of the fractions module brings a very noticeable speed difference for numbers of all sizes (much more than using gmpy2’s mpz over Python’s int).

enok · January 24, 2022, 9:09pm

Ok I agree it’s not a big effort to install gmpy2. And conversion between longint and gmpy2.mpz is O(n) operation so insignificant for huge integers.

So perhaps there are more useful ways of spending one’s time.

I’m attaching a plot with execution times for multiplication of two n-bit numbers using Python (3.6 as it happens) built in longint vs gmpy2 (2.0.8). The platform is my 4 year old laptop (Xeon i7-5600U) running Linux. I’m plotting number of multiplications per second on the y-axis and factor bitlength on the x-axis, with log scale on both axes.

It looks like gmpy2 is significantly faster even with “small” integers, around 6 times faster at 1000 bit factors, then in addition asymptotically faster from around 10000 bits so that 10^8 bits are multiplied at roughly the same time as Python does 3*10^6 bits. At 10^7 bits gmpy2 is around 100 times faster.

I’m guessing that the faster “small” integer performance can be due to bigger 64-bit “limbs” in gmp vs Pythons 30-bit digits, plus gmp’s assembler implementation. Apparently the penalty for calling a C extension function is not significant.

The plot seems to support Oscar Benjamins statement that speeding up smaller longint operations may be possible (and of bigger interest).

guido · January 24, 2022, 9:36pm

Yeah, we’re looking into this, see More efficient implementation of integers · Discussion #147 · faster-cpython/ideas (github.com)

tim.one · January 24, 2022, 10:55pm

Nice graph, Oskar! Thanks.

Yes, the number of limbs (internal digits) is probably the driving factor for “smaller” sizes. School book multiplication needs a number of hardware multiplies equal to the product of the multiplicands’ number of limbs. So with limbs half the width, Python will need to ask the hardware for 4 times as many multiplies (as GMP asks for) if the multiplicands’ widths are just doubled.

But it’s worse than just that, because the ratio is actually 64/30 ~= 2.13, and that squared is about 4.5. GMP’s assembler can also exploit things like “add with carry” primitives, which are nearly impossible to get at from portable C.

That’s why boosting the limb size is more attractive to me than anything else of this ilk: it would speed +, -, *, and /, and also for operand size ranges too small to overcome the overheads of fancier algorithms.

I expect the improvement in GMP asymptotics you saw kicked in when GMP gave up on Karatsuba and switched to a fancier algorithm.

I can pretty much guarantee that Oscar had in mind there native machine-size ints, no more than 64 bits. E.g,. 11 * 54 is enormously slower in Python than in C - and also enormously slower in GMP. This has almost nothing to do with the multiplication algorithm used - it’s almost entirely about the overheads in layers of function calls eventually figuring out that there’s almost nothing to be done .

oscarbenjamin · January 24, 2022, 11:47pm

Indeed I did.

oscarbenjamin · January 25, 2022, 12:12am

It’s not my call but I would personally make good use of a C implementation of rational numbers in the stdlib if such a thing existed. If you’re looking for something to work on then to me at least that seems a clearly valuable contribution.

enok · January 25, 2022, 7:50am

I suppose you don’t want to add a few assembler intrinsics, controlled by preprocessor macros, to speed it up on the most important compilers+targets?

Here is another possibly impossible idea - how about a configuration option that selects GMP as integer type? “–enable-big-digits=gmp”.

enok · January 25, 2022, 10:36am

And yes, it’s the Schönhage-Strassen implementation in gmp that outperforms Karatsuba for huge integer size, which is the reason for that asymptote with different angle in the plot

tim.one · January 25, 2022, 6:06pm

Inline assembler depends on details. CPython does almost none of that, anywhere. For example, would this be favoring “most important” platforms at the expense of slowing things down on other platforms? Or do we end up with entirely different algorithms depending on platform (which would make the code more brittle to maintain)?

So there’s no general answer. Depends on details. Note a pattern here: for an established project with millions of users, we’ve learned the hard way that the cost of implementing a thing is dwarfed by the never-ending costs of maintaining it. Complicating the code base is always a hard sell - but for good reasons.

Similarly for a configuration option to use GMP instead of Python’s own int type. That would not be a simple project, and it’s even unclear if anyone would use the result. People who have actual need for fast bigints in Python already use gmpy2, where they also get instant access to the large pile of functions the GMP family provides, which also pursue peak speed regardless of implementation complexity (e.g., I recently lost count of how many different approaches it can employ to implement comb(n, k)). And if what I need is actually bigint comb(), “just” having faster * and // leaves me far short of what can be done to speed that.

Replacing (unconditionally) Python’s ints with GMP’s would be a different thing, and has been tried. Here’s the last attempt I’m aware of. It fizzled out because it was, at the time, slower than sticking to Python’s own int type. There appear to be multiple reasons for that, but chiefly because most ints, in most programs, most of the time, are in fact native platform integers. Guido already posted a link to a discussion of ways to try to exploit that.

enok · January 25, 2022, 9:14pm

Everything you write make sense Tim, thanks for taking the time to explain. Apparently more things have already been tried than I expected.

But, how about then combining the idea of special representation for small integers with the attempt you mentioned to replace larger integers with GMP mpz? Seems like a potentially competitive combo.

oscarbenjamin · January 25, 2022, 10:05pm

I’m not an expert on these things but I would have thought that the biggest obstacle to using gmp in a non-optional way would be the LGPL license.