JIT performance possibilities

Question about this JIT implementation.

As far as I could tell from reading PEP 744, the JIT will somehow compile Python bytecode into machine code before it is executed.

What I don’t understand is why that only gives a 5% increase in performance. (I have seen this figure quoted in more than one source.)

  • Is the fact that a Python bytecode layer exists an inherent limitation?
  • If so, why not get rid of the bytecode layer altogether, or at least aim to do so at some point in the future?

I am by no means an expert in compiler implementation, but Python is known for being an anomaly in the world of software engineering because it is so exceptionally slow.

Compare it to Node.js, for example. Node has pretty good performance despite the fact that it executes JavaScript. IIRC Node executes all code using four threads: two do IO, one does the actual execution of the JavaScript once it has been through the Node JIT stage, and I do not recall exactly what the fourth one does.

You can compare the performance of Node to Python by restricting yourself to code which is CPU intensive, and does not do much IO. (The sensible thing you would do by default anyway.)

Unless I have misunderstood, you typically expect to get much better performance out of a Node.js webserver than a Python one.

A commonly quoted figure is that Python is about 20x slower than most other languages. There is some variance between all languages, but Python stands out as being the anomaly.

Why is this?

  • Is it because Python generates a bytecode which perhaps limits performance in a way that Java (also compiles to bytecode) does not?
  • Is it an inherent limitation of Python’s memory model (object model)? Again I am no expert, but the object model requires a lot of indirection (following pointers) to figure out how to execute a function on an object. On the other hand, I was recently told that CPUs have lots of hardware optimizations to help with virtual function calls and pointer indirection. (A small illustration of this lookup cost follows this list.)
  • Is the JIT just new technology? Should we anticipate seeing much greater improvements in runtime performance in the future?
  • What is the theoretical limit of the JIT? Could we anticipate numbers like a 500% performance gain at some point in the future? (Unless my figures are off, a 5x performance improvement would put Python on par with some of the other languages, maybe Java, for example.)
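
Here is a rough sketch of the kind of indirection I mean, if I understand it correctly (names here are made up for illustration; timings vary by machine):

from timeit import timeit

class Point:
    def __init__(self):
        self.x = 0

    def bump(self):
        self.x += 1

p = Point()

# Each p.bump() walks the instance/type dicts and the descriptor
# protocol before the call; binding the bound method once skips that.
print(timeit(lambda: p.bump(), number=10**6))
bump = p.bump
print(timeit(lambda: bump(), number=10**6))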

PS: I’m aware that in the context of performing big data operations with NumPy or DataFrames, the fact that all the code is implemented at the C level changes the situation somewhat. But this limits Python to a small subset of problems. I don’t know how you would implement a webserver using a Pandas DataFrame, for example.

Basically -

  • The 5% performance gain figure seems a bit too small
  • and more generally, I’m interested to know what future performance gains might be anticipated or achievable

Thanks

Short answer: It’s because right now the JIT is an experiment and the initial focus was on implementing it, not optimizing its performance. This is just the first step out of many.

The long answer is really long and I’m not qualified to answer it–there’s a ton of literature out there to read. You might also enjoy this episode of the core.py podcast. I haven’t listened yet but it’s on my list…

2 Likes

That number is only mentioned in the PEP once, and it’s the criterion for turning on the JIT by default. The PEP says performance is currently flat w/ the JIT.

I don’t think it’s fair to make such a huge, sweeping statement about something as subjective as “Python is slow” and that its performance characteristics are an “anomaly” for an interpreted language w/ such a rich object model. Now you could reasonably say we are faster/slower than a specific language, especially in specific benchmarks, but I disagree w/ categorically saying we are slow.

One thing to note w/ this comparison is Node runs on V8, which is 15 years old, has (I believe) had a JIT from the start, and has had full staffing from Google that entire time. Compare that to us, who just got a JIT in December after companies like Microsoft started funding performance work a couple of years ago (along w/ other companies and the help of volunteers), and it isn’t a straight comparison between JITs.

Probably, but there’s no guarantee. With the JIT only being 6 months old there’s still a ton of optimizations to implement, both to overcome the overhead a JIT introduces and to take advantage of what a JIT lets us do. It’s just going to take time and effort to move things forward.

5 Likes

I don’t know any languages which are slower than Python. But this is beside the point.

  • My reply was not to say “Python is bad”
  • (it isn’t “bad” because it is slow, the fact that it is slow is just a disadvantage)
  • (I still use Python everyday, and I use it because performance is not a priority for what I am doing)

The point of my reply was to ask what it might be realistic to expect from the (seemingly relatively recent) interest in improving performance, and to find out what might be an inherent limitation due to some other factor. (I mentioned the object model already.)

  • (I think we can all agree there has been little to no serious interest in improving the runtime performance of Python before - let’s say - 3.11.)
  • Prior to this, seemingly most people just denied that performance matters whenever Python is mentioned…
  • …or as has just been shown in this case, denied that Python is slow, despite the fact that anyone who has used it to build any serious production software is well aware of this limitation, and a huge amount of industry effort is put into working around it

Just to weigh in with my own opinion on this so you can see where I am coming from:

  • If Python is too slow for your use case, you are using the wrong tool for the job

which is a neutral statement. I neither claim Python is good nor do I claim it is bad.

With that out of the way, I would still be interested to learn more detail about some of my previous questions. I’m not particularly interested in this thread being side tracked into a “my language is better than yours” argument.

1 Like

I imagine you would have better speedup with np.arange.

This is missing the point though. @hypernova asked some reasonable questions. Perhaps they were not phrased tactfully, but the follow-up clarifies:

We all know that there are many basic things that are faster in C++ than in Python so we don’t need to debate that or construct artificial benchmarks to argue about it. Sometimes those things don’t matter but the general speed differences are known and sometimes they definitely do matter.

This whole thread is to discuss a proposal that might improve this disparity in performance for many elementary operations. Hopefully it will but that is not known yet. Either way it is reasonable for someone to ask questions about it here.

5 Likes

I would recommend checking out the faster-cpython project and its various threads, documents and presentations. It’s what has been driving changes to the language, like adding a JIT and the prior work to add a specializing interpreter.
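
Incidentally, you can watch the specializing interpreter at work using nothing but the standard dis module. This is a minimal sketch; the exact instruction names vary by CPython version:

import dis

def total(nums):
    result = 0
    for n in nums:
        result += n
    return result

# Run the function enough times for the adaptive interpreter to
# specialize (quicken) its bytecode.
for _ in range(100):
    total(range(100))

# On CPython 3.11+ this shows specialized instructions such as
# BINARY_OP_ADD_INT in place of the generic BINARY_OP.
dis.dis(total, adaptive=True)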

IIRC, the initial stated goal for that project was a 5x speedup over Python 3.10.

I think your assessment is correct that this early work and the introduction of the JIT itself don’t do much on their own, but instead unlock many future speedups.

You also mention memory layout, and that has been looked at and improved within the scope of this project as well. Process startup time is another factor that’s been looked at.

I can’t reply to his post because it has been flagged as abusive, again. But -

  • Let’s not get hung up on this benchmark - it will always be possible to skew it one way or the other by writing worse code in one language or better code in another.

The reason why I chose the Python code I did, with a list rather than a NumPy array, was to deliberately avoid vectorization. In other words, to have code which runs at the Python interpreter level and not at the C level. Using NumPy or a DataFrame would have been pointless, because then you are comparing C to C++.
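
To make the distinction concrete, here is a sketch of the two approaches (my reconstruction for illustration, not the original flagged code; timings vary by machine):

import time
import numpy as np

# Interpreter-level version: a million separate Python int objects.
t0 = time.perf_counter()
py_list = list(range(1000000))
t1 = time.perf_counter()

# Vectorized version: the fill loop runs inside NumPy's C code.
np_array = np.arange(1000000)
t2 = time.perf_counter()

print(f"list(range(...)): {t1 - t0:.4f}s, np.arange(...): {t2 - t1:.4f}s")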

The distinction between C and Python is not really so clear-cut, though. All Python operations ultimately translate to something that is implemented in C (at least for CPython).

Apart from the interpreter overhead, the big difference between the C++ and NumPy versions and the Python code that you showed is the use of machine-precision integers on the stack rather than Python’s heap-allocated arbitrary-precision integers.

Ultimately, if the code needs to create the object list(range(1000000)), then this object needs to be represented in memory somehow, and the current way that this is done in CPython requires 1000000 heap-allocated PyLong structs. The JIT or adaptive interpreter can only do so much to improve performance if the output necessarily involves 1000000 heap allocations.
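
You can see that per-object cost directly from Python; the numbers below are typical for a 64-bit CPython build and may vary:

import sys

nums = list(range(1000000))

# The list itself stores only PyObject* pointers (8 bytes each on a
# 64-bit build), roughly 8 MB here...
print(sys.getsizeof(nums))
# ...but every element is also a separate heap-allocated PyLong,
# typically 28 bytes for a small int.
print(sys.getsizeof(nums[1000]))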

On the other hand the list created is not used for anything and all the code that you time is just dead code. Here is the C++ function that you timed:

#include <vector>

void run() {
    std::vector<unsigned long long> my_array(1000000);
    for (std::size_t i = 0; i < my_array.size(); ++i) {
        my_array[i] = i;
    }
}

This is what happens if I compile it with -O2 and look at the generated machine code:

$ gcc -c -std=c++20 -O2 t.cpp
$ objdump -d t.o

t.o:	file format mach-o arm64

Disassembly of section __TEXT,__text:

0000000000000000 <ltmp0>:
       0: d65f03c0     	ret

There is only one instruction, ret, in the compiled function. The function was reduced to a no-op by the compiler because all of its code is just dead code. If we also had a function that called the run function then the compiler would likely optimise the call away entirely.

Dead code elimination is an important optimisation in combination with other optimisation techniques but benchmarking purely dead code is usually not interesting. A more realistic function that loops over range(1000000) and does something useful would need to have some meaningful output which would likely not be the list of 1000000 integers since that is better represented in memory by the range object itself.

Something like return sum(range(1000000)) is a slightly better benchmark, although some compilers can optimise this to use the formula for triangular numbers. It is still not a completely fair comparison, though, because even with unsigned long long the sum would overflow for an input like 10**10 in C, whereas the Python code would not overflow even for extremely large inputs, albeit running more slowly for smaller ones.
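
For example, a sum-based version at least forces a result to be computed, and the closed-form shortcut that an optimising compiler might substitute looks like this (a sketch; timings vary):

from timeit import timeit

def summed(n):
    return sum(range(n))      # the loop still runs, inside C

def closed_form(n):
    return n * (n - 1) // 2   # triangular-number formula

assert summed(1000000) == closed_form(1000000)
print(timeit(lambda: summed(1000000), number=10))
print(timeit(lambda: closed_form(1000000), number=10))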

A more interesting optimisation for CPython would be if the adaptive interpreter could identify that range(1000000) will generate machine-sized integers and could use them directly without allocating PyLong structs on the heap. If the adaptive interpreter could do that, then the corresponding microinstructions might be combined by the copy-and-patch JIT to produce something that much more closely resembles the machine code for the corresponding operations in C.

Alternatively if small integers could be represented inline somehow rather than on the heap then that would likely speed up many things. I believe that is one of the optimisations Node uses to speed up things like looping over range(1000000).

4 Likes

It’s early days for the JIT in CPython, but there’s nothing inherent in Python preventing a JIT from optimizing this much better, so I would expect with time CPython’s JIT will do much better on this also.

Whenever performance optimization is talked about, people come around with small code snippets that may or may not represent actual use of the language. But just for fun I tried the snippet from hypernova on PyPy and GraalPy (simply installed the latest of each via pyenv) and it ran ~30x faster than on CPython on my machine (not scientific, just ran the script a dozen times on each interpreter and averaged). I think for this kind of thing CPython’s JIT might very well reach good speedups too; the question is whether this means anything for actual real-world applications, though.

6 Likes

The latter is also done by some other languages. It adds a measure of complexity that would have to be accepted by every C extension, but can drastically speed things up. The most straightforward way would be for any integer between -2**62 and 2**62 (or 2**30 on a 32-bit build) to be represented as n*2+1 instead of a pointer, which means that a PyObject* with the low bit set isn’t actually a pointer but an integer with no heap storage. The use of this value for the object ID (as is done for all other objects in CPython) would still be valid. There’d need to be provision for allocating a bound method object attached to this non-heap object, though I suspect that’s no different from any other case. These pseudo-objects wouldn’t have reference counts, but I don’t think that would be a problem, as it’d be the same as other immortal objects? Might require special-casing sys.getrefcount to always return 2 or something.

I think the biggest problem would be that this would force special-casing of integers in all kinds of super-generic calls, though. ANYTHING, literally anything, that looks at a PyObject*, would need to first check to see if it’s an integer. But it would give a huge improvement in memory (de)allocation for moderate-sized integer objects, and it would be invisible to Python code (unlike the Py2 split between int and long).
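
As a rough illustration of the encoding being described (Python is used here just to show the arithmetic; a real implementation would live in C inside the interpreter):

# Tagging scheme: odd machine word => immediate integer,
# even machine word => a real PyObject* pointer.
def tag_int(n):
    assert -2**62 <= n < 2**62, "must fit in 63 bits"
    return (n << 1) | 1        # n*2+1: low bit set marks "not a pointer"

def is_tagged_int(word):
    return word & 1 == 1

def untag_int(word):
    return word >> 1           # arithmetic shift recovers the value

word = tag_int(42)
assert is_tagged_int(word) and untag_int(word) == 42
assert untag_int(tag_int(-5)) == -5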

2 Likes

There is a recent faster-cpython proposal to do a variation of this. If I understand correctly, a pointer would be stored as ptr-1, having 11 in the lowest bits, and small integers would be stored as something like (unsigned)val << 2, having 00 in the lowest bits. This representation would possibly only exist in the interpreter stack rather than being presented to third-party extensions or other CPython internals.

3 Likes

I think mypyc does something similar for integers: they can be a plain number if small and a pointer if big (although only where the type has been inferred as an integer, not universally, so it doesn’t have to pay the cost for every Python object).

That is such an open-ended question I think it’s effectively unanswerable. The hope is the JIT will be faster, and w/ it being neutral in perf right now, that suggests the low-hanging optimizations will lead to wins. But that’s as far as anyone can go right now in terms of certainty.

That’s simply not true. Please go read the history of e.g. the Unladen Swallow project.

I’m going to ask that all comments about Python being fast or slow, suggestions that the core developers didn’t take performance seriously or at all, etc. not continue. They come off as at least dismissive to those of us who have worked to make Python as fast as it is over its storied history. I will also ask that comments stay on topic about the PEP; people can always start new topics about general JIT questions if they don’t fall under the PEP.

14 Likes

You can point out ways in which Python has drawbacks compared to other languages. Just make sure not to lie or misrepresent the efforts others have made, especially after being explicitly told not to do that.

If, while doing that, you insult the developers of that language, it will get flagged. (And if it’s completely off topic it will probably get flagged as well.)

There is plenty of point in this forum existing, and no, it is not used to suppress information. You can have a perfectly reasonable discussion about these things here if you do it respectfully. (And I also think that others could respond less defensively/more respectfully as well.)

This thread, though, is not really the right place for a general discussion about these things. At this point the discussion has hijacked this thread, which is supposed to be about the PEP proposal. Probably the last ~20 posts should be moved to a different thread.

13 Likes

As a seasoned core developer I am a bit surprised by the defensiveness of some responses here. I did contribute my share of performance improvements over time (though I’m not involved in the impressive work sparked by @markshannon a couple of years ago), and I cannot imagine being offended by the suggestion that CPython’s interpreter is historically slow because, well… it is.

A lot of work was poured into projects like Cython or Numba, for example, to avoid interpreter overhead in specific cases. Past projects like Psyco or Unladen Swallow, to name a couple, have tried to speed up the interpreter. PyPy, at the time, was started with the belief that it was difficult to make CPython faster, and that an entirely different architecture and community setting were necessary. These are facts that are well-known by anyone who has been interested in the subject of Python performance in the last 20 years.

And, yes, CPython and its ecosystem can still be plenty fast for some tasks, especially where its open architecture and rich C API allow for fruitful collaboration between bytecode and native code.

17 Likes

I think that’s where the problem arises. If one isn’t familiar with the history it is easy to make statements that seem dismissive of all the work that’s been done so far.

3 Likes

This is a pretty good starting point:

Most ideas that have been suggested have been tried already. E.g. tagged pointers for things like integers that fit into machine words are a common implementation technique. I did a quick-and-dirty prototype some years ago during a core sprint. Getting it sort of working was easier than I expected; making it work for real would be massively more work. The performance gains were not amazing.

Python is hard to optimize for a number of reasons, so comparing it to something like JavaScript is not very helpful. For CPython, one of the big constraints is compatibility with C extensions. The extensions are a big part of Python’s success. For many other language implementations, doing an extension like numpy (where you add new types to the language that integrate fairly seamlessly) is not so easy. Alternative Python implementations have struggled to efficiently support extensions (both PyPy and Skybison, for example). The Skybison example is interesting in that the core part was quite a bit faster than CPython (as I understand it) but once C extensions were used, a lot of the gains disappeared.

A lot of work is still going on (faster-cpython project, free threading, C API overhaul). If you are interested in this kind of thing, do some reading first to find out what’s been done and what’s being worked on.

7 Likes

Regarding general performance of Python

I think Python is not as slow as some think.

Naive Python usage is often slower in comparison. However, if one is artful in writing efficient Python, there are many ways to make it faster. And when one utilises an appropriate combination of performance improvement techniques, it is often possible to write code whose performance is highly competitive with other languages. (A small example follows below.)

(of course, only if one doesn’t compare apples with oranges)
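
One of the simplest such techniques is picking the right data structure; this sketch contrasts O(n) list scans with O(1) set lookups (timings vary, but the gap is typically large):

from timeit import timeit

needles = list(range(1000))
haystack_list = list(range(100000))
haystack_set = set(haystack_list)   # same contents, hash-based lookup

# Linear scans of the list vs constant-time membership tests.
print(timeit(lambda: [n in haystack_list for n in needles], number=10))
print(timeit(lambda: [n in haystack_set for n in needles], number=10))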

In time, with all the effort being put into it, I think Python will be as fast as other languages of a similar abstraction level, without the need to be very artful about it.

Regarding JIT

A quick read of “Python 3.13 gets a JIT” answered most of my questions about what it is and how it differs from other JITs.

Of course, this is only a high-level overview, but it was enough for me to put it into general context.

A few questions to those who know more about it:

  • Is this article accurate/correct?
  • Is something important missing?
  • Is there something that should be corrected?

I get the impression from previous discussions that the two of us have had that we do not write similar code for solving similar problems.

It is also often not possible to write Python code that is highly competitive with other languages. I have hit these limits many times and done measurements, benchmarks, tried many variations of optimisation and so on. Many times I have ended up writing at least part of the code in another language like C or writing wrapper code to call into something like a C library.

I have used other languages like this because I found that there was just no way to get within a factor of 10x (sometimes 100x!) the speed that the C library achieves and for me that is very often too big a speed difference to pass over. If that kind of speed difference isn’t a problem for you then you are likely writing very different code from me for solving very different problems.

It is a strength that Python makes it sort of easy to call into other languages, but as the maintainer of a Python package that wraps a C library I can tell you that packaging that up is a lot of work. I would rather the need for this was much less, so that we could all do it less often and just write/use Python code that runs fast enough for the task at hand.
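
For what it’s worth, the “easy” end of this is genuinely easy; a minimal ctypes sketch like the one below (using libm’s cos() as a stand-in for a real library) takes a few lines, and it’s the building, packaging and distributing of real compiled extensions that costs the effort:

import ctypes
import ctypes.util

# Load the C math library and call cos() from Python.
# (POSIX-flavoured; library lookup differs on Windows.)
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))  # prints 1.0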

9 Likes