Expected performance characteristics of subinterpreters

I’m testing the subinterpreters interface because I’ll likely be taking advantage of it to improve scaling of this library across multiple threads. Subinterpreters are attractive because:

  • I need more flexible shared memory than multiprocessing can provide. Most of my parallelized work fills different parts of shared arrays. Although multiprocessing.SharedMemory is an option, I need to fill a lot of arrays of different sizes, and I want users to be able to delete some of them after reading (and get their memory back). Allocating lots of SharedMemory objects as named files in the filesystem is not a good option.
  • I need to release the GIL for different threads of Python code. Although the important parts of the parallelized work are in compiled extensions that release the GIL (decompression and NumPy), there are enough Python steps between the GIL-released steps that scaling is killed by Amdahl’s law.
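
To make the Amdahl’s law point concrete, here is a small illustration with assumed serial fractions (the numbers are made up for illustration, not measurements from this library):

# Amdahl's law: with a serial (GIL-held) fraction s, the best possible speedup
# on n workers is 1 / (s + (1 - s) / n).
def amdahl_speedup(s, n):
    return 1.0 / (s + (1.0 - s) / n)

for s in (0.05, 0.10, 0.25):
    print(f"serial fraction {s:.0%}: max speedup on 16 cores = {amdahl_speedup(s, 16):.1f}x")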

So subinterpreters seem like a perfect fit. I’ve read PEP 554 and PEP 734 and have been eagerly awaiting the beta release. (Queues/channels didn’t work in the alpha release.)

I just tried it out and learned two things:

  1. Launching new subinterpreters is slower than launching new processes. This is a surprise to me. (I’ll show code below.)
  2. Sending data to and from subinterpreters is a lot faster than sending data to external processes. This is not a surprise.

Here’s some code and some timing numbers from a 3 GHz, 16-core computer running Linux. All of the scripts share the same imports:

import time
import multiprocessing
import threading

# The PEP 734 high-level API isn't public in 3.13.0b1, so import the wrapper
# that ships with the test suite.
from test import support
from test.support import import_helper

_interpreters = import_helper.import_module("_interpreters")
from test.support import interpreters

First, to compare launching times of subinterpreters and processes:

def in_subinterp():
    2 + 2

def in_thread():
    # Each thread creates its own subinterpreter, runs a trivial call in it
    # (to be sure it has really started), and then shuts it down.
    subinterp = interpreters.create()
    subinterp.call(in_subinterp)
    subinterp.close()

starttime = time.perf_counter()

so_many = []
for _ in range(10000):
    so_many.append(threading.Thread(target=in_thread))

for x in so_many:
    x.start()

for x in so_many:
    x.join()

print(time.perf_counter() - starttime)

and

def in_process():
    2 + 2

starttime = time.perf_counter()

so_many = []
for _ in range(10000):
    so_many.append(multiprocessing.Process(target=in_process))

for x in so_many:
    x.start()

for x in so_many:
    x.join()

print(time.perf_counter() - starttime)

Launching 10 thousand subinterpreters took 11.1 seconds, while starting 10 thousand processes took 7.3 seconds: a factor of 1.5. It was a lot worse with call_in_thread (74.4 seconds for the subinterpreters), but I think that might be at least partly due to blocking between calls. Above, both scripts start a set of 10 thousand threads/processes; the interpreters start independently in each, calculate 2 + 2 (to be sure they’ve really started), and then shut down. If thread start-up times are equal to process start-up times (it’s Linux), then starting each subinterpreter is doing something that costs… 6 ms more than forking? ((11.1 - 7.3) / (10000 / 16)… something like that.)

As for upper limits, the number of processes is constrained by ulimit, which is problematic for a Python library because ulimit is configured outside of Python. Subinterpreters don’t seem to have such a limit, though on one of the tests of 10 thousand subinterpreters, I got this non-reproducible error:

  File "/home/jpivarski/tmp/subinterpreters/many-subinterps.py", line 17, in in_thread
    subinterp = interpreters.create()
  File "/home/jpivarski/tmp/subinterpreters/Python-3.13.0b1/Lib/test/support/interpreters/__init__.py", line 76, in create
    id = _interpreters.create(reqrefs=True)
interpreters.InterpreterError: interpreter creation failed
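
As an aside on the ulimit point: the per-user process limit that constrains multiprocessing can at least be inspected from within Python, e.g. with the standard resource module (a minimal sketch; the values are entirely system-dependent):

import resource

# "ulimit -u": the per-user cap on processes (threads count toward it on Linux)
# that bounds how many multiprocessing workers can exist at once.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC: soft={soft}, hard={hard}")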

Next, to measure communication times:

def in_subinterp():
    # Runs inside the subinterpreter; to_id and from_id are injected into its
    # __main__ namespace by prepare_main() below.
    from test.support.interpreters import queues

    to_subinterp = queues.Queue(to_id)
    from_subinterp = queues.Queue(from_id)

    total = 0
    while True:
        obj = to_subinterp.get()
        if obj is None:
            break
        total += obj

    from_subinterp.put(total, syncobj=True)


from test.support.interpreters import queues

to_subinterp = queues.create()
from_subinterp = queues.create()

starttime = time.perf_counter()

subinterp = interpreters.create()
subinterp.prepare_main({"to_id": to_subinterp.id, "from_id": from_subinterp.id})
subinterp.call_in_thread(in_subinterp)

for x in range(10000000):
    to_subinterp.put(x, syncobj=True)

to_subinterp.put(None, syncobj=True)

total = from_subinterp.get()

print(time.perf_counter() - starttime)

and

def in_process(to_process, from_process):
    total = 0
    while True:
        obj = to_process.get()
        if obj is None:
            break
        total += obj

    from_process.put(total)


to_process = multiprocessing.Queue()
from_process = multiprocessing.Queue()

starttime = time.perf_counter()

process = multiprocessing.Process(
    target=in_process, args=(to_process, from_process)
)
process.start()

for x in range(10000000):
    to_process.put(x)

to_process.put(None)

total = from_process.get()
print(time.perf_counter() - starttime)

Sending 10 million integers to one subinterpreter using a Queue took 6.1 seconds, whereas sending 10 million integers to one process using a Queue took 43.0 seconds. That’s a factor of 7 in the subinterpreters’ favor, and I expected something like that.

For completeness, here’s a script to get a baseline (single-threaded, performing the same computation):

starttime = time.perf_counter()

total = 0
for x in range(10000000):
    total += x

print(f"main {total = }")

print(time.perf_counter() - starttime)

It took 0.8 seconds, so the work that the subinterpreter Queue is doing to send data is about 8× more expensive than adding integers: about 0.5 µs per queue item ((6.1 - 0.8) seconds / 10000000 items). Not bad!

The next thing that would be interesting to test is the scaling of Python code that updates a shared array in subinterpreters. I personally believe that the scaling would be close to perfect (running into issues only at ~1 Gbps, when all of the threads are trying to pull data over the same memory bus, just as a C program would), but I couldn’t test it because I don’t know how to install a package like NumPy against a manually compiled Python (I normally let Conda manage my environments), and this workaround using ctypes:

import array

big_array = array.array("i")
big_array.fromfile(open("/tmp/numbers.int32", "rb"), 16*10000000)
pointer, _ = big_array.buffer_info()

# pass the (integer) pointer to the subinterpreter and then, inside it
# (after an `import ctypes` there):

big_array = (ctypes.c_int32 * (16*10000000)).from_address(pointer)

didn’t work because

Traceback (most recent call last):
  File "/home/jpivarski/tmp/subinterpreters/subinterp-multithread.py", line 24, in in_subinterp
    import ctypes
  File "/home/jpivarski/tmp/subinterpreters/Python-3.13.0b1/Lib/ctypes/__init__.py", line 8, in <module>
    from _ctypes import Union, Structure, Array
ImportError: module _ctypes does not support loading in subinterpreters

If all of the performance results above are as expected, then it suggests a usage strategy for subinterpreters:

  • Create them sparingly, like OS threads/processes, and unlike green threads. I can see now why a pool-style interface is anticipated: we’re not going to want to create and destroy these subinterpreters often. I don’t know why they’re noticeably more heavyweight than processes, but even if they were equal, this usage strategy would apply.
  • Communication with the subinterpreter (using shareable data types) is relatively inexpensive. It only costs 0.5 µs or so to send data through a Queue. It’s certainly good enough for sending brief instructions to a mostly-autonomous subinterpreter.
  • The bulk processing should operate on arrays owned by the main interpreter. With NumPy, a pointer or a memoryview can be sent to each subinterpreter and they can be viewed with np.frombuffer. Although I couldn’t test it, this kind of work ought to scale as well as C code.
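
For illustration, here is a minimal sketch of that last point (it runs in a single interpreter here, since I couldn’t import NumPy in a subinterpreter yet; in practice the pointer and length would be sent to each subinterpreter as plain integers):

import ctypes
import numpy as np

# Stand-in for an array owned by the main interpreter.
backing = (ctypes.c_int32 * 8)(*range(8))
pointer, length = ctypes.addressof(backing), len(backing)

# What each subinterpreter would do: rebuild a zero-copy view from the address.
view = np.frombuffer(
    (ctypes.c_int32 * length).from_address(pointer), dtype=np.int32
)
view[:4] *= 10        # each subinterpreter would write to a disjoint slice
print(list(backing))  # the owner sees the updates: [0, 10, 20, 30, 4, 5, 6, 7]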

Is that right? This is a question for the subinterpreter developers—are these rough performance numbers and interpretations in line with what you expect?


Quick follow-up: I ought to install Python 3.13.0b1 and NumPy (from git) in a Docker container. That way, I can use pip to install NumPy and know that I’m not clobbering any Conda environment.

Inside Docker:

./Python-3.13.0b1/python -m ensurepip
./Python-3.13.0b1/python -m pip install git+https://github.com/numpy/numpy.git
# then use NumPy...

Thanks for testing it out! I suspect you already realize this, but do note that the subinterpreters PEP(s) have not yet been accepted (though I fully expect them to be eventually), so you are treading on somewhat shaky ground here.

cc @eric.snow; he’s the driving force behind subinterpreters, but I’m not sure if he follows this section of the forum.

Uh-oh! I didn’t know that. I signed up to do a talk based on this (many months from now), so if it doesn’t get added to 3.13, I may need to retract that. Not a problem, though, and I see you’re confident that it will be accepted.

I’ll keep testing it out and giving feedback as a user!

I’m a bit surprised by your finding on startup time as well - note that this blog post by Anthony Shaw concluded from another benchmark that subinterpreter startup is currently 10x faster than multiprocessing startup.

That’s why I wrote this post: the start-up time is not what I’d expect. The benchmark in that blog post is:

def bench_threading(n):
    # Code to launch specific model
    for _ in range(n):
        t = Thread(target=f)
        t.start()
        t.join()

def bench_subinterpreters(n):
    # Code to launch specific model
    for _ in range(n):
        sid = subinterpreters.create()
        subinterpreters.run_string(sid, "")

def bench_multiprocessing(n):
    # Code to launch specific model
    for _ in range(n):
        t = Process(target=f)
        t.start()
        t.join()

which differs from my test in that the threads and processes are required to start and finish in serial (each thread/process has to finish before the next can start), and the work in the subinterpreters is also done serially (each run_string has to finish before the next can start), although the subinterpreters are not closed, which is different from how the threads/processes are treated.

In my tests, I created all of the threads, subinterpreters, and processes before starting any of them and let them all run in parallel (with the subinterpreters in threads on the main interpreter), measuring wall time. The subinterpreter test does strictly more work than the thread test, since the subinterpreters run in threads, but I was more interested in the comparison with multiprocessing anyway.

The scale seems to be consistent: Anthony Shaw saw about 9 ms per subinterpreter, which is the same order of magnitude that I saw (6 ms). We’re doing different tests and comparing different things.

The default start method will change to "spawn" in 3.14 but is still "fork" in 3.13 on Linux, so the new processes are presumably skipping the work of initializing the interpreter. Wouldn’t that explain the results?

Adding multiprocessing.set_start_method("spawn") at the top of your module should put multiprocessing on an equal footing with subinterpreters in that case.

I thought subinterpreters used os.pipe() to communicate and would thus have basically the same performance as multiprocessing here, so I was surprised.

Are the subinterpreter channels using a different and more efficient mechanism? If so, I hope that part is not what’s getting cut from the PEP and moved to PyPI.

I added

multiprocessing.set_start_method("spawn", force=True)

to the start of the script (force=True is necessary). Both Python 3.10 and 3.13.0b1 now take 330 seconds to launch 10 thousand processes, compared to (still) 11.8 seconds to launch 10 thousand subinterpreters, so the conclusion swings the other way!

This 5.5 minutes for multiprocessing is worse than linear scaling: it scales linearly up to about 1 thousand processes, and in that regime, it’s 5 ms per process. Just launching the python command hundreds of times (with a bash loop) yields 10 ms per python -c 'exit()', so that sounds about right.
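
For reference, a rough way to reproduce that bare-launch baseline from Python rather than bash (timings are obviously system-dependent):

import subprocess
import sys
import time

# Time bare interpreter launches, analogous to `python -c 'exit()'` in a bash loop.
n = 200
starttime = time.perf_counter()
for _ in range(n):
    subprocess.run([sys.executable, "-c", "exit()"], check=True)
print(f"{(time.perf_counter() - starttime) / n * 1000:.1f} ms per launch")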

Anthony’s measurement of multiprocessing start-up is 100 ms per process, but his test script runs them serially, whereas I launch them in parallel. 5-ish ms may be the start-up time that can’t be parallelized, with an additional 95 ms to send the task and wait for it to finish sequentially.

So here’s a perhaps more appropriate comparison:

import multiprocessing
multiprocessing.set_start_method("spawn", force=True)

import time

def in_process():
    2 + 2

if __name__ == "__main__":
    starttime = time.perf_counter()

    so_many = []
    for _ in range(1000):
        so_many.append(multiprocessing.Process(target=in_process))

    for x in so_many:
        x.start()

    for x in so_many:
        x.join()

    print(time.perf_counter() - starttime)

and

import time
import multiprocessing
import threading

from test import support
from test.support import import_helper

_interpreters = import_helper.import_module("_interpreters")
from test.support import interpreters


def in_subinterp():
    2 + 2


def in_thread():
    subinterp = interpreters.create()
    subinterp.call(in_subinterp)
    subinterp.close()


if __name__ == "__main__":
    starttime = time.perf_counter()

    so_many = []
    for _ in range(1000):
        so_many.append(threading.Thread(target=in_thread))

    for x in so_many:
        x.start()

    for x in so_many:
        x.join()

    print(time.perf_counter() - starttime)

yields 5.66 seconds for a thousand processes and 1.10 seconds for a thousand subinterpreters. With spawn (initializing the interpreter in each process), I find that threads+subinterpreters launch 5× faster than processes in parallel. (Anthony finds 10× sequentially.)

I didn’t think so, and I hope not! There’s no reason why a subinterpreter would need to send data through an OS pipe, with the OS calls and the serialization that implies.

In principle, it needs to either limit itself to shareable types or deeply copy mutable objects (except the data referred to by a memoryview), but it can do that in the shared heap of the process, while keeping the Python objects in distinct subinterpreter spaces (i.e. the list of objects that each subinterpreter’s garbage collector manages). From the performance numbers, I assume that’s what it’s doing.
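
A quick way to see which objects count as “shareable” in a given build (assuming the low-level _interpreters module still exposes is_shareable(), as it did in the 3.13 betas):

from test.support import import_helper

_interpreters = import_helper.import_module("_interpreters")

# Expect simple immutable values to be shareable and mutable containers not to be.
for obj in (1, 3.14, "text", b"bytes", (1, 2), [1, 2], {"a": 1}):
    print(f"{type(obj).__name__}: {_interpreters.is_shareable(obj)}")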


Presumably you’d want to compare against the fastest implementation of multiprocessing, i.e. force use of fork in Python 3.14, not bias your results towards the “desired” outcome of subinterpreters being faster?

I don’t have a stake in this (a desire for subinterpreters to be faster); I just want to see that the performance numbers make sense, as a way of verifying that my mental model of what subinterpreters do is correct. The actionable knowledge that I got from this is that subinterpreters are not orders of magnitude faster to launch than multiprocessing, so as a user, I wouldn’t use them in a different way. (Before doing this, I thought it might be reasonable to launch a subinterpreter for each task, but now I know that I won’t be doing that. I’ll be using a pool, like one would with threads or processes.)

In an early test, before my first message, subinterpreters seemed to be 7× slower than multiprocessing, which was very surprising, since I’d expect them to be doing strictly less work. In my first message, subinterpreters seemed to be 1.5× slower than multiprocessing, which is still surprising. But now I know that forked Python processes skip a step of initialization, so it’s no longer surprising (not strictly less work). With spawned processes (now strictly less work), subinterpreters seem to be 5× faster than multiprocessing, which at least is making sense.

Even if everything about subinterpreters was the same speed as multiprocessing (both launching and sending data), subinterpreters would be a better fit to my use-case because:

  • The total number of processes is controlled by something outside of Python: the ulimit (because multiprocessing uses os.pipe to communicate). Subinterpreters don’t have that limitation (or I haven’t observed it in 10 thousand subinterpreters). Users of my library won’t be sending complaints that it doesn’t work on their system, only to find out that their system has a different ulimit.
  • I should be able to share NumPy arrays between subinterpreters and update them in place, as long as each subinterpreter acts on different array indexes. (I still need to get NumPy into Python 3.13 and test that!) With multiple processes, I’d have to use OS SharedMemory, which has constraints on it that I can’t pass on to my downstream users.

So I was doing these performance tests to make sure there aren’t any upcoming surprises.

(Incidentally, I did test multiprocessing on both Python 3.13 and Python 3.10, because I have the latter easily accessible elsewhere. Python 3.13 spawns processes about 30% faster than Python 3.10. Maybe there’s more progress coming in Python 3.14, but it can’t be a giant factor, like the 5× we see between multiprocessing and subinterpreters in Python 3.13, right?)


I also wanted to experiment with sharing NumPy arrays across interpreters, using early builds of 3.13, but couldn’t get it to work (even with importlib.util._incompatible_extension_module_restrictions): using NumPy in a subinterpreter would crash. This is my biggest concern with subinterpreters currently: ecosystem support. I don’t know of any popular extensions that can be used. Cython and PyO3, to my knowledge, explicitly block such imports, and little progress has been made on supporting them.
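
For reference, that escape hatch is used roughly like this (a sketch; the helper is a private, experimental API, and as noted above it didn’t prevent the crash):

import importlib.util

# Temporarily allow importing extension modules that don't declare
# multi-interpreter support (run inside the subinterpreter).
with importlib.util._incompatible_extension_module_restrictions(disable_check=True):
    import numpy as np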


@jpivarski, I think you’re seeing regular Python threading GIL lock contention. Once the subinterpreters are underway and running in_subinterp, there’s no lock contention, because the subinterpreters each have their own GIL. But for a few seconds, all the threads are trying to run in_thread at the same time, which is basically a worst case for CPython performance.

In a realistic production scenario, where you’re creating threads+subinterpreters in a more spread out fashion, the overhead from creating subinterpreter threads may be much smaller.


Update: in Python 3.13.0rc2, the array/ctypes approach now works in subinterpreters (the _ctypes import no longer fails). The following can parallelize an array operation over 16 subinterpreters, all running in different threads:

import time
import multiprocessing
import threading
import array
import ctypes

from test import support
from test.support import import_helper

_interpreters = import_helper.import_module("_interpreters")
from test.support import interpreters

from test.support.interpreters import queues


big_array = array.array("i")
big_array.fromfile(open("/tmp/numbers.int32", "rb"), 16 * 10000000)
pointer, _ = big_array.buffer_info()


def in_subinterp():
    print("begin in_subinterp")

    from test.support.interpreters import queues

    done = queues.Queue(qid)

    import ctypes

    big_array = (ctypes.c_int32 * (16 * 10000000)).from_address(pointer)

    for i in range(10000000):
        big_array[start + i] *= 10

    done.put(start)

    print("end in_subinterp")


done = queues.create()

just_16 = []
starts = set()
for i in range(16):
    print(f"{i = }")

    subinterp = interpreters.create()
    subinterp.prepare_main({"pointer": pointer, "start": i * 10000000, "qid": done.id})
    subinterp.call_in_thread(in_subinterp)
    just_16.append(subinterp)
    starts.add(i * 10000000)

dones = set()
while dones != starts:
    dones.add(done.get())

for x in just_16:
    x.close()

print(big_array[:25])

However, attempting to use NumPy causes a segfault. I’ll be looking into this more and following up with NumPy.

NumPy doesn’t support subinterpreters yet. To do so (at least as far as I know) would require converting NumPy’s extension modules to use multi-phase initialization.


Back on the original fork vs. subinterpreter startup time question: as folks surmised, there’s no copy-on-write mechanism for subinterpreter state initialisation, so it’s expected that their startup times end up in between fork (process overhead, but copy-on-write minimises initialisation time) and spawn (both process and initialisation overhead).


I share your surprise. Being strictly in userspace and reusing startup state should make subinterpreter creation/destruction fast. The slow start might be because the code is still rough.

Comprehensively reusing startup state doesn’t currently happen. There’s some sharing of immutable state, and process-level resources are shared (this is why subinterpreters are faster than spawn-mode processes, which have to fully initialise a new main interpreter), but the kind of wholesale memory-page sharing that makes fork (and forkserver) mode processes so quick to launch isn’t an available option.

It’s part of the constellation of problems around finding ways to take full advantage of the shared memory access to more efficiently share dynamic state between subinterpreters.