Running subinterpreters in multiple threads succeeds in Python 3.12 but not 3.13

HanatoK · May 6, 2025, 8:17pm

I have the following C++ code that launches 4 threads. Each thread calls Py_NewInterpreterFromConfig and then runs a function from a Python script named gettid.py:

#include <iostream>
#include <thread>
#include <vector>
#include <mutex>
#include <string>
#include <filesystem>
#include "unistd.h"
#include "Python.h"

void execute_python_code(
  PyInterpreterConfig config, const std::string& module_name, const std::string& func_name, std::mutex& print_mutex) {
  const auto currentDir = std::filesystem::current_path();
  // PyGILState_STATE gstate;
  // interpreter 1
  PyThreadState *tstate1 = NULL;
  // gstate = PyGILState_Ensure();
  PyStatus status1 = Py_NewInterpreterFromConfig(&tstate1, &config);
  if (PyStatus_Exception(status1)) {
    Py_ExitStatusException(status1);
    std::cout << "Failed\n";
    return;
  }
  PyThreadState_Swap(tstate1);
  PyObject *sysPath = PySys_GetObject("path");
  PyList_Append(sysPath, PyUnicode_FromString(currentDir.c_str()));
  PyObject* myModule = PyImport_ImportModule(module_name.c_str());
  PyErr_Print();
  PyObject* myFunction = PyObject_GetAttrString(myModule, func_name.c_str());
  PyObject* res = PyObject_CallObject(myFunction, NULL);
  // PyGILState_Release(gstate);
  if (res) {
    const int x = (int)PyLong_AsLong(res);
    {
      std::lock_guard<std::mutex> guard(print_mutex);
      std::cerr << "C++ thread id = " << gettid() << ", python result = " << x << std::endl;
    }
  }
  // destroy interpreter 1
  PyThreadState_Swap(tstate1);
  Py_EndInterpreter(tstate1);
}

int main(int argc, char* argv[]) {
  std::mutex print_mutex;
  std::string pymodule;
  std::string pyfuncname;
  if (argc == 3) {
    pyfuncname = std::string(argv[2]);
    argc--;
  } else {
    pyfuncname = "gettid";
  }
  if (argc == 2) {
    pymodule = std::string(argv[1]);
    argc--;
  } else {
    pymodule = "gettid";
  }
  Py_InitializeEx(0);
  PyInterpreterConfig config = {
    .use_main_obmalloc = 0,
    .allow_fork = 1,
    .allow_exec = 1,
    .allow_threads = 1,
    .allow_daemon_threads = 0,
    .check_multi_interp_extensions = 1,
    .gil = PyInterpreterConfig_OWN_GIL,
  };
  const int num_threads = 4;
  std::vector<std::thread> threads;
  for (int i = 0; i < num_threads; ++i) {
    threads.emplace_back(
      execute_python_code, config, pymodule, pyfuncname, std::ref(print_mutex));
  }
  for (int i = 0; i < num_threads; ++i) {
    threads[i].join();
  }
  int status_exit = Py_FinalizeEx();
  std::cout << "status_exit: " << status_exit << std::endl;
  return 0;
}

The Python script gettid.py simply returns the native thread ID:

#!/usr/bin/env python3
import threading
def gettid():
    return threading.get_native_id()

With Python 3.12, I can compile the code and run it successfully:

C++ thread id = 37174, python result = 37174
C++ thread id = 37175, python result = 37175
C++ thread id = 37177, python result = 37177
C++ thread id = 37176, python result = 37176
status_exit: 0

With Python 3.13, the C++ code seems to get stuck in Py_NewInterpreterFromConfig. Any ideas?

da-woods · May 7, 2025, 4:57am

If nothing else, the documentation for Py_NewInterpreterFromConfig says that it must be called with the GIL held.

HanatoK · May 7, 2025, 4:05pm

Thanks! If so, what is the correct way to run different sub-interpreters in different threads? I have tried calling Py_NewInterpreterFromConfig from the main thread and then calling PyThreadState_Swap in the new threads, but PyThreadState_Swap still seems to deadlock.

da-woods · May 7, 2025, 4:39pm

Mysteriously, if you look at the two places in the Python code-base where it’s used, they don’t appear to hold the GIL (because they’ve just swapped the current thread state out for NULL).

github.com/python/cpython, Python/crossinterp.c at 5ea24116b:

          /*************/
          
          PyInterpreterState *
          _PyXI_NewInterpreter(PyInterpreterConfig *config, long *maybe_whence,
                               PyThreadState **p_tstate, PyThreadState **p_save_tstate)
          {
              PyThreadState *save_tstate = PyThreadState_Swap(NULL);
              assert(save_tstate != NULL);
          
              PyThreadState *tstate;
              PyStatus status = Py_NewInterpreterFromConfig(&tstate, config);
              if (PyStatus_Exception(status)) {
                  // Since no new thread state was created, there is no exception
                  // to propagate; raise a fresh one after swapping back in the
                  // old thread state.
                  PyThreadState_Swap(save_tstate);
                  _PyErr_SetFromPyStatus(status);
                  PyObject *exc = PyErr_GetRaisedException();
                  PyErr_SetString(PyExc_InterpreterError,
                                  "sub-interpreter creation failed");
                  _PyErr_ChainExceptions1(exc);

github.com/python/cpython, Modules/_testinternalcapi.c at 5ea24116b:

                   || whence == _PyInterpreterState_WHENCE_LEGACY_CAPI)
          {
              PyThreadState *tstate = NULL;
              PyThreadState *save_tstate = PyThreadState_Swap(NULL);
              if (whence == _PyInterpreterState_WHENCE_LEGACY_CAPI) {
                  assert(config == NULL);
                  tstate = Py_NewInterpreter();
                  PyThreadState_Swap(save_tstate);
              }
              else {
                  PyStatus status = Py_NewInterpreterFromConfig(&tstate, config);
                  PyThreadState_Swap(save_tstate);
                  if (PyStatus_Exception(status)) {
                      assert(tstate == NULL);
                      _PyErr_SetFromPyStatus(status);
                      exc = PyErr_GetRaisedException();
                  }
              }
              if (tstate != NULL) {
                  interp = PyThreadState_GetInterpreter(tstate);
                  // Throw away the initial tstate.

Which suggests that you can’t trust the documentation (or me) on this - sorry!

I suspect loosely copying their use is the way to go. But not sure I’m a very reliable source on this.

CAM-Gerlach · May 8, 2025, 9:46am

As the world’s foremost expert on subinterpreters, perhaps @eric.snow might be able to weigh in on this.

HanatoK · May 9, 2025, 4:01pm

I applied the following patch, which seems to work in both 3.12 and 3.13 (with or without free-threading), but I still don’t know whether this is the correct solution:

--- main2_old.cpp       2025-05-09 10:56:24.533657095 -0500
+++ main2.cpp   2025-05-09 10:55:48.327840691 -0500
@@ -68,6 +68,7 @@
   };
   const int num_threads = 4;
   std::vector<std::thread> threads;
+  auto* save_state = PyThreadState_Swap(NULL);
   for (int i = 0; i < num_threads; ++i) {
     threads.emplace_back(
       execute_python_code, config, pymodule, pyfuncname, std::ref(print_mutex));
@@ -75,6 +76,7 @@
   for (int i = 0; i < num_threads; ++i) {
     threads[i].join();
   }
+  PyThreadState_Swap(save_state);
   int status_exit = Py_FinalizeEx();
   std::cout << "status_exit: " << status_exit << std::endl;
   return 0;