Packaging interpreter embedding frameworks

Background

I work on cocotb, a framework for testing simulated HDL code, written in C++ and Python. Our framework is loaded into a running simulator application, where it embeds a Python interpreter.

Recently, a contributor pointed out an issue with the RPATH on our libraries: they contain hardcoded paths to the libpython used to compile the extension modules in our project. This works fine if you build the package in the environment you use it in, but it is untenable for binary distribution (something we haven’t done yet). Their specific failure involved pip reusing a cached wheel in a new conda environment after the original conda environment was destroyed.
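For illustration, this is roughly what the problem looks like when inspecting one of the built libraries (the library name and environment path here are hypothetical):

$ readelf -d libcocotbutils.so | grep -i rpath
 0x000000000000000f (RPATH)              Library rpath: [/home/user/miniconda3/envs/cocotb-dev/lib]

Once that environment is deleted, the dynamic linker can no longer resolve libpython and the library fails to load.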

Question

How should a project whose extension modules embed a Python interpreter into some other process get libpython onto the library load path?

Potential Solutions

Just the ones I’ve thought of:

  1. RPATH pointing to system libraries absolutely (current solution): broken
  2. RPATH pointing to system libraries relatively: I don’t think we can assume that site-packages is installed into the same directory structure as the Python install, so probably not feasible (see the sketch after this list)
  3. LD_LIBRARY_PATH: not handled automatically by virtualenv or conda right now, so it’s on the user, which is not great…
  4. RPATH pointing to a libpython shipped with the package: sounds just awful to manage
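For concreteness, option 2 would amount to linking with an $ORIGIN-relative RPATH, roughly as below (the file names and the relative depth are made up). It only works if site-packages always sits at a fixed depth below the Python prefix, which is exactly the assumption I don’t think we can make:

$ g++ -shared -o libembed.so embed.o -lpython3.7m \
      -Wl,-rpath,'$ORIGIN/../../..'   # resolve libpython relative to where this .so is installed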

What version of Python are you targeting? The problem has been solved in 3.8.0 and newer: on most Unix platforms (except Android and Cygwin), Python extensions are no longer linked against libpython. https://bugs.python.org/issue21536 has more details.

We are targeting 3.5 - 3.9-dev right now. I tried replicating with Python 3.8 and there is no failure, but 3.7 does fail. The 3.7 and 3.8 extension modules have identical RPATHs and linked libraries. I’m not sure I understand what changed and why this works now.

The 3.8 extension no longer has a dependency on libpython. Python 3.8 assumes that libpython is already loaded into the global namespace of the current process. That means you no longer need an rpath that includes libpython.so for 3.8 and newer.
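To make that concrete, a host application on Linux can satisfy this assumption by loading libpython into the global namespace itself before initializing the interpreter. A minimal sketch, assuming a hypothetical soname (a real host would also handle errors fully and call Py_FinalizeEx on shutdown):

#include <dlfcn.h>
#include <cstdio>

// Build with: g++ host.cpp -ldl
int main() {
    // Load libpython with RTLD_GLOBAL so that its symbols are visible
    // to extension modules, which on 3.8+ no longer link libpython.
    void *py = dlopen("libpython3.8.so.1.0", RTLD_NOW | RTLD_GLOBAL);
    if (!py) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    // Resolve Py_Initialize through the handle and start the interpreter.
    auto init = reinterpret_cast<void (*)()>(dlsym(py, "Py_Initialize"));
    init();
    return 0;
}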

Python 3.7 and earlier

$ ./configure --enable-shared
$ make
$ LD_LIBRARY_PATH=. ldd build/lib.linux-x86_64-3.7/math.cpython-37m-x86_64-linux-gnu.so 
        linux-vdso.so.1 (0x00007f12b2313000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f12b2182000)
        libpython3.7m.so.1.0 => ./libpython3.7m.so.1.0 (0x00007f12b1e4a000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f12b1e28000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f12b1c5e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f12b2315000)
        libcrypt.so.2 => /lib64/libcrypt.so.2 (0x00007f12b1c23000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f12b1c1c000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007f12b1c15000)

Python 3.8 and newer

$ ./configure --enable-shared
$ make
$ LD_LIBRARY_PATH=. ldd build/lib.linux-x86_64-3.8/math.cpython-38-x86_64-linux-gnu.so 
        linux-vdso.so.1 (0x00007ffec5ef5000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f2658f1e000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f2658efc000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f2658d32000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f26590ae000)
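The same change is visible in python3-config: since 3.8, --libs omits libpython unless you pass the new --embed flag, which embedding applications are now expected to use (output abbreviated; the exact libraries vary by build):

$ python3.8-config --libs
-lcrypt -lpthread -ldl -lutil -lm
$ python3.8-config --libs --embed
-lpython3.8 -lcrypt -lpthread -ldl -lutil -lm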

My attempt with 3.8 only worked by accident: it was picking up my system 3.8 install (not intended), not the one in my conda environment. Consider it also failing.

Python 3.8 assumes that libpython is already loaded into the global namespace of the current process.

Well, that’s the crux of the issue and how this relates to a packaging problem. libpython is not loaded by the current process. Our project is an application extension that embeds a Python interpreter; the application isn’t the Python executable, nor does it load libpython of its own accord. How and when we load libpython once we have it is arbitrary and not a problem for us. Our issue is how to find libpython, and how that relates to how we package our project.

Our current solution (which wasn’t thought out, we just arrived there one day) uses the RPATH to locate libpython, which is a bad idea for the reasons explained above. We want to support binary distribution so we need the built binaries to be relocatable and be able to find libpython on whatever system it’s installed on. What’s the best way to do that? I have the list above of possible solutions, but I’m not sure if I’m missing a possible solution that is cleaner or if there are hidden issues in any of the solutions.

If you’re creating a binary that is loaded into a non-Python main process, and your binary loads a copy of libpython that you ship with your code, in what sense is this a Python extension? It sounds to me that Python’s packaging tools simply aren’t the correct solution for you.

Of course, it’s quite possible that you can get the packaging toolchain to work, but it wouldn’t be a supported configuration. It’s also possible that I’ve completely misunderstood what you’re trying to do 🙂

You are mostly correct, but we don’t ship libpython right now (and would like to avoid it). To give an overview of what the project constitutes: we have pure C++ libraries, we embed a Python interpreter, there is a Python support library, and an extension module calls back into the C++ code from Python.

I’m guessing Python’s packaging utils are only really designed for:

  1. Python being the driver
  2. Not having any pure C libraries, only directly using the C in extension modules
  3. Not reusing the C sources outside of the project

Currently it works, so we are going to stick with it for now, but it’s obvious that half of the project is not “typical”. We will probably go with option 3 if there are no recommended best practices.

Most of the Python ecosystem depends on you using a “normal” distribution. For embedded scenarios, you’re unfortunately on your own.

Unless you’re doing a very light integration (like Vim or Emacs), I’d strongly recommend distributing libpython and any packages you need yourself, and keeping well away from the system installs. Your app will be far more reliable and simpler for your users. The standard library can be trivially reduced to about 15 MB, to as little as 10 MB with some care, or even further if you know which bits you don’t need.

Alternatively, if users need to be able to install their own Python packages, I’d suggest distributing your integration as an extension module and requiring users to provide a full Python install themselves, which you invoke as a separate process, relying on the extension module for interaction (like how OBS and Blender do theirs). It’s a bit more work on the user side, but less than if they need to force a numpy install into your private copy. The trick here is to avoid interacting with libpython at all from your code (other than the extension module) and use a spawned Python process instead.
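A minimal sketch of that spawn-and-communicate approach, with a made-up script name and a plain pipe standing in for whatever IPC mechanism you’d actually use:

#include <cstdio>

int main() {
    // Run the user's own Python install as a child process and read its
    // output over a pipe, instead of loading libpython into this process.
    FILE *child = popen("python3 run_integration.py", "r");
    if (!child)
        return 1;
    char line[256];
    while (std::fgets(line, sizeof line, child))
        std::fputs(line, stdout);
    return pclose(child) == 0 ? 0 : 1;
}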

In the worst case, you could execute a script with Python to locate its libpython, but you won’t necessarily get the same result if you then load it directly. So I’d really suggest keeping full control over libpython by including it, or have no control by only spawning a process and use inter-process comms.
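For reference, asking the interpreter where its libpython lives can be done with sysconfig (LIBDIR and INSTSONAME are standard config variables; the output below is just an example):

$ python3 -c "import sysconfig as s; print(s.get_config_var('LIBDIR'), s.get_config_var('INSTSONAME'))"
/usr/lib64 libpython3.7m.so.1.0

But as noted above, loading whatever this prints is not guaranteed to behave the same as the interpreter that reported it.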

So I’ve done some reflection and realized that our project works well currently because the build step binds the package to the installed environment. This means that our package, by its very nature, is not relocatable. And from what @steve.dower is saying, making it relocatable means opening a can of whoop-ass on ourselves. I think for now it’s best that our project not press forward on making the package redistributable in binary form.

So now the only problem is fixing the wheel reuse issue (which is a form of binary redistribution). Does pip have some way of marking a package as not valid for cached wheel reuse, or of declaring that creating wheels is simply invalid? This is really only an issue with conda, because each environment has what appears to be a separate Python install.

https://pip.pypa.io/en/stable/reference/pip_install/#caching
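For the user side of it, pip can be told to bypass the wheel cache or to skip wheels entirely (both are standard pip options):

$ pip install --no-cache-dir cocotb        # ignore pip's wheel cache entirely
$ pip install --no-binary cocotb cocotb    # force a build from source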

conda specifically supports an interpreter per environment (it’s an attraction for its users). It sounds like you might want to use conda packages instead of pip ones in those scenarios.