Towards standardizing cross compiling

benfogle · August 27, 2021, 8:09pm

I am the maintainer of crossenv, a tool that creates a special virtual environment such that once set up, pip wheel numpy will cross-compile a wheel for a different architecture. While this is definitely a niche tool, it works well enough that it has started to see some interest in various other projects. (The one in particular that made me write this is pypa/cibuildwheel issue #598, “Cross compiling aarch64 wheel”.)

I would like a way to maintain crossenv in a less hacky way, especially as other people start to rely on it. Currently to make it work, I patch a handful of standard library modules and a few third party modules. It’s the third party modules that give me more trouble; they move much faster. It’s also the third party modules that require the most workarounds, and the most likely to just not work. For example, manylinux support is disabled by default because the detection algorithms break when crossenv is active.

I don’t know that a cross-compiling environment, as I have done it, is the only or best way of cross-compiling Python extension modules, but I’ve found it very helpful dealing with the wealth of setup.py’s already out there. It would be very useful to have an accepted way of communicating to other tools that they are cross-compiling. It would make it easier to submit patches to other projects to improve cross-compiling support. It would make it easier for other projects to mock cross compiling for testing. It would make it easier for new language support (think setuptools-rust) to have cross-compiling support from the start. It might play nicely with PEP 517, too.

I would like to hear what the community at large thinks of this. If there is interest in this, I would be happy to try and put together a PEP. I am not proposing that cross-compiling become officially supported.

Some initial thoughts: Right now, you can detect cross-compiling on *nix by checking if sysconfig.get_config_var('HOST_GNU_TYPE') != sysconfig.get_config_var('BUILD_GNU_TYPE'). I also set an environment variable PYTHON_CROSSENV=1 and add sys.cross_compiling. Maybe adding a module, like the _manylinux module technique in PEP 600 would be better?
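A minimal sketch of the detection described above, combining the three signals mentioned (the sysconfig comparison, the PYTHON_CROSSENV variable, and the sys.cross_compiling attribute that crossenv adds). The function name is_cross_compiling is made up for illustration; it is not part of crossenv or the standard library.

```python
import os
import sys
import sysconfig


def is_cross_compiling(environ=None):
    """Best-effort check; returns False when no signal is present.

    Hypothetical helper, not an existing API.
    """
    environ = os.environ if environ is None else environ
    # crossenv-specific signals mentioned in the post.
    if environ.get("PYTHON_CROSSENV") == "1":
        return True
    if getattr(sys, "cross_compiling", False):
        return True
    host = sysconfig.get_config_var("HOST_GNU_TYPE")
    build = sysconfig.get_config_var("BUILD_GNU_TYPE")
    # Both variables are typically missing on Windows; treat that as native.
    return bool(host and build and host != build)
```

On a native build this returns False, because the two GNU triplets match or are absent.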

steve.dower · August 27, 2021, 9:20pm

I am definitely interested in this, as I’m looking into how to enable cross-compilation on Windows (since the ARM64 compilers don’t actually run on win_arm64 devices, and so few people have ARM64 devices yet anyway). More generally, I’d like to be able to build all variants of a wheel from a single environment per platform (rather than per platform+version+architecture).

Currently the platform detection is standard-by-implementation in packaging.tags, which suggests that it’s probably the place to standardise either figuring out host vs. target (or allow a user override, which is my preference - env variable, probably). Then it’s up to the build backends to be able to select the right build commands (or again, allow a user override). Tools like crossenv count as a “user override” here - they can set the right env vars for locating compilers, etc.

Given the style of past PEPs in this space, I think the right approach is to try and standardise the config_setting value passed in the build_wheel hook (added by PEP 517). That gives build front-ends and backends the communication channel needed here, and allows the user to configure their frontend rather than having to handle each package based on its backend.
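As a rough illustration of that channel, a backend could read a standardised key out of config_settings inside its PEP 517 build_wheel hook. The key name "host-platform" below is an assumption made up for this sketch; no PEP defines it. Only the hook signature comes from PEP 517.

```python
def requested_host(config_settings=None):
    """Return the user-requested host platform, or None for a native build."""
    settings = config_settings or {}
    # "host-platform" is a hypothetical key, not a standardised one.
    return settings.get("host-platform")


def build_wheel(wheel_directory, config_settings=None, metadata_directory=None):
    """PEP 517 hook signature; the body is a placeholder for a real backend."""
    host = requested_host(config_settings)
    target = host if host is not None else "the build machine"
    raise NotImplementedError(f"sketch only: would build for {target}")
```

A frontend that supports passing settings through (for example with a -C style option) would then need no cross-compiling knowledge of its own.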

steve.dower · August 27, 2021, 9:31pm

For reference, cross-compilation on Windows basically just requires:

- choosing the right compiler (typically from the same install)
- having the right libs folder (import libraries)
- knowing the right .pyd tag

The rest is easy, and I’d be quite willing to write a backend helper module to get the import libraries on demand. Unfortunately, knowing the right .pyd tag is impossible without just hardcoding it for each possibility. And even if you have the right libs, the build backends don’t all allow specifying where to look (without modifying the package itself). I think this can all be fixed with sysconfig overrides though.
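To make the "hardcoding it for each possibility" point concrete, here is a sketch of such a table for release builds of CPython on Windows. The suffix pattern is what I understand current CPython releases to use; treat it as an assumption to verify against a real install, and note that pyd_suffix is a made-up helper name.

```python
# Platform tags this sketch knows about (assumed, not exhaustive).
_WINDOWS_PLATFORMS = frozenset({"win32", "win_amd64", "win_arm64"})


def pyd_suffix(platform_tag, version=(3, 9)):
    """Guess the extension-module suffix for a release CPython build.

    Hypothetical helper: the pattern is hard-coded rather than queried,
    because the target interpreter cannot be run on the build machine.
    """
    if platform_tag not in _WINDOWS_PLATFORMS:
        raise ValueError(f"unknown Windows platform tag: {platform_tag}")
    major, minor = version
    return f".cp{major}{minor}-{platform_tag}.pyd"
```

Debug builds and other implementations would need further entries, which is exactly why a sysconfig override would be preferable.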

benfogle · September 3, 2021, 3:08am

I’m glad to hear that there is some interest. I did some work to spin up on cross compiling on Windows, just to be sure that what I’m saying isn’t going to be too *nix centric. (Building an ARM64 Python was shockingly easy, by the way, compared to Linux.)

Vocab for below. Based on the GNU autotools convention that CPython itself uses:

Host: the system that will run the binary. Embedded device, etc.
Build: The system that produces the binary. Desktop or server, usually.

Here are what I think some of the main hurdles are:

The sysconfig module is the main place for info about the host, including, on *nix, compile/link flags. It would be nice to preserve that so that most tools won’t have to do anything different. When cross-compiling, the main step is to set the environment variable _PYTHON_SYSCONFIGDATA_NAME to make sysconfig report data for the host, rather than the build system. (This, plus a patch or two to platform is enough to get packaging.tags mostly working.)
Build and runtime dependencies need separate path handling. For example, if a package requires Cython, we don’t want pip cross-compiling Cython for ARM then attempting to run it on x86. And we don’t want build dependencies polluting the host’s sysroot. The grossest patches in crossenv are for handling this distinction. Hopefully just exposing the fact that we are cross-compiling can shift that handling into pkg_resources or wherever it really belongs.
Information usually known only at runtime presents a special challenge. Example: sys.platform is used to determine compilers, but also used by subprocess to call the right implementation of Popen. So sometimes it needs to be “right” and sometimes it needs to be “wrong.” This precludes compiling for Windows on Linux and vice versa. There are lots of little things like that all over the ecosystem. They’ll need to be caught and fixed as we find them.
Third party build dependencies may want to know if they are cross compiling. This would be useful for Cython, and pkg_resources might want to alter its search algorithm to distinguish between runtime and build dependencies. We will need a way of giving them this information beyond a PEP 517 hook.
A host of little things that don’t handle cross compiling well. For example, packaging checks for manylinux support on linux-arm by examining the ELF headers of sys.executable. This fails when cross compiling, and packaging would fail even harder if sys.executable actually pointed to an ARM binary. Again, we’ll need to fix these as they are uncovered.
I’ve twice mentioned “a bunch of little things” to fix. I’m a little concerned that the initial version of a PEP might not specify all the information needed to fix one of these little things. No one wants a whole bunch of cross compiling PEPs.
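To illustrate why the ELF check on sys.executable goes wrong, here is a small sketch that reads the machine field from an ELF header. It is not packaging’s actual code; elf_machine and the EM_NAMES table are names made up for the example. The header reflects whatever binary it is given, so pointing it at the build interpreter reports the build architecture, not the host.

```python
import struct

# A few e_machine values from the ELF specification.
EM_NAMES = {3: "i386", 40: "arm", 62: "x86_64", 183: "aarch64"}


def elf_machine(header):
    """Return an architecture name from the first 20 bytes of an ELF file.

    Returns None if the bytes do not look like an ELF header.
    """
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    # Byte 5 (EI_DATA): 1 means little-endian, 2 means big-endian.
    endian = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18.
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    return EM_NAMES.get(machine, f"unknown({machine})")
```

Usage would be along the lines of reading the first 20 bytes of sys.executable in binary mode and passing them in.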

Thoughts on implementation

As suggested, pass cross compiling info to build_wheel. Probably a dictionary of information, including _sysconfigdata.py for both host and build (or at least a method of locating it). It would also need to include runtime information (like what os.uname() would return), but we might also provide predetermined runtime information by platform name.
Exposing this information to various third-party build dependencies is a bit murkier. However it happens, I think we should expose the information at a very low level, and another package not defined here would provide a friendly API:
- We could set environment variables, but I am concerned about them proliferating. crossenv already sets five. The other problem with this is that some of these environment variables (_PYTHON_HOST_PLATFORM, etc.) should really be considered implementation details. I don’t want to standardize them or create a parallel standard.
- The user or frontend could create a module, similar to PEP 600’s _manylinux module. This is slightly more complicated, and an unusual way of passing information around, but it keeps everything in one place and is much more extensible. It would lend itself well to creating isolated environments. I would expect this to be a pprint-ed version of whatever gets passed to the build_wheel hook, following the example of _sysconfigdata.py.
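A sketch of the module-based option, assuming the information is a plain dictionary of literals. The module name _cross_compile_info and the variable name cross_info are placeholders invented here; nothing standardises them.

```python
import importlib.util
import pprint
from pathlib import Path


def write_cross_info_module(directory, info):
    """Write `info` as a pprint-ed dict into a hypothetical module file."""
    path = Path(directory) / "_cross_compile_info.py"
    path.write_text("cross_info = " + pprint.pformat(info) + "\n", encoding="utf-8")
    return path


def load_cross_info(path):
    """Import the generated file and return the dictionary it defines."""
    spec = importlib.util.spec_from_file_location("_cross_compile_info", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.cross_info
```

Because the file is ordinary Python, a tool that finds it on sys.path could simply import it, the same way sysconfig imports _sysconfigdata.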

mattip · September 3, 2021, 4:23am

I am surprised Cython needs any cross-platform information. I thought Cython produces platform- and python-version-agnostic C code, with many #ifdef guards for the different systems. At least that is the theory behind projects including the cython-generated C files in their sdist bundles.

benfogle · September 3, 2021, 12:15pm

You’re right, it doesn’t. I’m just using Cython as a well-known stand-in for code generation tools generally. It’s certainly conceivable that Cython or a similar tool might be able to generate better code if it knows something about the target.

The problems I’ve actually had with Cython relate to build vs runtime dependencies, which isn’t Cython’s problem to solve.

rgommers · September 3, 2021, 3:37pm

Thanks for sharing @benfogle, this is an interesting topic.

Commenting on this sentence, but it applies more generally: it seems to me that the standardization and PEP ideas you are talking about are at odds with assuming setuptools is the only thing that matters. If your aim is to improve the cross-compiling situation for setuptools users, then that’s a very useful thing to do - but it doesn’t have much to do with standardization and may just need a setuptools issue/PRs. Identifying the things in the stdlib that need better support (like sysconfig) is probably the only thing that may be relevant to other build systems.

Just a heads up that it’s very likely we’ll move NumPy away from distutils (to Meson). That’s what I’d recommend for packages that are as complex as NumPy - general cross-compiling is quite hard, so if that’s what one wants it’s better to use a build system that has built-in support for it, like Meson or scikit-build/CMake.

steve.dower · September 3, 2021, 4:33pm

build_wheel in this case is the PEP 517 hook with that name, not the setuptools command.

Except now we’re back tying build information into CPython releases, which is what we were trying to get out of by enabling other build systems. So the “cross compiling info” mentioned above should be standardised enough that a front end can read or infer it from somewhere, and the backend can understand what it’s being given and turn it into a correct invocation of whatever tool it is running.

rgommers · September 3, 2021, 8:31pm

Ah okay, sorry for the noise there.

Those two things seem unrelated. Other build systems are a significant improvement irrespective of how tight the CPython coupling is. And can you really get away from a CPython release completely? There’s CPython-version-specific headers and C APIs, so it seems unlikely at least in the short to medium term.

Agreed. But let me try once more: building (as in, produce a bunch of Python extension modules) is a lower-level concept than a wheel, or a virtual environment. I’d like to be able to cross-compile for any target artifact (e.g. a conda package or an rpm). And also using a conda env (or for Nix, or whatever env is managing Python and other build dependencies), not just a virtualenv. So it seems to me like cross-compiling is something the build system should know about first. It needs access to some information, like target architecture and Python version and properties (32/64-bit, etc.). And then other tools like auditwheel may need such info - but that’s a separate step, which depends on the type of artifact being produced.

That depends on the build system. Compilers may simply be determined by setting the CC, CXX, FC environment variables for example. The build system may not even be written in Python; the arguments for what is and isn’t possible really do seem setuptools-specific.

I don’t know exactly what the ideal solution would look like, but assuming only a virtualenv, a wheel and a setuptools-like build system seems limiting.

This makes perfect sense. It’s something I wasn’t thinking about when writing my first message and the rest of this one though, because to me this (how does a user invoke cross-compilation) is the easier part of the problem compared to letting the build backend actually handle cross-compilation well. For NumPy, SciPy and the like we’re still at that stage - as of today it mostly doesn’t work at all. So I don’t have a good sense of what needs to go in the proposed config_settings, but I’d be curious to see a proposal.

davidhewitt · September 6, 2021, 4:25pm

PyO3 and setuptools-rust contributor here. Thanks to LLVM and cargo, Rust has very good cross-compile support built in, so we get a lot of users attempting cross-compilation of Python projects using PyO3.

The hardest part is figuring out the Python build configuration which we’re targeting. This is primarily:

Python API version (3.7, 3.8, 3.9 etc)
implementation (we support both CPython and PyPy, for example)
the name and location of the shared library (typically libpython3.8.so or similar, however in theory the Python distribution is free to name this as desired)
build flags such as Py_DEBUG or Py_TRACE_REFS which affect C struct layouts, library name etc.

Note that cross-compilation isn’t limited to extension modules; we also often see users who want to embed Python into Rust programs.

For the most part, when compiling for the host OS we can run the Python interpreter and get everything we need from sysconfig. (Although sysconfig is a bit lacking on Windows.)

It’s not possible to run the target Python interpreter during a cross-compilation, so we resort to getting this information however we can. For Unix targets we can get this from the _sysconfigdata file with full knowledge that we accept responsibility if this approach breaks. For Windows we don’t even try; we just hard-code a typical Windows configuration and assume that’s correct. Users can supply their own override if what we deduced is incorrect.
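For readers unfamiliar with the Unix approach, here is a minimal Python sketch of reading a _sysconfigdata file without importing or executing it. It assumes the file contains a top-level build_time_vars = {...} assignment made only of literals, which is how CPython generates it today but is not a documented guarantee. This is an illustration, not PyO3’s actual implementation.

```python
import ast


def read_build_time_vars(source):
    """Extract the build_time_vars dict from _sysconfigdata source text."""
    tree = ast.parse(source)
    for node in tree.body:
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id == "build_time_vars":
                # literal_eval accepts an AST node and never runs arbitrary code.
                return ast.literal_eval(node.value)
    raise ValueError("no build_time_vars assignment found")
```

Keys such as "EXT_SUFFIX", "LDLIBRARY", or "Py_DEBUG" can then be looked up in the returned dictionary.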

In summary, we’d be really interested in having an officially-sanctioned way to configure cross-compilation.

It seems to me that agreeing on a metadata file (say, JSON or TOML) which contains a basic set of information about the Python distribution at hand would be a useful starting point. Build frontends would be able to write their own integrations on top of this as suited them.

benfogle · September 6, 2021, 5:36pm

As much to get my own thoughts organized as anything else, I went ahead and made a rough draft just to see what a proposal might look like. See https://github.com/benfogle/peps/blob/master/pep-9999.rst

I think this addresses most of the concerns here without overly expanding the scope of the proposal. I’ve done my best to consider Windows, Linux, and other systems. I’ve tried to consider languages other than C. My hope is that it is independent of setuptools vs Meson vs other backends.

If this looks like a viable plan, I would be willing to get started on the formal PEP creation process.

FFY00 · September 6, 2021, 6:44pm

Have you considered adding a custom hook for cross compiling, instead of hijacking config_settings? I feel like it may be a better option: no backwards incompatibility, backends would have to specifically opt-in if they support cross compiling, and we could have a standardized set of arguments in that hook for common options, if needed. We could recommend falling back on build_wheel, just like we do for prepare_metadata_for_build_wheel: if cross_build_wheel (an optional hook) is not available, fall back to build_wheel, but only consider the resulting wheel valid if it is pure.
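The fallback rule could look roughly like this on the frontend side. The cross_build_wheel signature (wheel directory, host, config settings) is an assumption for the sake of the sketch, since no such hook is specified anywhere; the purity check relies only on the standard wheel filename layout.

```python
def wheel_is_pure(filename):
    """True if the wheel filename ends in the '-none-any' ABI/platform tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    stem = filename[: -len(".whl")]
    abi, platform = stem.split("-")[-2:]
    return abi == "none" and platform == "any"


def build_for_host(backend, wheel_directory, host, config_settings=None):
    """Prefer the hypothetical cross_build_wheel hook, else accept pure wheels."""
    hook = getattr(backend, "cross_build_wheel", None)
    if hook is not None:
        # Assumed signature; the real hook, if ever specified, may differ.
        return hook(wheel_directory, host, config_settings)
    filename = backend.build_wheel(wheel_directory, config_settings)
    if not wheel_is_pure(filename):
        raise RuntimeError(f"{filename} was built for the build platform, not {host}")
    return filename
```

The cost of the fallback is a wasted native build whenever the backend lacks the hook and the package is not pure.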

I can make a PoC for https://github.com/FFY00/mesonpy and https://github.com/pypa/build, if you want.

steve.dower · September 6, 2021, 6:51pm

Yeah, that’s the kind of thing I had in mind (plenty of details to argue about, but they’re not important just yet).

I think the most realistic thing we can hope for is a standard sysconfig “dump” tool that can be run on the host platform to produce all the information needed to build on the build platform. We could start including it in CPython as a static file, but that doesn’t help with past releases so we’d need it anyway (to create a database of common settings in a library, or for users to generate their own to override with).
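A very small version of such a “dump” tool might look like the following. The JSON layout (the four top-level keys) is invented for this sketch and is not a proposed format; only the sysconfig and sys calls are real.

```python
import json
import sys
import sysconfig


def dump_sysconfig():
    """Serialise the running interpreter's build configuration as JSON text."""
    data = {
        "implementation": sys.implementation.name,
        "version": list(sys.version_info[:3]),
        "platform": sysconfig.get_platform(),
        "config_vars": sysconfig.get_config_vars(),
    }
    # default=str is a safety net for any value json cannot encode natively.
    return json.dumps(data, indent=2, sort_keys=True, default=str)


if __name__ == "__main__":
    print(dump_sysconfig())
```

Run on the host platform, the output could be copied to the build platform and handed to a backend as an override.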

I certainly have, and unfortunately it’s a bit embarrassing that we can’t actually use config_settings for anything (it’s not “hijacking”, as we’re literally passing configuration settings to do with building a wheel). Because you’re right that backends without support will go ahead and build for the build platform rather than the host, with no real way to detect it.

We probably should have required backends to fail on any unrecognised keys in config_settings (and that still may be viable, as it’s pretty easy to add as long as there are no config settings). It would save frontends from having to do the double-pass logic, as well as having arbitrary rules for whether the build_wheel result is portable.
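For illustration, a backend that rejects unrecognised keys needs only a few lines. The key names in KNOWN_SETTINGS are placeholders, not settings of any real backend.

```python
# Hypothetical set of keys this imaginary backend understands.
KNOWN_SETTINGS = frozenset({"host-platform", "bundle-libs"})


def check_config_settings(config_settings=None):
    """Raise ValueError if any config_settings key is not understood."""
    unknown = sorted(set(config_settings or {}) - KNOWN_SETTINGS)
    if unknown:
        raise ValueError("unrecognised config_settings keys: " + ", ".join(unknown))
```

With this in place, a frontend asking for a cross build from a backend without support would get an error instead of a silently native wheel.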

FFY00 · September 6, 2021, 7:14pm

config_settings purposely did not specify any semantics for its contents, as it’s supposed to hold backend-specific configuration. At least this is what I understand… I don’t think it was meant to hold standardized data such as what is being proposed here, so I actually mostly agree with the way config_settings was defined in PEP 517.

Do note that we could still potentially add new arguments to the build_wheel hook. I think it would be reasonable to have the frontend introspect its signature and check what is and isn’t supported, though admittedly it’s not the cleanest solution.
But for this use-case, I think a new hook is something that makes sense.
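The introspection alternative mentioned above could be sketched with inspect.signature. The helper name hook_accepts and the host_platform argument in the usage note are made up for the example.

```python
import inspect


def hook_accepts(hook, name):
    """Report whether `hook` can be called with a keyword argument `name`."""
    params = inspect.signature(hook).parameters
    if name in params:
        return params[name].kind is not inspect.Parameter.POSITIONAL_ONLY
    # A **kwargs catch-all accepts the argument but may silently ignore it,
    # which is part of why this approach is not the cleanest.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())
```

A frontend would call something like hook_accepts(backend.build_wheel, "host_platform") before deciding how to invoke the hook.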

steve.dower · September 6, 2021, 7:19pm

But how are you meant to put config settings into it? Do they get lifted from pyproject.toml in some way I haven’t heard about?

My understanding is that it was arbitrary user provided values, which is kind of what we’re just trying to organise here. No need for the front-end to do anything other than pass through the config, we just want to make sure that users have a consistent set of config that backends will be able to use.

Yeah, I’d rather have a new hook than make frontends do introspection. It’s easy enough (as a backend) to implement one hook by tweaking arguments and calling another.

benfogle · September 6, 2021, 7:29pm

In that case, we’d at least be able to open a ticket with the backend saying “Add support for PEP XYZ”, which is a big step forward from where we are now.

I’d wonder about projects like cibuildwheel which try to provide a single environment for many different packages. If a Rust project rejects CFLAGS related config data, but a C project requires it, then there’s no way they could provide a single set of config settings that “just works” with most packages. Maybe this argues in favor of a hook rather than config_settings.

steve.dower · September 6, 2021, 7:46pm

I had expected the top-level setting to be something like host or compiler, rather than a specific setting, but the suggested method of setting these values in PEP 517 argues against that.

How would a backend building a Rust component get its equivalent of CFLAGS if it can’t infer it from a CFLAGS value (for example, obtained from sysconfig)? At a high level, being able to override sysconfig variables seems sufficient, but again, without convention, the config_settings dictionary is basically useless for any purpose.

FFY00 · September 6, 2021, 8:22pm

build 1.0.3, see --config-setting/-C.

An example use case that I am implementing in one of my projects is to allow distro packagers to use the system libraries in the wheels. By default I will bundle the required libraries, producing a wheel suitable for distribution on PyPI, but this behavior is undesirable for distro packagers: as a distro packager, I want the wheel to link against the system libraries instead (e.g. python -m build -Cbundle-libs=False).

Setuptools also uses this mechanism to receive arguments it could previously receive via the setup.py invocation.

setup.py:

python setup.py bdist_wheel build_ext -D"BAR=Foo;VAR=TRUE"

pip:

pip wheel --global-option="build_ext" --global-option="-DBAR=Foo;VAR=TRUE" .

pypa/build:

python -m build -C--global-option=build_ext -C--global-option="-DBAR=Foo;VAR=TRUE"

Note that the implementation of config_settings here is not the greatest, but yeah.
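For anyone wondering what the backend ends up receiving, here is a rough sketch of how a frontend might turn repeated -C options into the config_settings dictionary. It is written from the behaviour described above (key=value pairs, repeated keys collected into a list) and is not the actual pypa/build code.

```python
def parse_config_settings(options):
    """Turn ['key=value', ...] into a dict; repeated keys become lists."""
    settings = {}
    for option in options:
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = option.partition("=")
        if key not in settings:
            settings[key] = value
        elif isinstance(settings[key], list):
            settings[key].append(value)
        else:
            settings[key] = [settings[key], value]
    return settings
```

For the command above, the backend would see a single "--global-option" key holding both values in order.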

What would we gain from doing that via a config_settings dictionary instead of the function arguments? IMO the difference is that one would expect the function arguments to have the same semantics between backends, while one would not necessarily expect that from a config_settings-like argument.

Anyway, PEP 517 does explain the motivation.

steve.dower · September 6, 2021, 8:32pm

We wouldn’t have to update frontends if it’s all done through the existing mechanism. One consistent option to specify the target platform (sorry, “host platform”, which to me is always going to sound like the one we’re building on…) means when the backends have support (which they need to do anyway), we can pass the desired value through all packages.

Backends are of course free to ignore agreed-upon standards, and then developers will just avoid those backends because they don’t work for their users. I don’t think there’s a need to technically enforce it through code in a hook, provided there’s at least something documented that enough of us have agreed upon. In core Python terms, this is an “informational” PEP rather than “standards track”.

FFY00 · September 6, 2021, 8:45pm

Fair enough.