"Modern" way to build packages with C (Cython) extensions

Will be released in a few days: https://ofek.dev/hatch/dev/config/build/#explicit-selection & https://ofek.dev/hatch/dev/plugins/build-hook/#build-data

I’d also like @henryiii 's input, especially in the light of his proposal at Scikit Build Proposal - ISciNumPy.dev

As one of the maintainers of PyArrow, I’ve found integrating CMake + Cython + setuptools to be quite painful.

1 Like

I’d really love a great, CMake-backed way to build - and that’s what I proposed working on in the proposal linked by @pitrou above. I don’t think it’s that hard, but to do it properly, it needs great documentation, and reusable components (I’d be happy to help make a programmatic wheel interface as part of it, for example :slight_smile: )

I’ve been helping keep scikit-build alive and well, but it’s currently just wrapping setuptools heavily and is very fragile, and doesn’t cover some important uses (like editable installs) very well. It doesn’t support setup.cfg, much less pyproject.toml configuration, etc. It does have a Cython CMake file that I know some people are using (the built-in CMake files need work too; I mostly use pybind11’s excellent files instead). It also doesn’t support modern FindPython. It does have a number of users, though.

The current best “docs” I know of are the pybind11 and scikit-build examples. pybind11 has pure CMake and scikit-build examples, along with a pure setuptools example.

For the part after the setup, the Scikit-HEP Developer Pages have good examples of using cibuildwheel, setting up CI, etc. There’s also a really cool in-browser Pyodide app that checks a GitHub repository against the style guidelines. :slight_smile: (Unrelated, but really cool - did I say that already?)

I’ll be talking about building binary packages at PyCon this week, which might be helpful too. But the ideal case is that I’d be able to write scikit-build-core.

What are you missing for that? Perhaps the PSF or some other body can help?

1 Like

Currently waiting on the response from the NSF to see if the project gets funded. If it doesn’t, I’ll try to work out some alternative way (this is already the second attempt, and I already have a separate proposal locally that might be able to increase my time on scikit-build too). The actual scikit-build-core part I probably could do on the basis of use within HEP (we have at least three packages that would benefit from this), but all the important surrounding work (docs, integration with other systems, trying it out on multiple projects, integrating it with the existing scikit-build users, updates to CMake for better communication, Cython, etc) will take more time than I can ask for from IRIS-HEP, which is why I had a proposal.

3 Likes

FYI, for Submit topic proposals for the Python Packaging Summit 2022! - #4 by CAM-Gerlach, my topic is removing dependence on setuptools for extensions, i.e. a new library that all backends could use

2 Likes

I believe @henryiii will be presenting it for you :slight_smile:

1 Like

FYI, I’ve let the cat out of the bag today at SciPy, so will share here too: Scikit Build Proposal - ISciNumPy.dev has been accepted as of yesterday as NSF 2209877 - I’ve been funded to work on Scikit-build for the next three years. :smiley:

27 Likes

Hmm – seems I wrote this ages ago – and forgot to post it. But I think it’s still relevant:

I’m quite excited to watch scipy’s progress with meson. For meson-python though, I have to wonder if it wouldn’t be, in the long term, a less painful road to have entirely external binary builds.

I’m also excited about meson-python – but not sure about the “entirely external binary builds”.

Waaayyyy back in the day, I was very, very pleased to discover the distutils extension-building capabilities – having one way to build extensions that were compatible with the Python binary and worked on all platforms was absolutely amazing! It turns out that the architecture was very hard to extend and customize, but it performed a really great service – for the “basic” stuff, it “just worked” – and we still don’t seem to have a quick and easy cross-platform way to do it.

The fact is that there are a number of us that want our packages to work on all platforms, but don’t have build expertise in more than one or two :frowning:

So hopefully meson-python will fill that niche.

Well, there’s also going to be @henryiii 's Scikit Build as a good option, especially once he completes his work that just got funded. And of course, the Scikit-HEP packaging tutorial, repo checker et al. are a great asset too and one I highly recommend to the many folks I run into in your shoes.

I’ve worked on PyTorch for a couple of years, including build & packaging stuff. I doubt that this has to do with the thin shim in setup.py - the problem is simply that PyTorch is an extremely complex package itself and has ~30 submodules in third-party/, which also contain complex C++ code.

What packaging for other build systems typically does is (a) build a wheel, (b) unpack the wheel, and (c) repack it into the correct package format after doing what the tooling needs to do to make everything relocatable, strip debug info, etc. The wheel step is not very interesting; it’s just a convenient interface to be able to do pip install . --no-build-isolation instead of figuring out for every package what the right setup.py / cmake / meson / ninja invocations are.
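For concreteness, a minimal sketch of that unpack/repack flow (the actual “repair” work in step (c) is elided; real tools like auditwheel or delocate rewrite library references and also regenerate the wheel’s RECORD, which this sketch doesn’t do):

```python
import pathlib
import tempfile
import zipfile

def repack_wheel(wheel: str, out_dir: str) -> pathlib.Path:
    """(b) unpack a built wheel, (c) tweak its contents, then zip it back up."""
    wheel = pathlib.Path(wheel)
    with tempfile.TemporaryDirectory() as tmp:
        tmp = pathlib.Path(tmp)
        with zipfile.ZipFile(wheel) as zf:          # (b) unpack the wheel
            zf.extractall(tmp)

        # ... (c) vendor shared libraries, fix RPATHs, strip debug info, etc. ...

        out = pathlib.Path(out_dir) / wheel.name    # out_dir is assumed to exist
        with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in sorted(tmp.rglob("*")):        # repack everything into a wheel again
                if f.is_file():
                    zf.write(f, arcname=f.relative_to(tmp).as_posix())
    return out
```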

+1 this is lacking pretty badly.

@henryiii congrats! Looking forward to collaborating to make building packages with native code easier:)

I’m not even sure what this means. meson-python is a thin shim layer; Meson does all the heavy lifting. There are a lot of little annoying things one has to do to comply with packaging standards (wheel tags, for example, tend to be a source of bugs). The most interesting parts of meson-python are probably the UX ones, like “if I use pip or build, how do I pass along custom build options” or “how do I express the ABI compatibility constraints for depending on numpy in pyproject.toml in a sane way”.

5 Likes

Thinking about this a bit more – maybe some lessons from conda:

IIUC, what conda-build does is essentially:

  • create a clean isolated environment in which to install the package

  • take a snapshot of all the files in that environment

  • run whatever third-party install processes / scripts are specified

  • compare what’s now in that environment with the snapshot

  • everything new needs to be bundled in the package.

Lots of complications, of course, but the basic idea is remarkably simple.
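For concreteness, a rough sketch of that snapshot/diff idea in Python (the /tmp/build-env prefix and the build.sh script are purely illustrative; conda-build also records hashes, rewrites embedded prefixes, and so on):

```python
import pathlib
import subprocess

prefix = pathlib.Path("/tmp/build-env")   # the clean, isolated environment

def all_files(root: pathlib.Path) -> set[pathlib.Path]:
    """Relative paths of every file currently under root."""
    return {p.relative_to(root) for p in root.rglob("*") if p.is_file()}

before = all_files(prefix)                            # snapshot of the environment
subprocess.run(["bash", "build.sh"], check=True)      # run the package's own install step
new_files = sorted(all_files(prefix) - before)        # everything that appeared...
print(new_files)                                      # ...is what gets bundled into the package
```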

What’s nice about it is that it makes a very clean separation between building the code itself, and building the conda package, and thus it can be used with ANY code-build system.

I think wheels could be done in a similar way – building the wheel would be completely separate from building or installing the package. That may help with the “assembl[ing] wheels from a disorganised set of files” issue – you would never try to do that. You would only assemble wheels from files that had already been completely installed into a Python environment, i.e. a well-organised set of files, and the wheel code would never need to understand that organisation.

2 Likes

This is kind of the world that everything grew into though, and we’ve already grown out of it :wink:

Linux packages generally make && make install (or some variant), because you build them for use on the current machine. Python packages started the same way, and then someone clever realised you could capture the installed files and just copy them around. If you dig into the commands in distutils then you’ll see a lot of evidence for this being the original model.

Conda came later than distutils, but was still basically in a world of installing Linux-intended packages into a prefix (standard term for all of these systems) and then collecting them up to redistribute.

These days wheels are far closer to the Windows model (and arguably the macOS model) of explicitly generating an installation package, rather than inferring the package from the installation. When the goal of building is to distribute it, this is the better approach, so there’s no reason to move back towards the old style.

Of course, the real issue is that most active developers in this space are on Linux-like devices, where you can’t reliably build a redistributable native module first time. You have to start with a concrete, just-for-my-device build and then tweak it later. Conda also applies tweaks to binaries in order to make them relocatable. (As another example, Docker handles these tweaks at the syscall level rather than the user-mode binary level, which allows existing build processes to be used unmodified.)

manylinux is basically our way forward here, though honestly I haven’t successfully used it :wink: [1] Being able to trigger gcc in a way that knows to compile a binary that will work with manylinux is the big challenge. Especially given it also has to work with libraries and code that has never considered manylinux before. [2]

So as important as the assembly from file layout would be, it’s not actually the biggest issue we face. Even with tools to do it, there’s a chain of other tools that don’t support the approach, and most of those are outside of “Python packaging” control. A big porting effort, one that likely can’t stay fully compatible with the “old world”, is needed to really make it feasible to declare a “modern” way to build packages including native code. Getting there likely involves both users and developers “switching” in some way, which means a fresh new platform[3] has the advantage. But existing platforms have a lot of inertia.

tl;dr: generating a package from what the build script installs doesn’t really help.


  1. Virtually all of my packages are Windows-only, because that’s my focus, so I haven’t built any native code for non-Windows platforms in a while. ↩︎

  2. The same applies for “code that has never considered Windows before”, etc. We’ve gone from a very narrow context to a very inclusive one, and mostly haven’t done the boring porting work to actually enable it. (Though that work becomes very exciting if you want to preserve compatibility… spoiler: you eventually can’t.) ↩︎

  3. e.g. WASM ↩︎

Maybe this is just an abstract philosophical point, but conda was designed as a cross-platform tool from the start. Chris has a good point I think, the conda-build model is indeed cleaner and much more reusable than the many ways of creating a wheel directly.

There’s a problem though, aside from the lack of separation between build/install and “create a package for distribution”: tools don’t know whether you’re intending to distribute something or not. I can’t count the number of times I’ve done pip install .; (arghhh Ctrl-C); pip install . --no-build-isolation. What pip does is a poor default here for local development.

This is an important issue, but certainly not the only one. A key decision to make is “should I vendor these shared libraries, yes or no”. The right answer will be different for distributing vs. a local install – on macOS and Windows as well. So being explicit about whether you’re distributing or not would be quite helpful.

Making complex packages distributable is very much nontrivial and has tradeoffs, so pretending that no such differences exist is unhelpful in general.

There are also things that can be done to make both the current model for building wheels and an “assemble from files” model better. One that comes to mind immediately is platlib vs. purelib. This is a constant source of friction for no good reason - the concept is completely obsolete, and removing the distinction would simplify a lot of build/packaging tool implementations.
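As a quick illustration of that friction – on a typical virtual environment these two paths resolve to the same site-packages directory, which is exactly why the split feels obsolete:

```python
import sysconfig

print(sysconfig.get_path("purelib"))   # where pure-Python packages are installed
print(sysconfig.get_path("platlib"))   # where packages with compiled extensions go
```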

3 Likes

I think it’s just a familiarity point.

When you start (as I have) from the POV of building in one place, then packaging, then installing, it seems very strange to build, then install, then package. But if your background is systems that primarily build, install, then package, then the alternative definitely seems weird.

The issue is that when installing comes before packaging, your tools make all kinds of assumptions about only having to install on the current system. We see these flow throughout Linux and virtually everything that’s ever been built on it.

There’s nothing wrong with this, to be clear, it’s just that wheels have moved the goal posts. Now the focus is on packaging/distribution, rather than installing, and so if you put installing above packaging then you force a whole lot of issues into the packaging stage e.g. whether to vendor or not, or “tools don’t know whether you’re intending to distribute something or not”.

Just as an example (there are others), Windows doesn’t care whether you’re intending to distribute something - the philosophy underlying Windows development is that you’re always going to distribute it. It’s actually a bit of a pain if you just want to build and use something locally, because you still have to distribute it to yourself, but when you want to share a build with others it flows more naturally. Most of the Windows-specific tools grew up under these assumptions, and so they have different designs from Linux tools.

Again, being a bit philosophical, only a Linux-focused developer thinks this is a key decision :wink: A Windows developer thinks “of course I vendor these shared libraries, because that’s the only way my user will get them” - it’s barely a conundrum (though of course you still get all the issues associated with DLL hell, instead of the alternate world of apt-install-hell).

My main point is that wheels have shifted our platform so much further away from a Linux-like model that Linux-like assumptions about building don’t serve us well anymore. Neither do any other assumptions from any other platform. The “modern” way doesn’t exist yet, we have a whole lot of patches on top of the old ways to make them work. And I expect a new modern way will grow up around a platform like WASM that none of us are currently using to build things, as only then will the target be separated enough from the build environment to let us actually ensure things are redistributable.

3 Likes

You’re ultimately right, although the additional layer of path and environment manipulations in setup.py and cpp_extension.py isn’t exactly helping with distribution either.

I think what you’re talking about is that development and distribution are, generally, different processes carried out by different people. Upstream developers might be making their own assumptions, which might be complicating distribution for particular platforms… I think we risk seriously deviating from the subject if we talk about this in terms of a “*nix vs Windows distribution model”.

Hm? They don’t seem too different from, say, .debs or any other package formats: they’re archives with metadata, following some agreed-upon tree structure (like the FHS or Python’s site-packages), making some choices about which bits to ship, and which to assume. In particular, …

…I thought of manylinux wheels just as archives with build artifacts expecting a runtime with a libc newer than some particularly old revision from a particularly old CentOS. Unless I’m terribly misguided, they are just “a” choice of which shared libraries to vendor, and which “system” libraries to assume.
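For what it’s worth, that view is easy to check from the runtime side. Using the third-party packaging library, you can see the glibc the interpreter runs against and the manylinux tags it will therefore accept (on Linux; the list is empty elsewhere):

```python
import platform
from packaging.tags import sys_tags

print(platform.libc_ver())     # e.g. ('glibc', '2.31') on a glibc-based Linux

manylinux = [t for t in sys_tags() if "manylinux" in t.platform]
for tag in manylinux[:5]:
    print(tag)                 # e.g. cp310-cp310-manylinux_2_17_x86_64
```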

Just as a constructive proof of existence: Nix handles the native build in a way that makes the result redistributable; this is not much of an issue. Assembling a valid wheel and ensuring that Python’s pkg_resources, setuptools, etc. correctly discover the package installed from it is less trivial. This is why I suggested focusing on tooling around wheels.
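As a quick sanity check of that last point – after installing a freshly assembled wheel, the distribution should at least be visible through the standard metadata APIs (“somepkg” is a placeholder name here):

```python
from importlib.metadata import PackageNotFoundError, distribution

try:
    dist = distribution("somepkg")                    # placeholder distribution name
    print(dist.metadata["Name"], dist.version)
    print([str(f) for f in (dist.files or [])][:5])   # a few of the files the wheel installed
except PackageNotFoundError:
    print("somepkg is not visible to the interpreter's metadata machinery")
```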

3 Likes

TL;DR:

Python packages are inherently more suited to the “Linux model” than the “Windows model” – at least as far as building and vendoring is concerned.

Long version:

the philosophy underlying Windows development is that you’re always going to distribute it.

sure – but in the context of Python packages on PyPI, that’s the whole point as well.

A Windows developer thinks “of course I vendor these shared libraries, because that’s the only way my user will get them”

I actually think this is the bigger distinction between the platform approaches.

(and note: the Mac is very much this way as well – though it has “*nix” under the hood, so you can kinda-sorta choose how you want to do it, e.g. Homebrew)

Anyway, I think this distinction is not so much the how you build / distribute, but what you build / distribute.

That is, Windows and the Mac are all about distributing monolithic applications, whereas the *nix model is more about an integrated system – you distribute all the various libraries, and then applications all share the same set of libs.

Arguably, the monolithic application approach is easier for end users of applications – which is why even on Linux, things like firefox, and especially closed source applications, are delivered as monolithic apps (or they were, it’s been a while :slight_smile: )

But Python packages are not monolithic applications – you may build a full application with Python – and indeed many do, via Py2exe, PyInstaller, etc. – but the packages themselves are usually not a full application. In fact, if we are talking “vendoring”, then Python itself should be vendored, yes? (which it is with the “Installers”)

Python packages are in a middle space between, say, C code and a final application. And given that some of us do make and distribute monolithic applications built with Python, we would rather that each package did not vendor everything.

I have built Python applications using packages that all use many of the same libraries – for instance, at least 6 that use libpng (wxPython, PIL, GDAL, Matplotlib, py_gd, ndimage). And with things like BLAS, etc. for the scipy stack, vendoring is really not the best option. Granted, computers have a lot of memory these days, so maybe an application including 6 slightly different copies of the same lib is not a big issue, but it sure feels wrong to me :-).

And for folks doing, say, data analysis with Jupyter and the like – the whole stack is the application, not each package.

Also – what would one “vendor”? Should scipy be vendoring numpy? Should Jupyter be vendoring the whole scipy stack? We’ve clearly already decided that no, that’s not how we want to do it in Python. So why do we make a distinction between a dependency that’s written in Python and one that’s written in C?

This problem is particularly acute in the scipy stack, which is why conda was developed at all - and why it came out of the scipy community.

The fact is that Windows and Mac don’t natively supply a solution to this problem. Which is why macports, homebrew, Chocolatey, and ?? have been developed.

In fact, while conda is arguably designed around the “Linux model” – it really exists because of Windows and the Mac – Linux had already solved the problem. And it’s certainly used more on Windows than any other platform.

Final point: key to conda is that you are not building and installing the package into the developer’s system, but into an isolated environment, which makes it a little less like the user-compiles-it-themselves model. The fact is that a build-then-distribute-then-install model requires a well-defined target platform as well – an isolated environment is a way to define that in a really clear and testable way.

4 Likes

I think this is another Linux-y assumption, though I can come up with reasonable definitions that fit what you’re saying, so perhaps it just needs more pedantic definitions :wink: In my world, “production of the package that will be redistributed” is a development task, even though someone else is going to print the CDs/host the files/etc., and production of the package is the relevant part here.

Chris expanded on this further, but you nailed it with “making some choices about … which to assume”. Wheels only really assume a standard Python install, they don’t define anything else, whereas each Linux distro has its own package format which is going to assume it’s part of that distro. They’re closer to wheel+platform tag, and so “manylinux wheel” defines as much information as “.deb package”.

Agreed, Nix is another good example of this. In particular, it starts fresh and defines a build process that allows a redistributable result. You’re going to see the same thing happen with WASM tooling, and have already seen it with Conda, but they all break compatibility with what came before in some way.

Well it is now :wink: Pre-wheels, it wasn’t. But maybe it’s been long enough we can pretend that that’s prehistoric.

This is a very good topic that I deliberately avoided getting into because my posts were long enough already. Thanks for going here, you covered it well :slight_smile:

Well, PyPI was always about distributing packages – it just didn’t use to be about distributing binaries – and for pure-Python packages, there isn’t such a huge distinction :slight_smile:

Which brings up a point – it seems to me that most of the PyPA activity has been focused around “pure python” packages, and then we try to cram more complex packages into that system, which is not always easy. Hence this thread.

The modern systems that make a cleaner separation of concerns should help – we’re getting there.

2 Likes

I’d say that the PyPA have focused on solving general problems (ones that affect all packages, rather than only ones with specific content like platform-specific binaries). Not least because we’re typically not experts in more complex build systems. We invite feedback from specialists in particular areas, but typically don’t get much. This is for good reasons - it’s hard to know whether a general approach will satisfy your specific needs until you get a chance to try it.

When the specialists do get round to trying to make the standards work, hopefully they provide feedback that improves the standards in those areas. That’s how (for example) the manylinux and musllinux standards came about. More of that sort of feedback cycle would be great - we are still lacking in any real expert input around GPU-based stuff, for example. This thread is picking up on the fact that we’re also still lacking in general “building compiled extensions” input - as far as I (as an outsider) can tell, the general principle is still “if you need to build C code, use setuptools” and when that breaks down (builds too complex to fit the setuptools model, languages other than C) you’re assumed to be enough of an expert or special case that you have to find your own way.

1 Like