Some thoughts about easing a migration path away from setup.py

Assuming they don’t type it out, this just kicks the can down the road. Either you need the tool to somehow get it right all the time, or you have to tell the tool what the rules are, and then you need a format for that input. Either way, a manifest like that isn’t solving the problem; it’s just recording the solution (and whether that happens before or after the actual file copy doesn’t matter much).

Aside from the feature request to make and work with a manifest as a separate task, I think you are both on the same page, actually. But while such a manifest file could be validated and such, I’m not convinced that it actually facilitates (re)building a wheel.

I relied on an online repository counting tool, applied too naively - the issue was already pointed out. Sorry about the confusion; in any case, my planned feature set is surely even less ambitious.

I checked out extensionlib - unfortunately I couldn’t get a clear sense of how it works, in particular how it actually helps with producing the extension modules - what I saw only seems to help organize the code that would do so. Also, by my reading it would not be PEP 621 compliant to add a [[project.extensions]] array.

However, I do strongly agree with the idea of separating extension building from wheel packing. I just figure the easiest way for the extension builder to communicate the desired location of the build artifact - given that it’s going to operate within an isolated environment that will ultimately hold the wheel contents - is to just put the build artifact in the appropriate place in that environment. I don’t really want to define a separate API for that. Although I guess a separate API would avoid coupling to that design decision, for other build backends that want to work differently… ?

Thinking about it some more, I think I must have been. But I will need to understand it in more detail.

This is a fair assessment and I don’t mean to backseat the Setuptools team. The config scripts I imagine would not be compatible with Setuptools, nor my implementation with a current setup.py. Rather, the goal is to take inspiration from setup.py, and design something that people accustomed to setup.py could figure out easily enough, as the next step after moving static bits to pyproject.toml. I’m starting my analysis from Setuptools because it’s the obvious starting point: it’s what Pip uses by default, and what makes all those existing setup.py files work. But I’m neither trying to refactor Setuptools into oblivion nor asking anyone else to do the same; instead, I’m building upward from the example code in PEP 517 (which seems to be exactly what the PEP intended to happen).

Frightening, and impressive. I agree that it would be far better to leave it to those who already have a head start on the task. But given that, I’m now more interested in how to interface with it. It looks like there is no API and the intended interface is all command-line, and that you basically use it just to produce the necessary artifacts? I see some stuff in the documentation about installing things, but it seems to refer to system-level stuff, so it’s not compatible with the wheel format. (The examples seem focused on standalone C executables anyway.)

I guess there’s also the option of using Ninja directly, but it seems like at some level there’s always going to be some interface layer somewhere that does a subprocess.call etc. to invoke the C-building system.

When I dig through all the layers (Setuptools → distutils build command → build_clib or build_ext command → CCompiler base class (the compiler attribute of the command) → _compile in an implementation class → spawn back in the base → top-level spawn function), I do in fact end up at such a wrapper. (I don’t know why I had any doubt I would.) It’s just that all the intermediate layers seem to be trying to implement some part of what Meson etc. do; and you’ve very much convinced me not to try to implement any of that.

So now I’m firmly convinced that I just want to smooth out that step a little bit, which honestly is pretty much what I originally had in mind. People who want to shell out to Meson can install Meson and do that. People who just want to make one manylinux wheel, and know exactly what gcc commands they want, can use those commands directly instead. There just needs to be a hook for the right point in the process to do that, and a wrapper for things like logging and collecting errors from each invocation.
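
To make that concrete, here is a minimal sketch of what I mean by the hook-plus-wrapper arrangement. None of these names correspond to an existing API; the hook signature, paths and gcc flags are just made up for illustration.

```
import subprocess
import sys
from pathlib import Path

def run_build_commands(commands, log=sys.stderr):
    """Run each compiler command, logging output and collecting failures."""
    errors = []
    for cmd in commands:
        log.write(f"running: {' '.join(cmd)}\n")
        result = subprocess.run(cmd, capture_output=True, text=True)
        log.write(result.stdout)
        log.write(result.stderr)
        if result.returncode != 0:
            errors.append((cmd, result.returncode))
    return errors

def example_build_hook(build_dir: Path):
    """A user-defined hook for someone who knows exactly which gcc
    invocations they want. (A real command line would also need the
    Python include directories, extra -l flags, etc.)"""
    src = build_dir / "src" / "mypkg" / "_speedups.c"
    out = build_dir / "src" / "mypkg" / "_speedups.so"
    return [["gcc", "-shared", "-fPIC", "-O2", str(src), "-o", str(out)]]
```

People who prefer Meson or CMake would instead have their hook shell out to those tools; the wrapper only cares about subprocesses, logs and exit codes.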

what I saw only seems to help organize the code that would do so

Yes, that is literally the purpose and only that. Whether you are using CMake, Rust, or whatever else to build extensions doesn’t matter; the interface is the same and standardized.

Okay, I’m glad I understood properly.

When you said

If “this” can include other designs for the same fundamental idea, then I’m happy to help. But from what I understood of the interface described by extensionlib, I didn’t really like it. From experience, it’s hard to explain this kind of thing, and it depends a lot on subjective personal preferences. I think it will be easiest for me to express my own ideas in code.

Can you please briefly describe, not the interface, but at a very high level, what you think the components of building Python packages are/should be, conceptually?

edit: specifically a wheel, forget about all other possible outputs


This is the flow I imagine for building a wheel (there is a rough sketch in code after the list).

  1. A build frontend invokes the PEP 517 build_wheel hook.

  2. The build backend creates a temporary folder that will contain the files to be packed. (Aside from build isolation, this is the easiest way to handle the requirement that the source folder may be read-only.)

  3. The backend parses pyproject.toml and produces a combined config object from the frontend’s config_settings and the appropriate [tool] table. It remembers the [project] table for later metadata creation.

  4. The backend invokes a “manifest” hook, which is responsible for copying necessary files and folders into the temporary folder - laid out as they would be for an sdist. Normally this will use a built-in hook provided by the backend (which in turn may care about the config), but it can be user-defined for more control.

  5. The backend invokes zero or more “build” hooks, which are responsible for invoking compilers as needed. There can be several that handle separate extensions, or one that oversees the entire process (possibly doing its own imports of helpers), or none for a pure Python wheel.

  6. The backend invokes a “cleanup” hook, which is responsible for any necessary rearrangement and for deleting C source files. After this step, the packages for the wheel should be in src/, and certain other subfolders at top level can be used to specify the wheel’s data files. Anything else at top level will at most be used for metadata. The default cleanup hook basically just enforces “src layout”.

  7. Metadata is generated based on any README, LICENSE etc. files that remain at top level. (This is deferred in case the cleanup hook does something especially tricky.)

  8. The backend reorganizes and packs the appropriate folders into the wheel, and (most likely) removes the temporary folder. It returns the wheel’s basename to the frontend, per PEP 517.
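
As a rough sketch in code, the whole flow might wire together like this. Every hook and helper name here is a placeholder of my own for the steps above (imported from a hypothetical mybackend module); the only externally visible interface is still PEP 517’s build_wheel.

```
import shutil
import tempfile
import tomllib  # Python 3.11+; older versions would need tomli
from pathlib import Path

# Hypothetical helpers corresponding to steps 4-8 above.
from mybackend.hooks import manifest_hook, build_hooks, cleanup_hook, write_metadata, pack

def build_wheel(wheel_directory, config_settings=None, metadata_directory=None):
    staging = Path(tempfile.mkdtemp())                    # step 2: temp folder to pack
    pyproject = tomllib.loads(Path("pyproject.toml").read_text())
    project = pyproject["project"]                        # kept for step 7
    config = {**pyproject.get("tool", {}).get("mybackend", {}),
              **(config_settings or {})}                  # step 3: combined config
    manifest_hook(staging, config)                        # step 4: lay out as for an sdist
    for hook in build_hooks(config):                      # step 5: compile extensions
        hook(staging, config)
    cleanup_hook(staging, config)                         # step 6: enforce src layout
    write_metadata(staging, project)                      # step 7: metadata from [project] etc.
    basename = pack(staging, Path(wheel_directory), project, config)  # step 8
    shutil.rmtree(staging)
    return basename
```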

Building sdists would be essentially the same for the first four steps. It would skip steps 5 and 6, and have different/simpler rules for steps 7 and 8.

As I understood your idea, the separation here is between step 5 and everything else.

Re-reading this, I realize I didn’t decide how/where the wheel tags are computed. I guess the cleanup hook is the most sensible place for that.


Thanks! I now understand what you were talking about.

As always, I am in favor of 5, since that is the concept behind extensionlib, but:

4.) At face value it’s wasteful compared to just putting everything in a source distribution, but it would actually be an improvement, because many tools build the wheel from the source distribution, so an unpacking step would no longer be necessary. I would be in favor, except I don’t think this optimization will realistically be accepted: the standards would have to be updated and every backend would have to change. Since this is just an optimization, I don’t see that happening.

6.) I think this is trying to do too much, and it is largely unnecessary if we have 5, because the outputs would be known and could therefore be removed. Anything extra should be the purview of build backends and other tools.

Maybe I should have been clearer that this is only the design I’m expressing in my own project.

To build sdists, there has to be some kind of step that decides what goes into an sdist. I expect that almost everyone will be able to use the default, but it’s a clear separate step in my design so I might as well expose the hook. Aside from that, once we already have the decision to copy files to a build folder, “everything laid out as it should be for the sdist” seems to me like the most natural starting point for a wheel build.

I see the opportunity there for an optimization, but I’m not trying to push it on others (at least, not yet). Many other toolchains want to verify explicitly that the sdist can be unpacked to build a wheel, and indeed that’s what build does by default. In fact, since PEP 517 doesn’t specify an interface for building both at once, I could only take advantage of the optimization by exposing a config setting (a flag for build_wheel that means to pack the sdist as well), and then I couldn’t communicate to the frontend about it. So, doing it properly would take a new PEP, and I don’t know how well that would be received.
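
Concretely, the best I could do within PEP 517 is something like the following, where the flag name is made up and the frontend has no standard way to discover or collect the extra sdist:

```
def build_wheel(wheel_directory, config_settings=None, metadata_directory=None):
    config_settings = config_settings or {}
    # Hypothetical opt-in flag: also pack an sdist from the same staging
    # folder while building the wheel. PEP 517 gives the frontend no way to
    # learn about this extra artifact, which is why doing it properly would
    # need a new PEP.
    also_pack_sdist = config_settings.get("also-pack-sdist") == "true"
    ...
```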

In terms of cleanup, maybe it won’t be necessary in general, but again I am just exposing a hook in my own design. But my thinking is that someone might want to write per-extension hooks that leave the .so files etc. in the simplest places, and then a single overall hook that figures out where they go. Or maybe they all need to be linked together at the end somehow.

There’s also the issue of wheel tags, which have to be determined somewhere. Maybe the build system is responsible for figuring out what the platform is, which then determines the wheel tags. Maybe it has to do something different to target different Python versions. I guess this is something where I’d have to talk to cibuildwheel users to get a better idea.
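
For what it’s worth, a rough sketch of the tag computation itself, using the packaging library and only targeting the currently running interpreter (cross-building for other Python versions or platforms, as cibuildwheel users might want, would need more than this):

```
from packaging import tags

def guess_wheel_tag(has_extensions: bool) -> str:
    """Pick a wheel tag for the current interpreter and platform."""
    if not has_extensions:
        return "py3-none-any"
    # sys_tags() yields tags from most to least specific for this interpreter.
    best = next(iter(tags.sys_tags()))
    return f"{best.interpreter}-{best.abi}-{best.platform}"
```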

What matters is that it should be easy to understand when something of potential consequence is being changed, like files being added to or removed from the sdist or wheel. If you have a VCS-controlled manifest file then it is very clear when the contents of the release artefacts are being changed: all changes are explicitly visible in the diff of any pull request, whether that means changing the contents of the files or changing which files are included.

What also matters is what exactly can be made standard. The different tools like setuptools, poetry, hatch etc. have all made opinionated decisions about how to specify the configuration of which files are included in the sdist/wheel, and it seems unlikely that we could get them to agree on a single standardised approach for this configuration. What can be standardised, though, is a very simple manifest file format that makes no implicit or opinionated decisions and that any tool can easily output or consume.

Indeed. In fact, “I prefer tool X’s opinionated decision” sounds like one of the main reasons someone would choose that tool. Part of the point of PEP 517, as I understand it, was to enable that kind of expression.

To be clear, do you imagine that there would be tools that produce a manifest but don’t build a wheel? And tools that expect the manifest file to exist rather than using their own scheme? (Or perhaps they’d offer a switch to override their [tool]-specific config with the manifest… ?)

I can see value in that, but I’d be opposed to mandating that any particular toolchain supports such a flow.

I guess the format is not quite as straightforward as it sounds, so there would be some point in standardization because there are actual decision points. At least, I can think of one: how to represent folder structure (either with some hierarchical organization - maybe involving indentation - or else by explicitly giving the full path for every file).
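
To illustrate that decision point, here are the two representations I have in mind, with made-up file names (neither is an existing format; the nested form is shown as a Python mapping only to keep the example runnable):

```
# Flat form: the full path of every file, one entry each.
flat = [
    "pyproject.toml",
    "src/mypkg/__init__.py",
    "src/mypkg/_speedups.c",
]

# Hierarchical form: folders nest their contents; None marks a plain file.
nested = {
    "pyproject.toml": None,
    "src": {"mypkg": {"__init__.py": None, "_speedups.c": None}},
}

def flatten(tree, prefix=""):
    """Expand the hierarchical form into the flat full-path form."""
    for name, children in tree.items():
        path = prefix + name
        if children is None:
            yield path
        else:
            yield from flatten(children, prefix=path + "/")

assert sorted(flatten(nested)) == sorted(flat)
```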