What would packaging look like if we designed it from scratch today?

Forget that setuptools and all the other packaging tools exist. Let’s assume for a second that all we had were the PEPs we have written, and even those could be changed if necessary (although that would be discouraged unless it proved really necessary). What would we want from packaging to be useful to us today, and what are we missing to make that dream come true?

Here’s a rough outline of what, as I understand it, would be necessary both to create a package for distribution and to work with it locally while developing (a sketch of a tool driving these steps follows the list):

  1. Read pyproject.toml to find out how to run the build tool
  2. Build tool builds extension modules and other non-Python code
  3. Appropriate metadata for the wheel gets generated
  4. Built extension modules, Python source, and metadata files get put into a wheel
  5. Read wheel file for required dependencies
  6. Repeat the above steps until down to the leaf nodes of the dependency tree
  7. Install all the required wheels

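To make that concrete, here is a minimal sketch of a tool driving steps 1–4. It assumes Python 3.11+ for tomllib, uses the PEP 517 build_wheel hook, and leaves out dependency resolution, build isolation, and steps 5–7 entirely; the project path is whatever you point it at.

```python
import importlib
import tomllib  # Python 3.11+; use the third-party "tomli" package on older versions
from pathlib import Path

def build_wheel_for(project_dir: str, wheel_dir: str) -> Path:
    project = Path(project_dir)
    # Step 1: read pyproject.toml to find out how to run the build tool.
    with open(project / "pyproject.toml", "rb") as f:
        pyproject = tomllib.load(f)
    backend_name = pyproject["build-system"]["build-backend"]
    # Steps 2-4: hand everything off to the build tool via PEP 517's
    # build_wheel hook; it compiles extension modules, generates metadata,
    # and packs it all into a wheel. (A real tool would first install the
    # backend's own requirements in an isolated environment per PEP 518,
    # and would run the hook with the project directory as the cwd.)
    module_name, _, attr = backend_name.partition(":")
    backend = importlib.import_module(module_name)
    if attr:
        backend = getattr(backend, attr)
    return Path(wheel_dir) / backend.build_wheel(wheel_dir)
```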
OK, so we have PEP 517 on how to execute the build tools. What is required of those build tools? (A minimal sketch of such a build tool follows the list.)

  1. Know how to build extension modules if there’s C code
  2. Know how to build anything else that’s unique (e.g. if you want Cython to be run as part of this instead of as a separate step outside the build tool)
  3. Know where/what those built artifacts are
  4. Know where the Python source code is
  5. Know all the required metadata for a wheel
  6. Be able to make the metadata files for a wheel
  7. Where to place files within the wheel
  8. Package it all up in a zip file with a .whl extension name and appropriate tags
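To give a feel for how little is strictly required, here is a toy, pure-Python-only sketch of such a build tool. Items 1–3 (extension building) are ignored, the name, version, and tags are hard-coded placeholders, and RECORD is skipped, but it covers knowing where the source is, producing the metadata files, and zipping everything up with a .whl name and tag.

```python
import zipfile
from pathlib import Path

NAME, VERSION = "mypkg", "1.0"   # placeholder metadata
TAG = "py3-none-any"             # pure-Python wheel tags

def build_wheel(wheel_directory, config_settings=None, metadata_directory=None):
    """PEP 517 hook: produce a wheel and return its file name."""
    wheel_name = f"{NAME}-{VERSION}-{TAG}.whl"
    dist_info = f"{NAME}-{VERSION}.dist-info"
    with zipfile.ZipFile(Path(wheel_directory) / wheel_name, "w") as whl:
        # Copy the Python source into the wheel, keeping its package layout.
        for py in Path(NAME).rglob("*.py"):
            whl.write(py, arcname=str(py))
        # Write the wheel metadata files; RECORD is skipped in this toy.
        whl.writestr(f"{dist_info}/METADATA",
                     f"Metadata-Version: 2.1\nName: {NAME}\nVersion: {VERSION}\n")
        whl.writestr(f"{dist_info}/WHEEL",
                     f"Wheel-Version: 1.0\nGenerator: toy\nRoot-Is-Purelib: true\nTag: {TAG}\n")
    return wheel_name
```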

So how do we do those steps? I would argue that building C extensions should be the job of a C build tool. The rest of the steps are just metadata that you should put somewhere in pyproject.toml, plus knowing how to execute the C build tool and how to get the resulting files from it so they can be placed appropriately in the wheel. Could we go so far as to take meson’s approach of generating build files once and then executing them directly over and over, and orient our build tools so they generate the C build tool’s configuration (e.g. a build tool generates meson config files that people can execute directly post-generation, instead of having to use the Python build tool to drive the build every time)?
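As a sketch of what punting to a C build tool could look like (the commands and the "just harvest the .so files afterwards" convention are my assumptions, not anything a current backend specifies):

```python
import subprocess
from pathlib import Path

def build_extensions(source_dir: str, build_dir: str = "builddir") -> list[Path]:
    """Drive an external C build tool and report where the artifacts ended up."""
    # One-time configuration; after this, developers could run ninja directly.
    subprocess.run(["meson", "setup", build_dir, source_dir], check=True)
    # The actual compilation is entirely the C build tool's problem.
    subprocess.run(["ninja", "-C", build_dir], check=True)
    # The Python-specific part shrinks to "find the built extension modules".
    return sorted(Path(build_dir).rglob("*.so"))
```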

After that, what does the above lack that setuptools has and that people legitimately want? I can think of entry points and editable installs. For entry points, I would personally redefine them as an option where you specify the executable name, and all it does is rename the __main__.py in your repository to that executable in the wheel. There would be no need to generate code, and it would be easily supported by all install tools (it basically becomes a file rename in the wheel, and the install tool just slaps on the #! line when the file ends up being written to disk).
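Roughly, the install tool’s side of that idea could be as small as this sketch (POSIX only, since Windows would still need a real launcher, and the names here are placeholders):

```python
import sys
from pathlib import Path

def install_renamed_entry_point(main_py: str, exe_name: str, scripts_dir: str) -> Path:
    """Install a package's __main__.py as an executable script named exe_name."""
    target = Path(scripts_dir) / exe_name
    # The "rename" in the wheel, plus the #! line the install tool slaps on
    # when the file is written to disk.
    target.write_text(f"#!{sys.executable}\n" + Path(main_py).read_text())
    target.chmod(target.stat().st_mode | 0o111)  # mark it executable (POSIX)
    return target
```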

As for editable installs, that seems to require:

  1. Running the C build tool
  2. Copying the extension modules to the right place among your Python source code
  3. Creating a .pth file that updates sys.path appropriately (although I personally hate .pth files and want to come up with a better supported solution, but one thing at a time)

Maybe I’m missing something, but that all seems tractable. If I’m not off, then the key missing piece is for PEP 517 to somehow separate out C compilation, along with identifying where the extension modules are and where they should end up, as its own step, so the overall build tool can copy those files to the appropriate place and let the .pth file do its thing. (The other bit is running the installation tool for your dependencies, but that can still be done by the build tool, or even by a normal installation tool that knows how to read pyproject.toml, if we standardize the metadata specification further so that dependencies are universally listed there.)
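Put concretely, an editable install along those lines could look like the sketch below: invoke the C build, copy the extension modules in amongst the Python source, and drop a .pth file into site-packages (the build command, layout, and file names are all placeholders):

```python
import shutil
import subprocess
import sysconfig
from pathlib import Path

def editable_install(source_dir: str, build_dir: str = "builddir") -> None:
    src = Path(source_dir).resolve()
    # 1. Run the C build tool (placeholder command).
    subprocess.run(["ninja", "-C", build_dir], check=True)
    # 2. Copy built extension modules in amongst the Python source.
    for ext in Path(build_dir).rglob("*.so"):
        shutil.copy2(ext, src / ext.name)
    # 3. Point the interpreter at the source tree via a .pth file
    #    (the site module adds each listed directory to sys.path).
    site_packages = Path(sysconfig.get_paths()["purelib"])
    (site_packages / f"__editable__{src.name}.pth").write_text(f"{src}\n")
```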

Assuming the very rough outline above is not totally bonkers, some questions from me:

  • Am I nuts in thinking that we can simplify things by punting more to C build tools, so that for Python-specific tools it’s more about identifying where the C build tool put the .so files?
  • Would we standardize the universal metadata of projects in pyproject.toml if we didn’t have the history of MANIFEST.in, setup.py, and setup.cfg?
  • If we still wanted entry points could they be as I described above?
  • Could editable installs be more about breaking the build steps into a more fine-grained fashion, so that it’s more “invoke the C build, copy .so files to the right place, spit out a .pth file” (and if we did universal metadata, just use your installation tool to install your dependencies from pyproject.toml on your own)?

What am I missing? This all feels tractable, which honestly scares me :wink:, so I suspect I’m missing some key thing in the fundamentals of packaging that will make this way more complicated and harder to accomplish (remember, backwards-compatibility is not a concern in this exercise; this exercise is about what we want packaging to be).

6 Likes

3 posts were split to a new topic: How do we get out of the business of driving C compilers?

Entry points aren’t just for executables; they’re a generic plugin database. Also on Windows, install tools have to actually generate .exe files, not just stick a #! line into a .py file. And even on Unix, I think renaming __main__.py files would cause trouble with relative imports?
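For anyone who hasn’t used them that way, here is the plugin-database angle from the consuming side; the group and entry names are made up, and the group= lookup form needs Python 3.10+:

```python
# In the providing package's metadata (entry_points.txt inside .dist-info),
# an entry point is just "name = module:attribute" under a group heading:
#
#     [myapp.plugins]
#     csv = myapp_csv.plugin:CSVPlugin
#
# A consumer discovers and loads every registered plugin like so:
from importlib.metadata import entry_points

for ep in entry_points(group="myapp.plugins"):  # hypothetical group name
    plugin = ep.load()  # imports myapp_csv.plugin and grabs CSVPlugin
    print(ep.name, plugin)
```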

Entry points are just some extra package metadata though; I don’t think there’s anything special about them compared to other metadata.

That’s basically the idea behind scikit-build, mesonpep517, enscons, etc. We can’t do this in setuptools because all the details of its build process are basically public API (blame Hyrum’s law, exacerbated in this case by setuptools having very leaky/inadequate abstractions).

Hard to say. In fact we do have backcompat constraints – one of the design goals in PEP 517 was to be able to continue supporting existing setup.py-based projects, and they use arbitrary code to generate the metadata, so that had to remain possible. Even without that, you’d probably need some kind of escape hatches, e.g. some people really like generating their version field from VCS metadata, and we don’t really know which escape hatches would be needed because we don’t have much experience.

Maybe! You need to be able to somehow handle tools that prefer out-of-place builds (keeping a pristine source tree), and packages that want to do more unusual things like build Rust/Go/Swift/Fortran code, generate .py files at build time, or produce other arbitrary artifacts that get included in the final package (e.g. data tables in some specialized format), and you need to figure out how to communicate the package layout info between the frontend and backend (right now we use wheels as our data format for this, but that doesn’t directly work for editable installs). But that doesn’t mean it’s impossible, just that it needs the right level of abstraction.

1 Like

Whatever this ideal new packaging system winds up being, it should be capable of doing what setuptools_scm does. It’s not for everyone, but when you can make it work for your package, it is really nice to be able to store versioning information in the git tags and nowhere else.
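The kernel of that feature is tiny; here is a sketch of the idea only (not setuptools_scm’s actual algorithm, which also handles distance from the tag, dirty working trees, archives, and other VCSes):

```python
import subprocess

def version_from_git_tag(repo_dir: str = ".") -> str:
    """Derive a version string from the most recent git tag, e.g. 'v1.2.3' -> '1.2.3'."""
    out = subprocess.run(
        ["git", "describe", "--tags", "--abbrev=0"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return out.stdout.strip().lstrip("v")
```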

I moved posts on the topic of extension building to its own thread: How do we get out of the business of driving C compilers?

1 Like

My exposure has just been as a way to make executables, hence my suggestion. :slight_smile:

Right, so I’m arguing we should go down that road even more.

I assume so; there’s just too much old code that’s only available as an sdist using setup.py to ignore it.

Well, you can do the simple baseline and then have a huge escape hatch for metadata generation. I would argue, though, that as a community we know which fields people generate dynamically and which ones they tend to write out statically.

Right, I would assume that would be part of the API to communicate what needs to go where. Maybe specifying how an artifact would land relative to the Python source tree, which could be used both to place files in-place for editable installs and when copying files into a wheel.
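One hypothetical shape for that, purely to illustrate (the hook name, keys, and paths are all invented): the back-end reports a mapping from built artifacts to destinations relative to the package root, and the front-end uses the same mapping whether it is copying in place for an editable install or writing into a wheel.

```python
# Hypothetical back-end hook: report where each built artifact belongs,
# relative to the root of the installed package / wheel.
def get_artifact_layout():
    return {
        "builddir/_speedups.cpython-312-x86_64-linux-gnu.so": "mypkg/_speedups.cpython-312-x86_64-linux-gnu.so",
        "builddir/generated_tables.bin": "mypkg/data/tables.bin",
    }

# A front-end could then do either of:
#   * editable install: copy each source path to <source tree>/<destination>
#   * wheel build:      write each source path into the wheel at <destination>
```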

Yes, if we look into designing an alternative backend that doesn’t directly handle compilation, that is a tractable problem as you’ve described.

enscons showed that this is a reasonable approach for Python packages. cargo is an example of a “popular”/“default” tool that uses a similar mechanism.

Dropping backward compatibility is very… “freeing” for such an exercise. I do think that you’ve covered most of the build-backend-related bits, in your OP here.

I have a draft blog post (somewhere on my laptop) that notes all this stuff. I’ll try to go find it and polish it up for publishing [1].

I remember covering the build frontend, simplifying environment management, what new standards could enable and more.


I do feel that I should flag – we shouldn’t decide to act along the lines of PEP 426 as a result of this thought exercise here (rewrite a big thing!), but rather try to minimize the “diff” from where we are and chart an approximate course to get there.

[1]: Well, that’ll be sometime next week because college. :upside_down_face:

1 Like

Agreed. In this case, I think that backward compatibility has some specific issues, because the scope of the current codebase has grown “organically” and in directions we probably would not have accepted if we had known what we know now. So I see the main advantage of (temporarily) ignoring backward compatibility as being a chance to re-focus our view on scope.

The advantage of thinking in terms of new tools that solve parts of the problem is that we still have the existing tools to cover the current scope. Making the old tools take advantage of the newly written code then becomes “simply” a refactoring exercise rather than a major break.

The result would be a crop of Unix-like small tools each “doing one thing well”, plus something like pip that orchestrates them, and in addition acts as a container for a whole load of ugly, legacy code to handle the use cases that we only support for compatibility reasons (and would not recommend for new code).

1 Like

If the source does not declare version information then it may be passed along from the front-end to the back-end, but the back-end itself should not consult a VCS; when building, the VCS information (e.g. .git) should in fact be removed for reproducibility.

1 Like

I don’t understand the distinction you’re making between “front end” and “back end” in this context, which is probably because all of my experience of Python packaging is with plain old setuptools and I’m actually much more familiar with C library packaging (using autotools) than with Python packaging.

Having said that, I absolutely agree that builds for public distribution should start by creating an sdist tarball (or moral equivalent) and then everything else should work from that rather than directly from a VCS checkout, and the tooling should make this more natural. I actually spent a chunk of time last week hacking together code to simulate automake’s make distcheck in setuptools and it was Not Fun. I had to monkey-patch two bugs (bpo#3863 and setuptools#1893) on top of all the usual issues with MANIFEST.in and friends. (If you’re curious, see build_and_test_sdist in my CI driver script, and src/setup_hacks.py for the bug workarounds.)
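For what it’s worth, the distcheck-style flow itself is conceptually small, even if today’s tooling fights it; here is a rough sketch, assuming the PyPA build frontend and pytest are available:

```python
import subprocess
import sys
import tarfile
import tempfile
from pathlib import Path

def distcheck(project_dir: str) -> None:
    with tempfile.TemporaryDirectory() as tmp:
        # 1. Build the sdist from the checkout.
        subprocess.run([sys.executable, "-m", "build", "--sdist", "--outdir", tmp, project_dir],
                       check=True)
        sdist = next(Path(tmp).glob("*.tar.gz"))
        # 2. Unpack it into a pristine directory.
        with tarfile.open(sdist) as tf:
            tf.extractall(tmp)
        unpacked = Path(tmp) / sdist.name.removesuffix(".tar.gz")
        # 3. Build a wheel from the unpacked sdist and run the tests there.
        subprocess.run([sys.executable, "-m", "build", "--wheel", str(unpacked)], check=True)
        subprocess.run([sys.executable, "-m", "pytest"], cwd=unpacked, check=True)
```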

1 Like

Right, and that was empowered by PEP 517 and PEP 518.

And I personally would like to try and focus our scoping on things like PEP 517 and 518 that help open up opportunities since as you point out:

Exactly! I think PEP 517/518 have shown that if we design the APIs to focus on the output, we can allow flexibility in tooling on the input (e.g. some people saying “I want my version number from my version control system”; OK, we can design APIs so the back-end tells us the version number in whatever way it wants while the front-end just takes it and packages it up in the wheel).

That’s my personal, over-arching dream here. Everything is in a PEP, with a package on PyPI that implements it. Then we can work on the workflow, having an opinionated view but also flexibility, because we have PEPs specifying all the APIs for communication and other PEPs specifying the resulting output.

The key question is what should the scope be at this point and thus what should we be focusing on next?

2 Likes

Think of setuptools, enscons, etc. as the back-ends and pip as the front-end; the front-end orchestrates making a thing while the back-ends actually produce it (in this case pip orchestrates with setuptools to make a wheel).

My personal feeling is that we should take a break from looking at source builds and focus on the wheel install end of the chain, going from “please install such and such requirement” and taking that through to having the right packages installed in the target environment, but only considering wheels.

There’s a lot of relatively low hanging fruit here:

  • The work you’ve been doing on compatibility tags.
  • Standardising and packaging resolver semantics.
  • Defining “target environments” in a way that covers site-packages, user installs and virtual environments, as well as things like custom --target style locations. This should cover:
    • Allowing upgrades and uninstalls with --target (in pip terms)
    • Allowing frontends to run standalone, not having to be installed in the target Python environment.
  • Standard configuration data (network proxies, for example), maybe even a standard cache format (so we don’t have multiple http caches of chunks of PyPI on the user’s machine).
  • Building an “installer bootstrap” that can be used as a “mini pip” to install other frontends, freeing them from having to deal with self-bootstrapping and vendoring everything.

Basically, while I understand how much the source build problem matters for developer builds, and as developers it’s easy to see that as the priority, I think we should spend some time focusing on the main end user experience, which is basically installing a load of wheels from PyPI.
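As a yardstick for how thin the happy path is (and therefore how much of the work lives in the surrounding pieces listed above), a naive wheel “install” is just unzipping into the target environment. Everything an installer actually has to care about (scripts, data directories, RECORD rewriting, tag checking, uninstall) is omitted from this sketch:

```python
import sysconfig
import zipfile
from pathlib import Path

def install_wheel(wheel_path: str, target: str | None = None) -> None:
    """Naive wheel install: extract the archive into site-packages."""
    dest = Path(target) if target else Path(sysconfig.get_paths()["purelib"])
    with zipfile.ZipFile(wheel_path) as whl:
        # A wheel is a zip whose layout already mirrors site-packages
        # (package dirs plus a *.dist-info directory).
        whl.extractall(dest)
```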

2 Likes

Will the new pip resolver grant cover this?

What about making sure all of the steps are covered in a library somewhere? Otherwise I’m afraid this will all continue to fall on the shoulders of the pip developers (e.g. if someone wanted to tackle the “platyput” tool then having things in libraries would make that much easier for them and take the pressure off pip to be that tool).

I think it’s very likely that we’d put the resolver’s algorithm and (at least) some of the plumbing and helpers around it into a dedicated library that’s reusable.

There are a few abstraction layer designs floating around for this, all designed because the folks who’ve worked on the dependency resolution problem want resolvers that are reusable across multiple tools. It’s likely we’ll settle on one of these abstractions (or pick one and improve/adapt it), and then do the required work to use it in pip.
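To give a flavour of what such an abstraction tends to look like (this is a generic, hypothetical interface, not the API of any of those existing designs): the shared library owns the backtracking search, and each tool supplies the bits that know about its own packages.

```python
from abc import ABC, abstractmethod

class Provider(ABC):
    """What a tool (pip, a lock-file generator, ...) supplies to a shared resolver."""

    @abstractmethod
    def find_candidates(self, requirement):
        """Return candidate versions satisfying a requirement, best first."""

    @abstractmethod
    def get_dependencies(self, candidate):
        """Return the requirements a chosen candidate introduces."""

    @abstractmethod
    def is_satisfied_by(self, requirement, candidate):
        """Say whether a pinned candidate still meets a requirement."""

# The resolver library itself would implement the backtracking search in terms
# of these hooks, so the same algorithm could back pip, a dry-run "lock" tool,
# or anything else that can describe its packages this way.
```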

1 Like

IMO, a strong case against doing specifically this is made in Vendoring Policy - pip documentation v24.0.dev0.