Python Packaging Strategy Discussion - Part 1

I didn’t want to split hairs about “pessimism”, but it seems the tone of my message came across badly, because it wasn’t pessimism, and certainly wasn’t doom-and-gloom on my part.

While one could say I’m poking at an old wound, I wanted to surface:

… because this is IMO the decision that eventually needs to be made on a language level. Invest resources into fixing some very thorny problems, or explicitly declare them out of scope[1]. I think the world would be able to adapt to either decision, but obviously I’m in favour of doing the effort and achieving some degree of unification (I also don’t doubt that it can be done technologically), so if anything, I’d say I’m verging on optimism. :slight_smile:


  1. Otherwise we continue limping along on volunteer enthusiasm, which – while substantial – is dispersed into various pockets & niches, and is unlikely to organically coalesce around a unified goal by itself. ↩︎

1 Like

I considered distinguishing the various degrees of unification, but just left the target at “maximum unification”, along the lines of the survey comments à la I would blow it all away and replace it with one damn thing – perhaps that was naïve. :sweat_smile:

This is a great list, and will be very helpful I think. I can additionally think of:

  1. Unification on an environment model/workflow (e.g. don’t touch the base python install, put everything into an environment)

I’d also split 3. into “build tooling” (how do I turn this source code into a publishable artefact?) and “publisher-facing tooling” (how do I actually publish that artefact?).

1 Like

We’ve partly “solved” that with Marking Python base environments as “externally managed”, I believe (it’s still at the needs-implementation stage, rather than implemented).
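For context, the mechanism there (PEP 668) is essentially a marker file that installers are expected to check before touching the base environment - roughly like the sketch below; I’m going from memory of the PEP, so treat the exact wording and location as approximate:

    # sketch of an EXTERNALLY-MANAGED marker file, placed by the distributor in
    # the directory given by sysconfig.get_path("stdlib"); installers such as pip
    # are expected to refuse to install into the base environment and show the
    # Error text instead
    [externally-managed]
    Error=This Python is managed by your OS package manager.
     To install Python packages, create a virtual environment first,
     e.g. python3 -m venv ~/.venvs/my-env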

Convincing people that there should be no “base” site-packages (i.e. put everything in a venv, no user-site etc) is… not a fight I wanna pick up myself, even though I do agree it would be nice to have! :slight_smile:

I’m not sure I understand how you mean that solves the standards problem, except perhaps by reducing the number of standards simply through competition, as measured by user numbers?

Because while that may be able to take us from 15 to 4-5 standards, the user base of each of these tools is fairly invested in the respective particularities.

The “last mile” to get to one standard is I think only possible by some centralized action – think how mobile phone manufacturers (and especially Apple) had to be more or less forced[1] to agree on a standard[2].

What I was trying to get at is that right now, a user has to ask themselves the questions: “should I create an environment?” “do I need to?” “how do I do that?” “is it venv, virtualenv, poetry, conda, …?”

And what I meant by unification of that aspect is that these questions should disappear (by being implicitly “always create an environment, or if you have one[3] already, activate that”).
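(Just to make that concrete: the logic such a unified tool would hide is roughly the toy sketch below - the .venv name is only a convention I picked for illustration, not a proposal.)

    # toy sketch of "create an environment if there isn't one, otherwise reuse it";
    # the .venv directory name is just a convention chosen for this illustration
    import subprocess
    import sys
    from pathlib import Path

    venv_dir = Path(".venv")
    if not venv_dir.exists():
        subprocess.check_call([sys.executable, "-m", "venv", str(venv_dir)])

    # from here on, the tool runs everything with the environment's interpreter
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    python = venv_dir / bin_dir / "python"
    subprocess.check_call([str(python), "-m", "pip", "install", "requests"])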


  1. no-one likes to be forced, but I think the situation would benefit a lot from channelling people’s effort into far fewer projects that can still tackle all desired improvements, with much less duplication of effort. ↩︎

  2. Sidenote: USB-C is finally coming to the iPhone! ↩︎

  3. speaking of an environment in such a unified paradigm, not one of the currently existing ones ↩︎

I think R is the language we can learn most from, given how much overlap there is in terms of the issues encountered at https://pypackaging-native.github.io/. Notably, all the R users I’ve encountered (plus what is used by https://carpentries.org/) do not use conda (unlike Python, where conda is used by the Carpentries). I suspect this is down to how R and R packages are distributed/installed on the various OSes:

  • On Windows, the R project seems to provide a build service https://win-builder.r-project.org/, which includes Rtools (Rtools42 for Windows, see Rtools - what, why, when? - General - RStudio Community for more context), which provides both an msys2 build environment (i.e. GCC/GFortran plus other GNU build tools) and common non-R libraries that packages may depend on (I’m not familiar enough with the geospatial stack to know how complete the packaged list is, but there are a few astronomy packages there that are core, and Python has practically won the astronomy ecosystem).
  • On macOS, the project instructs users to install the macOS build tools via xcode-select (which is fairly standard), and provides binaries of the other remaining libraries at https://mac.r-project.org/.
  • On Linux, the choices seem to be to use the distro-provided R ecosystem (and installing on top whatever is missing) where possible, and building from source where that is not an option.
  • On “Unix” (the docs don’t seem to provide any guidance as to which specific OSes are included here, but I’d assume BSDs are in-scope), build from source like Linux.

The majority of R users I know are Windows users (non-astronomers in this case), and I haven’t seen any installation issues there (unlike Python, where I know no-one who has used a non-conda, non-cygwin Python setup successfully).

To me, the biggest takeaway from this is that there needs to be an easy way to get the required compilers on Windows (these are pre-loaded on CI systems, but I personally have no idea what the correct way to do this on a desktop is). There used to be an MSVC installer for Python 2.7; could something like that be created for newer Python versions (or, even better, come with the Python installer as an install option)? I’m not sure if it’s worth pointing users at the Rtools installer, or trying to create our own Python-specific version, but having something which provides a basis on which to build scientific packages from source would, I think, help with non-conda, non-cygwin Python installs (really, non-we’ve-done-the-integration-work-for-you Python installs, which is what system package installers do).

Caveat on the above: I’m not an R user, so the above comes from helping others install/teach/work with R and from reading the docs; I’m likely missing issues and subtleties that someone using R in anger would know.

This is more-or-less what Conda is, but the prevailing opinion (or perhaps just the loudest in this forum?) is that it needs to be torn apart to become more like the current “figure-it-out-yourself” ecosystem.

Worth expanding this survey - perhaps the easiest way is to ask Anaconda, who clearly saw enough value in distributing R packages to start doing it, but will also know why they haven’t seen the same success as they have with Python.

I’m working on this, just as I made the installer for 2.7 happen, but this time I’m trying to work with the compiler team rather than playing chicken :wink: Right now, the easiest way to get the compilers is through CI, which is usually free enough for most open source projects, and supported projects (i.e. anyone backed by a group like NumFOCUS) can easily arrange for more.

<sarcasm begins>Of course, the easier way to do this is to force everyone to switch to the same compiler. If we pick one that we can redistribute on all platforms, it’ll make things even easier for users, as nobody will have to use their system compiler or libraries anymore - they’ll get an entire toolchain as part of CPython that’s only useful for CPython and doesn’t integrate with anyone else’s libraries! (I hope the sarcasm is coming through, but just in case it’s not, this sounds like a massively user-hostile idea. But if you want to try this approach… well… it’s any of the Linux distros or Conda/Nix/etc.)


What I think is really the issue is that we haven’t defined our audiences well, so we keep trying to build solutions that Just Work™ for groups of people who aren’t even trying to do the same thing, let alone using the same workflow or environment/platform. @pradyunsg’s list of possible unifications above is heading in a much more useful direction, IMHO, and the overall move towards smaller, independent libraries sets us up to recombine the functionality in ways that will serve users better, but we still need to properly define who the users are supposed to be.

One concrete example that some of us have been throwing around for years: the difference between a “system integrator” and an “end user”. That is, the person who chooses the set of package versions and assembles an environment, and the person who runs the Python command. They don’t have to be the same person, and they’re pretty clearly totally distinct jobs, but we always seem to act as if end users must act like system integrators (i.e. know how to install compilers, define constraints, resolve conflicts, execute shell commands) whether they want to or not. And then we rail against system integrators who are trying to offer their users a usable experience (e.g. Red Hat, Debian, Anaconda) for not forcing their users to also be system integrators.[1]

But maybe we don’t have to solve that end user problem - maybe we can solve one step further upstream and do things that help the system integrators, and then be even clearer that “normal” users[2] should use someone else’s tools/distro (or we start our own distro, which is generally what system integrators end up doing anyway, because “random binary wheel off the internet” isn’t actually useful to them).

To end with a single, discussion-worthy question, and bearing in mind that we don’t just set the technology but also the culture of Python packaging: should we be trying to make each Python user their own system integrator, to support the existing integrators, or to become the sole integrator ourselves?


  1. If they want to be, they can go ahead and build stuff from source. But that’s totally optional/totally unavoidable, depending on what problem you need to solve. Still, non-integrator users are generally within their rights to say “I can’t solve that problem without this software, can you provide it for me” to their boss/supplier/etc. ↩︎

  2. Those who just want to run Python, and not also be system integrators ↩︎

5 Likes

I’m glad to hear that you’re working on this, as it would be useful. While it is easy to get the compilers in CI, my recent experience of trying to build wheels in CI is that you absolutely have to get things working locally before attempting anything in CI. One of the difficulties I had was just locating the right version of MSVC. I didn’t necessarily need a bespoke “MSVC for Python”, but just a clear “here is the link to the installer file for the one that you want” would have been very helpful. Ideally it should just be the compilers and not the full Visual Studio with many gigabytes of stuff that I don’t want (the MSVC installer GUI seems to be almost deliberately designed to make it difficult to do this). If some “unified build tool” could just download the right version of MSVC as needed then that would be amazing…
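(For what it’s worth, the closest I’ve found to automating the “locate it” part is querying vswhere.exe, which the Visual Studio installer drops in a fixed location. A rough sketch is below; the path and the component ID are what I believe build tools query for, but double-check both.)

    # rough sketch: find the newest VS/Build Tools install that includes the MSVC
    # toolset, via vswhere.exe (shipped alongside the Visual Studio installer)
    import os
    import subprocess

    vswhere = os.path.expandvars(
        r"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe"
    )
    result = subprocess.run(
        [
            vswhere,
            "-latest",
            "-products", "*",  # include Build Tools installs, not just full VS
            "-requires", "Microsoft.VisualStudio.Component.VC.Tools.x86.x64",
            "-property", "installationPath",
        ],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout.strip() or "no suitable MSVC install found")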

Another point of difficulty is that in my case the base of the dependency stack is GMP, which cannot be compiled correctly with MSVC (this is similar to SciPy needing Fortran compilers etc). Being able to standardise on a single toolchain would be great if there were one that handled all cases. I think I could potentially build dependencies with mingw64 (linking against ucrt) and then use MSVC to build the final Cython-based extension module. I just haven’t managed to get that to work yet, though, because mixing toolchains like that is not something that the existing tooling makes easy (neither is using mingw64 in general, but at least I have that part working now).

The Visual Studio installer is what you want: select “Python” and then “Python Native Development”. I can’t make it any more straightforward than that, and even when I tell people that’s how to do it, they insist on making their own instructions for various reasons (e.g. by “needing” to use the Build Tools installer, which doesn’t have any explicit Python options because it doesn’t have any explicit Python features).

MSVC is the many gigabytes of what you need, not Visual Studio. Since none of the tooling or system libraries or headers are in the base OS, you’re going to get all of them. It’s not a small ask, and even in a simplified model, it won’t be a small download. Hence why I’d rather get publishers putting out compatible binaries, so that users don’t have to go through this themselves.

1 Like

I thought this was an interesting approach to install a C compiler more easily / reliably. But I guess it doesn’t work for everything (e.g. C++)?

It is, but it doesn’t work anywhere you have a specific requirement for a specific compiler.

CPython has historically always been about integrating with the underlying system. This is why we don’t mandate a particular compiler (beyond what’s necessary to generate a compatible ABI) - our users will have their needs, and we want to meet them where they are rather than forcing them to rewrite their entire system in order to fit Python into it. This was the point of my sarcastic paragraph above about not forcing the compiler choice.[1]


  1. You could argue that we “force” the choice through what distutils detects, and I’d argue that’s part of what I meant by saying we define the culture as well as the technology. Even distutils (which is now removed) allowed you to override the compiler, but the defaults often win with users who don’t have a strong preference (yet). Post-distutils, if someone were to create a build backend that defaults to a different compiler and it gains popularity, that compiler will eventually win. All the tooling is there; it’s just up to someone to build it and evangelise it until it wins by popularity. ↩︎

To answer your question, I believe we should be supporting Python users. For many users, that means helping them to not be integrators. For some, it means helping them to do a simplified integration job. For all users, it means giving them an easy way of determining how they should get Python in a form that meets their specific needs.

  • For users who want to be their own integrator, give them good building blocks. I see wheels in this role - they let you do the integration at the “Python package” level, without needing to get involved with C ABIs, etc. It’s not a complete solution, and for some users wheels aren’t sufficient (because the ABI-type problems are too complex), but to give an example, a “good enough” numpy wheel suits this type of user just fine.

  • For users who don’t want to be an integrator, we should both support existing integrators, and encourage the emergence of additional integrators where there are gaps in the coverage (for example, Windows users for whom conda isn’t the right solution).

  • I don’t think the “sole integrator” solution is the right approach. If the SC wanted the core team to be totally involved in packaging, then maybe it would be, but not with the current model where core development and packaging are independent.

But the message we’re hearing from users is “there are too many options”. Too many integrators is just another variation of “too many options”. We see this already in instructions on how to install a given library - which typically only cover pip, leaving everything else to “you’re on your own”[1]. And it’s also implicit in the confusion over “how do I get Python?”

So if we do want to support integrators, we need those integrators to work with us on simplifying the end user picture. A better starting page on python.org leading users through the process of deciding how to get Python, and what integrator to choose, if they want one. A common install[2] command - either implemented in core Python with “hooks” for integrators to link to their own implementations, or a defined command like pyinstall which all integrators are required to implement with the same interface, so that the user experience is identical. Or maybe something else - I don’t know what options would work. The PyPA could develop an implementation of pyinstall for wheel-based “user wants to be their own integrator” situations.
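To make the “hooks” idea a little more concrete, here is a purely hypothetical sketch of what a thin pyinstall front-end could look like if integrators registered an implementation via an entry point - the group name, the install() method and the fallback behaviour are all invented for illustration, not an actual proposal:

    # hypothetical "pyinstall" front-end: dispatch to whichever integrator has
    # registered itself for this Python installation (the entry-point group name
    # "pyinstall.integrator" and the install() API are invented for illustration)
    import subprocess
    import sys
    from importlib.metadata import entry_points

    def main(argv=None):
        argv = sys.argv[1:] if argv is None else list(argv)
        hooks = entry_points(group="pyinstall.integrator")  # Python 3.10+ API
        if not hooks:
            # no integrator registered: fall back to the wheel/PyPI route
            return subprocess.call([sys.executable, "-m", "pip", "install", *argv])
        # convention (invented here): at most one integrator registers itself
        integrator = next(iter(hooks)).load()
        return integrator.install(argv)

    if __name__ == "__main__":
        sys.exit(main())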


  1. Or in some cases cover just conda, and leave non-conda users on their own. ↩︎

  2. Possibly more than install - something more like Rust’s cargo is what I get the impression users actually want, but it may be too big of an ask to expect all integrators to buy into a common complete cargo-style interface. ↩︎

3 Likes

There are a few things to learn from R - for example, CRAN has a build farm and its checks on upload ensure higher-quality packages, plus users seem to like the integration of package install/management in the interpreter. Overall R is more limited in scope than Python though, and I’m fairly sure no one here wants to go back to building from source by default on some platforms.

Other languages and their package managers have lessons too. Julia’s package manager is pretty powerful and yields a nice user experience - as numerical/scientific languages go, it’s the state of the art I’d say, rather than R. And Rust/Cargo for overall experience.

Loudest in this forum I’m sure. The “torn apart” isn’t a useful statement to make, unless you meant “if the goal is to standardize on a single do-it-all thing”. Which I don’t think will happen. We need more unified UX, interfaces and concepts, but under the hood we very likely can’t get away from multiple build systems/backends/frontends, multiple environment and package managers, etc.

I like @pf_moore’s answer a lot. The “want to be their own integrator” users are important (and over-represented on this forum), and that should continue to be supported. However, the average user doesn’t want to do this, they want things to work well without too much trouble. So I’d also go for supporting the existing integrators better.

Like @pradyunsg, I also have more thoughts than fit in a Discourse post - will go write a blog post too :)

3 Likes

I’m not in the R world so I don’t know if this still holds, but I was once told that this is actually quite the burden, because if your package doesn’t compile on some esoteric system it gets rejected. I would assume we would have tiers of support where you didn’t have to necessarily work on IRIX or something. :wink:

2 Likes

I think describing it as “needs” rather than “wants” would be better (as usually there are constraints/requirements that motivate this choice). Telling users more explicitly in official documentation that an integrated solution should be their preferred option would be an improvement (as the Carpentries do; how to best do this without putting particular providers offside is a different issue though). The issue is when the chosen integrator does not provide the package you need, and you either need to get the package via PyPI (which even with wheels is not really integrated like the alternatives, and forces the user to become their own integrator or become a packager for their chosen integrator) or you need to switch provider (and so begin the distro wars, angry twitter threads and so forth).

Here’s three examples I’ve personally encountered (I can create a PR to add these to pypackaging-native github doc if people want):

  1. Back in ~2016 a friend came to me (being the unofficial resident expert in the physics department for installing Python-related software) asking me to help her install a quantum physics Python package (I don’t recall which one), but it only provided instructions for macOS and Linux, and she used Windows and did not want to install/learn Linux. I don’t recall all the things I tried (I can’t remember if I got MSVC working, for example), but I know one of the things was trying to install conda and use the provided GCC to work with her existing installation (which I think was the official Python installer). I couldn’t get the package to install correctly, and so pointed her at some resources to get started with Linux. If there had been a semi-official “here’s a set of compilers/build tools which work with the Python installer” (which it sounds like @steve.dower is working on for the MSVC side), maybe we could have got the package installed with her original install.
  2. IRAF (https://iraf-community.github.io/) with its Python bindings pyraf remains a major part of the optical astronomy toolkit. Before the creation of the astroconda channel (which includes a pre-built IRAF/pyraf and mostly solves this for conda users) and the start of “official” community maintenance, STScI provided Ureka[1], which contained a pre-compiled scientific Python stack with an included Python interpreter (Python 2.7). If you needed packages outside of that set, you needed to fiddle around with LD_LIBRARY_PATH because of how Ureka was built, and you naturally couldn’t use a different package manager (e.g. conda, but also Enthought and similar package managers targeted at science users).[2] It would not surprise me if there are other packages out there that do a similar thing to Ureka and effectively force building everything else from source because they haven’t built upon more standard tooling (e.g. conda).
  3. The issues with MPI are covered at Depending on packages for which an ABI matters - pypackaging-native, but h5py also has the issue of needing to deal with curl. HDF5 provides an option to read files on S3, but as HDF5 is a C library, it naturally needs to use some other library (in this case curl) to handle HTTP and TLS. As the h5py maintainers (of which I’m one) naturally don’t want to have to keep track of security issues in curl (and any of its dependencies), the wheels uploaded to PyPI don’t include S3 support (but integrators like conda-forge do include it), so pip users who want S3 support are pointed at the “build from source” docs, or try to do things via fsspec (with occasional deadlocks and other weirdness; there’s a minimal sketch of that route below, after the footnotes). If users were aware that using pip does imply they’re implicitly choosing to be their own integrator, we’d likely see fewer issues there.

  1. Most of the pages about this are gone, but Ureka! – A new easy-to-install IRAF + PyRAF should provide enough context. ↩︎

  2. One big issue that occurred as part of the movement to Ureka was the disappearance of documentation of how to build pyraf from source, given it was not on PyPI and so you needed to scrabble around on old pages to even find the source. ↩︎
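For anyone curious, the fsspec route mentioned in the third example looks roughly like the sketch below - it assumes s3fs is installed and an h5py recent enough to accept file-like objects, and the bucket/file names are placeholders:

    # minimal sketch: read an HDF5 file from S3 via fsspec/s3fs instead of
    # HDF5's built-in (curl-based) S3 driver; bucket and key are placeholders
    import h5py
    import s3fs

    fs = s3fs.S3FileSystem(anon=True)
    with fs.open("some-bucket/some-file.h5", "rb") as f:
        with h5py.File(f, "r") as h5:
            print(list(h5.keys()))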

3 Likes

I’d assume so as well. If the baseline was something like 64-bit Windows, x86-64 macOS, and x86-64/aarch64 Linux (aarch64 also being a proxy for arm64 macOS, given that arm64 macOS is too tricky to run as a CI service still), that would have value I think. And as a check perhaps just “it builds and imports”, because a single flaky test shouldn’t prevent uploads.

The details here are a bit of a distraction from the main purpose of this thread, but it’s probably safe to say that if we’re aiming for a better experience for end users, having some minimal form of checking that packages work would be useful in principle.

Thanks @aragilar! I think the “find the right compilers” and IRAF topics are docs and too-much-choice issues, and I’ve tried to keep those out of the pypackaging-native content. Also important, but easier to solve in principle than the bigger design issues. The h5py one would be great to describe though. I’ve opened add content about h5py and its issues with MPI, HDF5 and curl? · Issue #23 · pypackaging-native/pypackaging-native · GitHub for this.

Both “needs” and “wants” apply. Some people cannot use the various integrators, as you say. Some just don’t want to - maybe they don’t like the UI of the integrator, or they prefer more up to date versions of some packages that the integrator is lagging on, or some other “preference”, rather than “need” reason. Or maybe they do just want to have the extra flexibility.

My personal reasons are “wants” not “needs”. Which is why I chose that word. But yeah, both apply.

One other aspect of this. Unification of interface would let us modify the PyPI header that currently says “pip install X” as the install command for packages, to be generic. And provide advice for projects on how to write their “how to install” instructions in a generic form.

Conversely, if we don’t unify interfaces, how do we address this aspect of what users see as “too many options” - assuming we want to modify the current approach which implicitly prefers the pip/PyPI/wheel toolset over other integrators?

1 Like

FWIW, all versions of MSVC for the last eight years are compatible with all versions of CPython since 3.5.[1] Which I get is a bit confusing for people, but it does at least mean we don’t have to constantly update version numbers in docs every month.

The problem is that package developers don’t necessarily make their code work with MSVC in the first place, or they’re dependent on libraries that haven’t done it, or even on OS functionality that doesn’t exist at all on Windows, in which case the package fundamentally doesn’t make sense to port! So even if the user manages to get the tooling, they can’t make the package work, because the developer never did.

The only thing likely to change on the MSVC side is a way for tools to be able to download and use a copy of the compiler. But we’re still talking 2GB+ downloads, which means virtually nobody should ever do this, and certainly not without getting the user to agree. Windows is designed around distributing binaries, and at this stage nothing much is going to change that - we’re far better off trying to fit into that than to resist it.

This is basically what everyone’s CI systems do, and it doesn’t actually make things any simpler.

Maybe when we get some build backends that are able to download non-Python dependencies as part of the sdist->wheel build, it’ll be feasible to use one of the existing CI systems to just -m build from sdists in a clean environment, but that’s going to be a complete build rewrite for many projects. We’ve got a lot of “just make it work” culture to unwind first, or alternatively, a new community of external build scripts for existing packages that ensure they all build in the same environment (e.g. all the distros).

(FTR, I agree with everything you’ve said in response to my earlier post, which is why I haven’t quoted any of it :wink: )

For this one, I think it’s fine for pip install to be the unified/generic command shown on PyPI, because it is how you get packages from PyPI. What’s missing is those cases where a user in a particular environment should not be getting certain packages from PyPI, but from their distributor.

I can imagine some kind of data file in the base Python install listing packages (or referencing a URL listing packages) that pip should warn before installing, and probably messages to display telling the user what they should prefer. Distributors could bundle this file in with their base Python package so that anyone using it gets redirected to the right source.[2] A similar kind of idea is going into the sysconfig work to let Linux distros specify their own settings, and I’m sure it generalises to packages, too.
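Purely as an illustration of the shape such a file could take (nothing like this exists today - the file name, section and keys are all invented):

    # hypothetical "redirected-packages" data file a distributor could ship with
    # their base Python package; every name and key here is invented
    [redirects]
    numpy = "Install this with 'conda install numpy' in this environment."
    gdal = "Install this with 'conda install gdal' in this environment."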


  1. The C++ library, however, is not. So when extensions use C++ (CPython does not), they may end up with conflicting versions that generally shouldn’t be an issue, but could potentially lead to shared state issues. But this is fundamentally the same issue as trying to share versions of libpng or any other shared library, and is really the responsibility of the system integrator to solve. ↩︎

  2. Though they could do this today by publishing their own index of wheels and setting a default index-url value in the site’s pip.ini/pip.conf. There’s a few gaps in this, but fundamentally it’s straightforward and totally viable (and sufficient for my scenarios where I want to block PyPI entirely, just not for when you’re merely augmenting it). ↩︎
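(And for reference, the per-site index redirect from that footnote really is just a couple of lines in pip.conf / pip.ini - the URL below is a placeholder:)

    # site-wide pip.conf (pip.ini on Windows): point pip at the distributor's own
    # wheel index instead of PyPI; the URL is a placeholder
    [global]
    index-url = https://wheels.example-distro.org/simple/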

This is great news to hear as a definitive statement… and something that really hasn’t made it out there. I relatively frequently come across blogs that have finally solved how to build a certain extension, and they invariably imply - or outright say - that ONLY 2017/15.x (for example) will work, don’t you dare try the thing you might actually have installed on your system - or worse, the corporate-approved/licensed version if you’re in that situation.

That sounds like you’re advocating mixing package sources? So pip install requests would be a correct install command to use, even if you’re otherwise using conda (for geospatial libraries, for example). And what about something closer to the grey area? Is it OK for the canonical instructions for how to install numpy (the ones people looking for numpy and ending up on PyPI would see) to be pip install numpy?

I’m absolutely fine with that, for the record, but it implies a much greater degree of integration between pip and conda than we’ve traditionally had. The same would be true of pip and Linux distros - our message for a long time has been that pip install X is absolutely the wrong way to install Python packages into your Linux system Python.

Maybe that’s what you’re suggesting with the “data file in the base Python” comment? So users would go to PyPI, find instructions to do pip install X, try that and get told “nope, you should use apt-get install python-X instead”? That doesn’t seem like it’s going to reduce user confusion, I’m afraid… (It’s OK for experienced users, but those are the sort of people for whom the current “be your own integrator” model is acceptable).