If Python started moving more code out of the stdlib and into PyPI packages, what technical mechanisms could packaging use to ease that transition?

Sometimes people suggest splitting out parts of the current Python stdlib into independent packages. This could have a number of technical advantages (e.g. being able to ship fixes quicker, creating a supported way for users to slim down their environments by removing unused packages… this is important for containers and mobile apps). But it could also be disruptive for users. So it’s controversial.

For example: imagine that multiprocessing gets split out into a standalone package. Whenever a new version of Python is released, we bundle in the latest multiprocessing wheel, and install it by default in the environment. But after that it’s treated like any other third-party package, and can be upgraded, downgraded, or even removed, independently of the interpreter itself. (This is exactly how pip works right now, via the ensurepip mechanism. There’s also prior art in Ruby.)
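To make the bundling idea a bit more concrete, here's a minimal sketch of an ensurepip-style bootstrap. Everything specific in it is an assumption for illustration: the bundled-wheels directory, the idea that pip is already present, and of course the existence of a standalone multiprocessing wheel.

```python
# Minimal sketch of an ensurepip-style bootstrap for a hypothetical bundled
# multiprocessing wheel. The bundle directory and package name are made up.
import subprocess
import sys
from pathlib import Path

BUNDLE_DIR = Path(sys.base_prefix) / "bundled-wheels"   # hypothetical location

def bootstrap_bundled(name):
    """Install the bundled copy of `name` into the current environment."""
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "--no-index",                      # never hit the network at bootstrap time
        "--find-links", str(BUNDLE_DIR),   # only consider the wheels shipped with Python
        name,
    ])

bootstrap_bundled("multiprocessing")
```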

This is an extremely hypothetical discussion. Currently there are absolutely no plans to move multiprocessing out of the stdlib. But, if we did, how could we mitigate the problems this would cause for users?

I see two big ways where this might be disruptive and packaging could help:

Problem 1: if there’s code that assumes multiprocessing is always available, then it will mostly work (because multiprocessing is installed by default), but it will break when run in a “slimmed down” environment. For interactive use, I think this isn’t a big deal – if someone does pip uninstall multiprocessing and then tries to import multiprocessing and gets an error, they can just pip install multiprocessing. But if someone is trying to automatically generate a slimmed-down container or redistributable app, then they need some automated way to figure out which packages to include and which to leave out. Normally we do this via package requires metadata. So, we need some way to move from a world where no-one declares their dependencies on multiprocessing, to a world where packages that use multiprocessing declare that in their metadata.

The good news is this doesn’t have to be perfect – if 99% of packages have the right metadata, then the rest can be fixed by hand based on user feedback. But if 99% of packages have the wrong metadata, then it’ll be a mess.

One idea: setuptools can probably detect with 99% accuracy whether a package uses multiprocessing, just by scanning all the .py files it's bundling up to see if any say import multiprocessing. So maybe setuptools would start doing that by default? Plus there'd be some way to explicitly disable the checking for devs who want full manual control over their dependencies, maybe a shared library to implement the checking so flit and setuptools can share the code, …
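For concreteness, here's a minimal sketch of what that build-time scan could look like in the shared helper library. The FORMERLY_STDLIB registry is a made-up name; a real implementation would also need to map module names onto PyPI requirement strings.

```python
# Sketch of the import-scanning heuristic, assuming a hypothetical registry
# of modules that have been split out of the stdlib.
import ast
from pathlib import Path

FORMERLY_STDLIB = {"multiprocessing"}   # hypothetical registry

def scan_for_formerly_stdlib_imports(package_dir):
    """Return the split-out modules imported by any .py file under package_dir."""
    found = set()
    for py_file in Path(package_dir).rglob("*.py"):
        try:
            tree = ast.parse(py_file.read_text(encoding="utf-8"))
        except SyntaxError:
            continue   # ignore files that aren't valid Python (templates, etc.)
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                found |= {alias.name.split(".")[0] for alias in node.names}
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                found.add(node.module.split(".")[0])
    return found & FORMERLY_STDLIB
```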

I think that would get us pretty far. It takes care of future uploads, and existing sdists on PyPI (assuming you use a new setuptools when you build them). The main flaw I see is that it doesn’t help with existing wheels on PyPI. Maybe PyPI could also do a scan of existing wheels (using that shared library mentioned above), to add in extra requirements for formerly-stdlib packages? Or maybe pip should do that when downloading pre-built wheels? I guess this would argue for having a bit of metadata in the wheel to say explicitly “The creator of this wheel was aware that multiprocessing is no longer part of the stdlib, and took that into account when creating its requires metadata”, so that pip/PyPI would know whether to apply a heuristic or not.
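The "creator of this wheel was aware" bit could be a single extra field in the wheel's METADATA file. The field name below (Stdlib-Split-Aware) is purely hypothetical, not an existing standard; this is just a sketch of how pip or PyPI could decide whether to fall back to the scanning heuristic.

```python
# Sketch of the "should we apply the heuristic?" check, assuming a
# hypothetical "Stdlib-Split-Aware" field in the wheel's METADATA file
# (wheel metadata uses email-header syntax, hence the parser choice).
from email.parser import Parser

def needs_requires_heuristic(metadata_text):
    """True if this wheel predates the split and its requires metadata may be incomplete."""
    metadata = Parser().parsestr(metadata_text)
    return metadata.get("Stdlib-Split-Aware", "").lower() != "true"
```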

Problem 2: Right now, removing things from the stdlib is complicated and super controversial (see the discussions about PEP 594). Some of this is inherent in the problem, but it would help if we had a smoother path for doing this. Having the infrastructure to bundle wheels with Python would already help with this. For example, if the Python core devs wanted to stop maintaining nntplib, they could do a multistep process:

  1. Convert nntplib from a stdlib library → bundled wheel, so it’s still available by default but can be upgraded/removed independently
  2. Use the tools described above to fix up metadata in any packages that require nntplib
  3. Finally, flip the switch so that it’s no longer installed by default, and users have to either declare a dependency or manually pip install nntplib

In a perfect world, I think there would be a step 2.5, where we start issuing deprecation warnings for anyone who imports nntplib but hasn’t declared a dependency. (Since these are the cases that will break in step 3.) This suggests we might want some metadata in nntplib.dist-info to track whether it’s only installed because it’s bundled by default, or whether it’s been pulled in by a manual install or other dependencies? And then when nntplib is imported, issue a DeprecationWarning iff it’s only installed because of being bundled.
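Here's a minimal sketch of what that runtime check could look like, assuming a hypothetical "BUNDLED" marker file dropped into the package's .dist-info directory at install time (not an existing convention), and glossing over how the import hook would actually be wired up.

```python
# Sketch of the step-2.5 deprecation check: warn only when the package is
# present purely because it was bundled by default. The "BUNDLED" marker
# file is a hypothetical convention.
import warnings
from importlib import metadata

def warn_if_only_bundled(package_name):
    try:
        dist = metadata.distribution(package_name)
    except metadata.PackageNotFoundError:
        return  # not installed at all; the import will fail on its own
    # Distribution.read_text returns None if the file isn't in the dist-info dir
    if dist.read_text("BUNDLED") is not None:
        warnings.warn(
            f"{package_name} is only installed because it is bundled with Python; "
            f"declare a dependency on it or install it explicitly",
            DeprecationWarning,
            stacklevel=2,
        )
```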

If anyone else is willing to join me in this thought experiment: what other problems do you see? And what do you think of the solutions sketched above?


Thanks for starting this discussion.

I'd like to bring up one argument for a sufficiently large stdlib: the test suite for the stdlib is a key factor in the overall quality of the interpreter. A smaller stdlib could decrease that quality because fewer code paths of the interpreter would be exercised.


Good #stdexit transition thoughts above!

The conversion of an existing stdlib module into a pre-bundled wheel, such that it could be upgraded independently for a couple of release cycles, sounds great. That way, when it is removed, the previous few Python versions will actually get the latest (potentially more modern) version when someone adds it to their third-party (PyPI, etc.) package requirements list, instead of only an old, non-bundled stdlib version being first in sys.path.

Such a thing also potentially paves the way for some stdlib modules to be handled this way anyway, so that they are included batteries that can also be upgraded in place, rather than having backports installed under a new name for those needing more recent features. (Which opens up its own slew of packaging and versioning questions…)


I think it would actually be better if part of the CI for the interpreter were running the test suites of downstream projects against the newly built interpreter. This would still provide the benefit of testing all of those code paths, while also testing code paths that maybe don’t exist in code you’d target for the standard library but would exist in code that straddles multiple Python versions.

The pyca/cryptography project already does this, and I think it’s generally a good thing for them. The hard parts tend to come in selecting which test suites you’re going to run, because you generally want test suites that don’t have flaky tests and have fairly stable test invocations.
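For illustration, a downstream-testing CI step might look something like the sketch below. The project list, repository URL, pinned tag, and the path to the freshly built interpreter are all assumptions; a real setup would need the curation described above to keep flaky suites out.

```python
# Sketch of a CI step that runs a curated set of downstream test suites
# against a freshly built interpreter. Everything in DOWNSTREAM is made up.
import subprocess

NEW_PYTHON = "./python"   # the interpreter just built in the CPython working tree

DOWNSTREAM = {
    # name: (git URL, pinned tag) -- hypothetical, curated for stable test suites
    "someproject": ("https://example.org/someproject.git", "v1.2.3"),
}

for name, (url, tag) in DOWNSTREAM.items():
    subprocess.check_call(["git", "clone", "--depth", "1", "--branch", tag, url, name])
    subprocess.check_call([NEW_PYTHON, "-m", "pip", "install", "-e", f"./{name}", "pytest"])
    subprocess.check_call([NEW_PYTHON, "-m", "pytest", name])
```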


A post was split to a new topic: If we move packages out of the stdlib, who maintains them?

I don’t think problem 1 is even as bad as you make it sound. If you’re manually slimming down Python, then you’re either in complete control of the code that will be run with it, or you’re asking for problems (and will find them :slight_smile: ). If you’re using a tool, that tool ought to take responsibility for scanning. Setuptools doesn’t have to solve this automatically to make the whole idea feasible.

For problem two, I already have that infrastructure for Windows, so it’s at least half solved :slight_smile:

I did some more reading on how Ruby approached this, and it turns out they actually have 4 different levels, not just 3:

  1. Classic stdlib: code that ships as “part of” the interpreter, no package metadata
  2. What they call “default gems”: treated like 3rd party packages that are installed by default – they have metadata, versions, can be upgraded – BUT there’s a flag set saying that you can’t remove them, and if you try to pip uninstall then it just errors out.
  3. What they call “bundled gems”: ditto, but without the magic flag, so you can uninstall them if you want.
  4. Actual 3rd party packages you get from their version of PyPI

I put these in order because there’s a progression here: code doesn’t jump straight from (1) to (3), it goes through (2) as an intermediate step.
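If Python adopted something like level (2), the "flag set saying that you can't remove them" could again be a marker file that the installer checks before uninstalling. A minimal sketch follows, assuming a hypothetical "DEFAULT" marker inside the .dist-info directory (not an existing convention) and ignoring how pip would actually hook this into its uninstall path.

```python
# Sketch of a level (2)-style uninstall guard, assuming a hypothetical
# "DEFAULT" marker file in the package's .dist-info directory.
from importlib import metadata

def refuse_uninstall_of_default_packages(package_name):
    dist = metadata.distribution(package_name)
    if dist.read_text("DEFAULT") is not None:
        raise SystemExit(
            f"Cannot uninstall {package_name}: it is part of this Python "
            f"installation's default set"
        )
```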

Agreed, and this seems especially sensible for any packages that we’re bundling. We don’t want to ship a package that doesn’t actually work :slight_smile:

That said, I’ve seen a lot of projects that run their tests against Python master, which is exactly what you’re suggesting just done by different people, and it turns out to be kinda difficult. The tests often break because Python intentionally broke them (e.g.), and it takes some time for downstream folks to sort things out to adapt. Right now, for stdlib packages, you just adapt immediately in the same commit. I’m not sure there’s any general solution here. If we go down this road then sometimes there will be some extra work to coordinate, and we’ll just have to figure out strategies to deal with that.

Sure, today. If we did split up multiprocessing as an independent library and tell people that they can simply pip uninstall it and that’s a supported configuration, then it would be nice to make it a bit safer :-).

Also, there’s a very long history of projects that try to do this using just heuristics – it’s a major part of what tools like pyInstaller try to do. They’ve all ended up growing huge piles of special cases. Letting individual projects state authoritatively which modules they do and don’t need seems like a more scalable approach to me.


If you want to keep this discussion on a purely technical basis, then I understand, and I’m happy for this comment to be declared “off topic”. But having multiple levels of “being in the stdlib” opens up the question of what exactly we mean when we say “Python”. At the moment, it’s pretty clear that this means “the interpreter plus the stdlib”, but a graded stdlib changes the situation.

Places where it matters what “Python” means:

  1. Dependency management - “this application just needs a Python environment”
  2. Legally - companies allow the use of “Python” in their systems
  3. Communication - Windows now provides “Python” by default
  4. Embedding - Vim includes Python as a scripting language

There’s also the “branding” question - “Python comes with batteries included”, or “one of the things I like about Python is that you don’t need external libraries to get things done”.

I think we need to be very clear, if we introduce a “graded stdlib”, on what constitutes “Python”, and by implication, what we call the other levels of distribution/stdlib. Because of (2) above, this isn’t a trivial detail; it potentially has fairly significant legal implications (IANAL).

FWIW, I don’t think the basic question is that hard. “Python” is “whatever python.org distributes in its standard installers”. But the more we take out of that set, the more we need names for the “bigger” sets (the ones that match what people currently mean by “Python” - “Python with multiprocessing”, for example).

I think much of the heat in this debate is because people see it more about “changing what Python is” (in a naming sense) than about technical details of distribution and/or support.


Agreed, but I don’t think we need to require it. The first goal is “tree shaking”, where it’s possible to remove what isn’t required, as opposed to “you don’t get it if you don’t ask for it”, which is what people will assume if we make setuptools support a prerequisite (which looked like the path you were heading down).

I’ve seen this exact thing come up at work in a few places (Visual Studio extensions is the big one, when we made a whole lot of “core” functionality optional and required new metadata from all extensions), so I feel like I’m not speaking out of complete ignorance of the process and its intended consequences.

Let’s try to keep this thread focused on understanding the technical trade-offs, rather than jumping straight to the overall discussion? But FWIW I think this is basically right – there will still be some set of modules that the python core team says should be included in python installs by default, vouches for, and ships. And I think if you install from python.org and then type import multiprocessing, that will always work, even if we change the details of how the underlying files are laid out.

Tree-shaking is definitely one goal that could benefit from splitting up the stdlib, but I guess it’s somewhat orthogonal – people can and do write tree-shaking scripts right now, often on the level of individual .py files. It’s pretty easy to rm -rf stdlib/multiprocessing/, you don’t need package metadata to do that :-). The case where you really need package metadata is when you want to be able to upgrade multiprocessing, e.g. to tell you which version you have installed.
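As a tiny illustration of the difference package metadata makes here, using importlib.metadata and assuming a hypothetical split-out multiprocessing distribution:

```python
# With real package metadata, tools can ask which version of the split-out
# module is installed; a plain stdlib directory can't answer that question.
from importlib import metadata

try:
    print("multiprocessing", metadata.version("multiprocessing"))  # hypothetical distribution
except metadata.PackageNotFoundError:
    print("multiprocessing has no package metadata (still a plain stdlib module)")
```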

So why should we care about other packages explicitly declaring their dependencies on multiprocessing? Looking at Ruby’s 4 levels of stdlib inclusion summarized in this post:

  • Level (1), package is just part of the stdlib. Declaring a dependency on bits of the stdlib isn’t even possible.
  • Level (2), package has its own name and version number, but can’t be uninstalled, so there’s always some version available. Declaring a dependency on a level (2) package is possible, but if you don’t, no harm done, it’s there anyway.
  • Level (3), package is installed by default, but can be uninstalled. Declaring a dependency on a level (3) package is desirable, but if you don’t, then the impact is limited to folks who are going out of their way to try to shrink their footprint. But in these days of docker, that’s a lot of people, so it’d be nice if this worked reliably.
  • Level (4), a regular PyPI package. Declaring a dependency on a level (4) package is basically mandatory. Of course, like any missing dependency, people can work around it, but in the meantime it causes widespread breakage.

Looking at the benefits people usually hope to get from splitting up the stdlib, I think you actually get a lot of them already at level (2). But for full benefit we’ll need to be able to eventually move some packages to levels (3) and (4), which means we need a way to do this without causing widespread breakage, which is why we need mechanisms to make dependency declarations more widespread.

That does sound relevant :-). How did you handle it?

Framing the discussion in technical terms is missing the point IMHO. The problem is not really asking people to, at some point, type pip install nntplib (though there are problems with beginner users, students, etc.). The real problem is that the level of trust in PyPI packages is not the same as in stdlib modules (on many levels: of course there are potential, though rare, security issues, but more fundamentally the stdlib promises to do its best not to break APIs, not to ship regressions, to stay compatible with the latest Python version, etc.).

By the way, it’s interesting that you’re choosing multiprocessing as an example, because multiprocessing is one of those packages that got massively better thanks to being promoted and maintained in the stdlib (even though its original inclusion was a trainwreck).

In short, a major version change, new package metadata, end-user warning prompts for packages without it (that are now blocked, one more version later), and a whole lot of reaching out to those extension authors who needed to do some work.

Even then, there was a lot of “let’s ignore the bad feedback and complaints because we genuinely believe this is better”. And considering it took VS from a minimum 10GB install to under 1GB, plenty of people were happy about the related benefits. But not everyone made the connection between install size and extensions popping up new warnings. I’m not sure Python will get such a visible improvement, which may make it harder to survive new issues…

Does anyone know why Ruby added step 2? I get that it adds an extra step to prompt people to add dependency metadata to their projects, but does the lack of uninstall add to that specifically? Depending on how all of this is handled, if you used e.g. pip-tools sync and multiprocessing was uninstalled, then the import error would hit and you would (hopefully) realize you need to add multiprocessing as a dependency. Now if you can’t uninstall it, then (hopefully) you get a message along the lines of “you can’t uninstall this (yet), but make sure to add your dependency now”. So I assume the hope is the “can’t uninstall” message is easier to understand, but that’s an assumption on my part and I’m wondering if we know that’s what occurs in the Ruby community.

I think another way to look at this is the interaction with virtual environments. In what step do we stop making the package available in a venv when --system-site-packages is not specified? I suspect that will be another shift that will catch people off-guard if they have not been paying attention to warning messages that a module is shifting from being included in the stdlib out-of-the-box versus explicitly installed/specified.

I think implicit in any of these discussions is the idea that we wouldn’t be preinstalling any library into Python that doesn’t make those same promises. It would simply be part of the process for vetting a library for inclusion into the “default set”. Specific to any library that we split out, if the Python core team is no longer maintaining it, then part of the vetting of new maintainers would be people we believe would do just that.

I don’t think anyone wants to just start YOLO bundling modules with no track record to back them up on how they’re going to handle stability.

I think even if the Python core continues to maintain a particular module, there are still large benefits to splitting it out and moving it onto PyPI and making it part of the “default set”.

I think that perhaps this is jumping to a conclusion? We’ve never had a library that was installed as part of the “default set”, so it’s impossible to say if multiprocessing would or wouldn’t have had the same experience if it was a library on PyPI that we installed into the default set.

Right, I didn’t mean that. But in light of the more general discussion, I just wanted to point out that multiprocessing is an example of a package that was actually bonified thanks to being in the stdlib.

There are a lot of points to consider. I’m trying to split the problem into pieces, so that later when we do the overall “is it a good idea?” discussion we can do it in a more informed way. There’s also this thread if you want to go deeper on the maintenance issues.

Is this a typo?

I don’t think so: https://en.wiktionary.org/wiki/bonify#English


IMO step 4.

Until then, the package is in the standard library by default and should be in virtual environments as well.

Corresponding to this question, at what point do we move the package from lib to lib/site-packages? If the answer is (3), then having the package still visible under --system-site-packages needs some additional work. If the answer is (4), then uninstalling needs some additional work.

In both cases, external projects (virtualenv and pip respectively, and quite possibly others as well) will need to change, so neither choice is something that can be done without community assistance.
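To make the venv side of this a bit more concrete, here is a sketch of what creating an environment with the "default set" preinstalled could look like, in the same spirit as today's ensurepip bootstrap. The default-set contents and the bundled-wheel location are assumptions, and a real implementation would live in venv/virtualenv rather than in user code.

```python
# Sketch: create a venv and install a hypothetical "default set" from the
# interpreter's bundled wheels, similar in spirit to today's ensurepip step.
import subprocess
import sys
import venv
from pathlib import Path

DEFAULT_SET = ["multiprocessing", "nntplib"]            # hypothetical default set
BUNDLE_DIR = Path(sys.base_prefix) / "bundled-wheels"   # hypothetical location

def create_env_with_default_set(env_dir):
    venv.EnvBuilder(with_pip=True).create(env_dir)
    exe = "Scripts/python.exe" if sys.platform == "win32" else "bin/python"
    env_python = str(Path(env_dir) / exe)
    subprocess.check_call([env_python, "-m", "pip", "install", "--no-index",
                           "--find-links", str(BUNDLE_DIR), *DEFAULT_SET])
```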