If Python started moving more code out of the stdlib and into PyPI packages, what technical mechanisms could packaging use to ease that transition?

Framing the discussion in technical terms is missing the point, IMHO. The problem is not really that people will at some point have to type pip install nntplib (though there are issues for beginner users, students, etc.). The real problem is that the level of trust in PyPI packages is not the same as in stdlib modules, on many levels: of course there are potential, though rare, security issues, but more fundamentally the stdlib promises to do its best not to break APIs, not to ship regressions, and to stay compatible with the latest Python version.

By the way, it’s interesting that you’re choosing multiprocessing as an example, because multiprocessing is one of those packages that got massively better thanks to being promoted and maintained in the stdlib (even though its original inclusion was a trainwreck).

In short: a major version change, new package metadata, end-user warning prompts for packages without that metadata (which were then blocked outright one more version later), and a whole lot of reaching out to those extension authors who needed to do some work.

Even then, there was a lot of “let’s ignore the bad feedback and complaints because we genuinely believe this is better”. And considering it took VS from a minimum 10GB install to under 1GB, plenty of people were happy about the related benefits. But not everyone made the connection between install size and extensions popping up new warnings. I’m not sure Python will get such a visible improvement, which may make it harder to survive new issues…

Does anyone know why Ruby added step 2? I get that it adds an extra step to prompt people to add dependency metadata to their code, but does the lack of uninstall contribute to that specifically? Depending on how all of this is handled, if you used e.g. pip-tools sync and multiprocessing was uninstalled, then the import error would hit and you would (hopefully) realize you need to add multiprocessing as a dependency. Now, if you can’t uninstall it, then (hopefully) the “you can’t uninstall this (yet), but make sure to add your dependency now” message gets the same point across earlier. So I assume the hope is that the “can’t uninstall” message is easier to understand, but that’s an assumption on my part and I’m wondering whether we know that’s what actually happens in the Ruby community.

I think another way to look at this is the interaction with virtual environments. At what step do we stop making the package available in a venv when --system-site-packages is not specified? I suspect that will be another shift that catches people off-guard if they have not been paying attention to warning messages that a module is moving from being included in the stdlib out of the box to being explicitly installed/specified.
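
As a purely illustrative check (nothing here is specific to the proposal), you can already see where a module would come from inside a venv today, which is exactly the kind of answer that changes once a module moves into site-packages:

    import importlib.util
    import sys

    spec = importlib.util.find_spec("multiprocessing")
    print(spec.origin)       # where the module would be loaded from
    print(sys.prefix)        # the venv itself
    print(sys.base_prefix)   # the installation the venv was created from
    # Today the origin lives under base_prefix (the stdlib); once the module
    # lives in site-packages instead, a venv created without
    # --system-site-packages would need it explicitly installed to see it.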

I think implicit in any of these discussions is the idea that we wouldn’t be preinstalling any library into Python that doesn’t make those same promises. It would simply be part of the process for vetting a library for inclusion into the “default set”. Specific to any library that we split out, if the Python core team is no longer maintaining it, then part of the vetting of new maintainers would be finding people we believe would do just that.

I don’t think anyone wants to just start YOLO-bundling modules with no track record to back them up on how they’re going to handle stability.

I think even if the Python core continues to maintain a particular module, there are still large benefits to splitting it out and moving it onto PyPI and making it part of the “default set”.

I think that perhaps this is jumping to a conclusion? We’ve never had a library that was installed as part of the “default set”, so it’s impossible to say if multiprocessing would or wouldn’t have had the same experience if it was a library on PyPI that we installed into the default set.

Right, I didn’t mean that. But in light of the more general discussion, I just wanted to point out that multiprocessing is an example of a package that was actually bonified thanks to being in the stdlib.

There are a lot of points to consider. I’m trying to split the problem into pieces, so that later when we do the overall “is it a good idea?” discussion we can do it in a more informed way. There’s also this thread if you want to go deeper on the maintenance issues.

Is this a typo?

I don’t think so: https://en.wiktionary.org/wiki/bonify#English

IMO step 4.

Until then, the package is in the standard library by default and should be in virtual environments as well.

Corresponding to this question, at what point do we move the package from lib to lib/site-packages? If the answer is (3), then having the package still visible under --system-site-packages needs some additional work. If the answer is (4), then uninstalling needs some additional work.

In both cases, external projects (virtualenv and pip respectively, and quite possibly others as well) will need to change, so neither choice is something that can be done without community assistance.
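
For reference, the two directories in question are what sysconfig reports as the stdlib and purelib paths (the comments below assume a POSIX layout; on Windows they are Lib and Lib\site-packages):

    import sysconfig

    paths = sysconfig.get_paths()
    print(paths["stdlib"])    # e.g. .../lib/python3.7 -- "lib" in the question above
    print(paths["purelib"])   # e.g. .../lib/python3.7/site-packages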

To go back to Nathaniel’s original post, another technical problem is dependencies between standard library modules. For example, multiprocessing is used in concurrent.futures, so if users can uninstall or upgrade multiprocessing, concurrent.futures may break.
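
To make the coupling concrete, the snippet below never imports multiprocessing itself, yet it depends on it, because concurrent.futures’ process pool is implemented on top of it:

    # ProcessPoolExecutor spawns workers via multiprocessing under the hood,
    # so an incompatible multiprocessing upgrade could break this code even
    # though it never mentions multiprocessing.
    from concurrent.futures import ProcessPoolExecutor

    if __name__ == "__main__":  # required when the start method is "spawn"
        with ProcessPoolExecutor(max_workers=2) as pool:
            print(list(pool.map(pow, [2, 3], [10, 10])))  # [1024, 59049]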

Same goes for pip: every optional/upgradable part of the stdlib used by pip should be vendored in pip.

If a module got pulled out of the stdlib then any dependent module will either go with it or be tweaked to not require it.

Currently, the stdlib directory has higher priority than site-packages on sys.path, and none of our tools are prepared to mutate the stdlib directory, just site-packages. Of course we could change things if we want. But given how they work now, I guess the simplest approach would be:

  • Packages in state (2) or higher go in site-packages, because that’s the packages that can be upgraded, and only stuff in site-packages can be upgraded.
  • If we copy Ruby’s distinction between (2) and (3), then we’ll need some way to track which packages are in state (2), and update pip uninstall so that it checks the list and errors out if someone tries to remove one of the packages in (2) (see the sketch just after this list).
  • And virtualenv and venv will need changes so that when you use --no-site-packages, they’ll prepopulate the new environment with all the packages in (2) and (3).
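
For the second point, nothing like this exists today, but the check could be as small as the sketch below (the default-set.json file name, its location, and its format are all made up for illustration):

    import json
    import sys
    import sysconfig
    from pathlib import Path

    # Hypothetical registry written by the interpreter build, mapping package
    # names to the state (2, 3, ...) they are currently in.
    DEFAULT_SET_FILE = Path(sysconfig.get_paths()["stdlib"]) / "default-set.json"

    def check_uninstall_allowed(name: str) -> None:
        if not DEFAULT_SET_FILE.exists():
            return
        state = json.loads(DEFAULT_SET_FILE.read_text()).get(name)
        if state == 2:
            sys.exit(
                f"Cannot uninstall {name!r}: it is part of this Python's "
                "default set and can only be upgraded, not removed (yet)."
            )

    check_uninstall_allowed("multiprocessing")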

I guess it might be possible to allow the stdlib to depend on packages in state (2), since those are defined to always be available. But yeah, probably life would be simpler all around if we follow a rule where we always split out leaf packages first.

It might be OK if pip simply depends on those packages? Each new environment would get bootstrapped with copies of pip and all its dependencies. Then from that point, you have a working package manager, so assuming everything works correctly, your package manager’s regular dependency handling should prevent you from removing those packages (unless you also remove the package manager).

Of course, “assuming everything works correctly” might be too strong an assumption :-). We might prefer to vendor dependencies to make it harder for folks to shoot themselves in the foot by stuff like pip uninstall --force, or because we don’t trust pip’s resolver to actually work correctly yet.

This is already effectively the case, but I’d expect the bar we’d apply would be “does this module provide functionality required to install other packages” before moving anything out permanently (carefully phrased such that we don’t get committed to OpenSSL-as-a-public-API, while likely retaining urllib/equivalent as a public API).

That could be a non-trivial performance hit (depending on how many packages are in those categories, and their sizes). At the moment (3.7.3) creating a venv without pip on Windows consists of creating seven files (adding pip to that number bumps it up to 813!!!). With things like on-access virus scanners, creating a venv can be a significant cost, and creating a --without-pip venv a big saving (8 seconds vs 0.5 seconds in some experiments I’ve done). Loading --without-pip venvs up with copies of stdlib modules would be a definite step backward for some use cases I’m working on (see, for example, this PR for pipx).
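
As a rough way to reproduce that comparison on your own machine (the numbers vary a lot with hardware and virus scanners), you can time venv creation with and without pip and count the files each produces:

    import os
    import time
    import venv

    def create_and_measure(path, with_pip):
        start = time.perf_counter()
        venv.EnvBuilder(with_pip=with_pip).create(path)
        elapsed = time.perf_counter() - start
        files = sum(len(names) for _, _, names in os.walk(path))
        print(f"with_pip={with_pip}: {elapsed:.1f}s, {files} files")

    create_and_measure("venv-without-pip", with_pip=False)
    create_and_measure("venv-with-pip", with_pip=True)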

This is getting to pretty low-level technical detail, though, to the point where it’s largely meaningless to discuss in terms of generalities. I’d need a better idea of which stdlib modules we’re talking about to really analyze the trade-offs.

That does sound like an important technical problem! Is it specifically the number of files that’s an issue? Because if so, then it sounds like we should look into using zipimport for ensurepip, and any future ensureX libraries.

Installing packages in zip form is messy in the general case. But if we only worry about ensureX packages, maybe it’s not so bad. For example, for ensurepip we would install:

site-packages/
    pip.zip
    pip.pth
    pip-VERSION.dist-info/

And the .dist-info dir would have a few more files, including a RECORD that just lists pip.zip and pip.pth. If you later upgrade pip, it reverts back to the normal layout.
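
A minimal sketch of how that layout could be produced, assuming an unpacked pure-Python pip tree sits next to the script (the paths and names are made up; a real ensurepip would presumably work from its bundled wheel):

    import zipfile
    from pathlib import Path

    site_packages = Path("site-packages")  # assumed install target
    pip_source = Path("pip")               # assumed unpacked pip package

    # Zip up the pure-Python code, keeping the "pip/" prefix so that
    # "import pip" resolves from inside the archive via zipimport.
    with zipfile.ZipFile(site_packages / "pip.zip", "w") as zf:
        for file in pip_source.rglob("*.py"):
            zf.write(file, Path("pip") / file.relative_to(pip_source))

    # A .pth line naming an existing path (relative to site-packages) gets
    # appended to sys.path by the site module, so the zip becomes importable.
    (site_packages / "pip.pth").write_text("pip.zip\n")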

For ensuremultiprocessing, it would be similar, but adding _multiprocessing.pyd.

The trick being: since we don’t have to support arbitrary packages, we can be sure that no one is playing games with __file__. And we’re just trying to reduce the number of files rather than necessarily put everything in a single file like eggs do, so it’s OK to keep extensions outside of the zip.

I guess we should actually do this right now for pip (8 seconds is a lot!), but it would become even more urgent if we start moving more stuff into site-packages.

Another possibility: add a default-site-packages directory to the standard sys.path, where all the default-installed packages live. It goes after site-packages. If you upgrade a package, the new version goes in site-packages as usual, so it shadows the version in default-site-packages. For venvs, the parent environment’s default-site-packages gets unconditionally added to the child environment’s path, not copied, and not affected by --system-site-packages. So child venvs revert to the original version of the default packages, which seems reasonable, costs nothing, and you can still upgrade them in the venv if you want.

I guess the problem with this is that if you try to uninstall in the venv, then you instead just unshadow the original version … Maybe we’d want some kind of tombstone mechanism to prevent that?
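
Purely as a thought experiment, a tombstone could be as small as a finder that pip uninstall drops onto sys.meta_path in the venv (nothing like this exists today; the class and its behaviour are made up):

    import sys
    from importlib.abc import MetaPathFinder

    class TombstoneFinder(MetaPathFinder):
        """Hide default-installed packages that were 'uninstalled' here."""

        def __init__(self, names):
            self.names = set(names)

        def find_spec(self, fullname, path=None, target=None):
            top = fullname.partition(".")[0]
            if top in self.names:
                raise ModuleNotFoundError(
                    f"{top!r} was uninstalled in this environment "
                    "(default package tombstoned)"
                )
            return None  # let the normal finders handle everything else

    # Installed ahead of the path-based finder, so the copy still sitting in
    # default-site-packages never gets a chance to load.
    sys.meta_path.insert(0, TombstoneFinder({"multiprocessing"}))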

To be honest, I don’t really know if it’s about file counts. A lot of it is probably about what overheads virus scanners add (and don’t get me started on that :slight_smile:), and I wouldn’t be surprised if they introspected zipfiles.

The key point here is that one of the (hidden) advantages of core venvs over virtualenv is that venvs are extremely low footprint, because they can use interpreter support to avoid any of the copying that made virtualenv so complex and fragile. Having a multi-level stdlib (and in particular, changing the usage of site-packages to include some portions of the stdlib) will impact, and possibly negate, some of that advantage.

Don’t over-react to the details here, though. That 8 seconds is on a particularly old and overloaded laptop, so it may not be typical. Although conversely, it is easy as Python developers to forget that people do use Python on old, slow, or otherwise far-from-ideal systems, and what seems to us like a “small” impact isn’t always so small for them.

The pip issue is a special case, IMO, and needs to be solved at the pip level. Around 80% of pip’s footprint is vendored libraries. That’s not a core Python issue (I have some vaguely-formed ideas, but they aren’t for this thread). What is a problem for this thread is that if we force pip to vendor additional “optional stdlib” stuff, that increases pip’s footprint and adds copying to venvs, potentially doubling the impact. But again, this is a completely different issue depending on whether we’re talking about a tiny little 1k-of-code module or a huge package like tkinter (which pip doesn’t use, BTW…).

Just as a side note: multiprocessing keeps getting used as the example, but I’d be really bothered if there were an actual intention to move multiprocessing out of the stdlib, rather than just using it as a convenient example…

File count is the right metric to worry about on Windows (and most network drives, for that matter). Virus scanners will introspect ZIP files, but provided they’re read only it’ll only happen once and it’s still much quicker than individual files. There are also other per-file overheads. (I’m working with some people to try and get an effort going to improve these across the board, but as they’re all little bits of overhead added by “unrelated” teams, it’s not really anyone’s “fault”…)

I’m a huge fan of this approach to the “dead batteries” problem - having a core Python distribution that still contains all of the batteries but with the ability, like we do today with pip, to upgrade said packages from PyPI.

It would also provide an avenue for distributions to just provide the batteries-not-included version of Python. Many of them already do, and if you’ve not experienced it, it tends to be a surprise when you try to python3 -m venv and it blows up because the ensurepip package wasn’t installed.

Having an explicit CPython+batteries distribution that looks like Python does today, alongside a lighter distribution that expects pip (or equivalent) to be available, would I think be a good thing.

The only problem I can think of is that right now when Python is installed it usually has most of the batteries, so though you may not have 3rd party libraries, at least you can still do an awful lot of useful things if your bureaucracy doesn’t let you install from PyPI. Even poor developers stuck with Python 2.4 (or worse) still have an amazing amount of power available to them. It’s possible that we would be doing future devs a disservice by providing an easy path to remove those useful batteries. :man_shrugging: