If Python started moving more code out of the stdlib and into PyPI packages, what technical mechanisms could packaging use to ease that transition?

To go back to Nathaniel’s original post, another technical problem is dependencies between standard library modules. For example, multiprocessing is used in concurrent.futures, so if users can uninstall or upgrade multiprocessing, concurrent.futures may break.
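
A quick way to see that coupling in practice (on current CPython; the exact import laziness varies a bit between versions):

```python
# Importing the process-pool half of concurrent.futures drags the whole
# multiprocessing package into sys.modules, because ProcessPoolExecutor is
# built on top of it.
import sys

from concurrent.futures import ProcessPoolExecutor  # noqa: F401

print("multiprocessing" in sys.modules)  # expected: True
```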

The same goes for pip: every optional/upgradable part of the stdlib used by pip should be vendored in pip.
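
For concreteness, this is what that vendoring already looks like today (assuming a standard pip installation; importing pip’s internals like this is only for illustration, not supported usage):

```python
# pip ships private copies of its dependencies under the pip._vendor
# namespace, so upgrading or uninstalling the site-packages copies of those
# projects can't break pip itself.
from pip._vendor import packaging

print(packaging.__file__)  # points somewhere under .../pip/_vendor/packaging/
```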

If a module were pulled out of the stdlib, then any dependent module would either go with it or be tweaked to not require it.

Currently, the stdlib directory has higher priority than site-packages on sys.path, and none of our tools are prepared to mutate the stdlib directory, only site-packages. Of course we could change things if we want. But given how they work now, I guess the simplest approach would be:

  • Packages in state (2) or higher go in site-packages, because those are the packages that can be upgraded, and only stuff in site-packages can be upgraded.
  • If we copy Ruby’s distinction between (2) and (3), then we’ll need some way to track which packages are in state (2), and update pip uninstall so that it checks the list and errors out if someone tries to remove one of the packages in (2).
  • And virtualenv and venv will need changes so that when you use --no-site-packages, they’ll prepopulate the new environment with all the packages in (2) and (3).
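
To make the current precedence concrete, here’s a small check you can run; the exact sys.path entries differ per platform and per venv, so treat it as a sketch:

```python
# Print where the stdlib directory and site-packages sit on sys.path.
# On a stock CPython the stdlib entry comes first, which is why nothing in
# site-packages can currently shadow (i.e. upgrade) a stdlib module.
import sys
import sysconfig

stdlib = sysconfig.get_path("stdlib")
site_packages = sysconfig.get_path("purelib")

for i, entry in enumerate(sys.path):
    if entry == stdlib:
        print(f"stdlib        -> sys.path[{i}]")
    elif entry == site_packages:
        print(f"site-packages -> sys.path[{i}]")
```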

I guess it might be possible to allow the stdlib to depend on packages in state (2), since those are defined to always be available. But yeah, probably life would be simpler all around if we follow a rule where we always split out leaf packages first.

It might be OK if pip simply depends on those packages? Each new environment would get bootstrapped with copies of pip and all its dependencies. Then from that point, you have a working package manager, so assuming everything works correctly, your package manager’s regular dependency handling should prevent you from removing those packages (unless you also remove the package manager).
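
For reference, the bootstrap mechanism that exists today is venv plus ensurepip, which copies a bundled pip wheel into each new environment; a minimal sketch (the "demo-env" path is just an example):

```python
# Create an environment and bootstrap pip into it via ensurepip's bundled wheel.
import venv

builder = venv.EnvBuilder(with_pip=True)  # with_pip=True runs ensurepip in the new env
builder.create("demo-env")
```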

Of course, “assuming everything works correctly” might be too strong an assumption :-). We might prefer to vendor dependencies to make it harder for folks to shoot themselves in the foot by stuff like pip uninstall --force, or because we don’t trust pip’s resolver to actually work correctly yet.

This is already effectively the case, but I’d expect the bar we’d apply would be “does this module provide functionality required to install other packages” before moving anything out permanently (carefully phrased such that we don’t get committed to OpenSSL-as-a-public-API, while likely retaining urllib/equivalent as a public API).

That could be a non-trivial performance hit (depending on how many packages are in those categories, and their sizes). At the moment (3.7.3) creating a venv without pip on Windows consists of creating seven files (adding pip to that number bumps it up to 813!!!). With things like on-access virus scanners, creating a venv can be a significant cost, and creating a --without-pip venv a big saving (8 seconds vs 0.5 seconds in some experiments I’ve done). Loading --without-pip venvs up with copies of stdlib modules would be a definite step backward for some use cases I’m working on (see, for example, this PR for pipx).
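
For anyone who wants to reproduce that comparison locally (absolute numbers will vary a lot with hardware and virus-scanner configuration, so treat the output as indicative only):

```python
# Time venv creation with and without pip, and count the files produced.
import pathlib
import subprocess
import sys
import tempfile
import time

for args, label in ((["--without-pip"], "without pip"), ([], "with pip")):
    with tempfile.TemporaryDirectory() as tmp:
        target = pathlib.Path(tmp, "env")
        start = time.perf_counter()
        subprocess.run([sys.executable, "-m", "venv", *args, str(target)], check=True)
        elapsed = time.perf_counter() - start
        files = sum(1 for p in target.rglob("*") if p.is_file())
        print(f"{label}: {elapsed:.2f}s, {files} files")
```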

This is getting to pretty low-level technical detail, though, to the point where it’s largely meaningless to discuss in terms of generalities. I’d need a better idea of which stdlib modules we’re talking about to really analyze the trade-offs.

That does sound like an important technical problem! Is it specifically the number of files that’s an issue? Because if so, then it sounds like we should look into using zipimport for ensurepip, and any future ensureX libraries.

Installing packages in zip form is messy in the general case. But if we only worry about ensureX packages, maybe it’s not so bad. For example, for ensurepip we would install:

site-packages/
    pip.zip
    pip.pth
    pip-VERSION.dist-info/

And the .dist-info dir would have a few more files, including a RECORD that just lists pip.zip and pip.pth. If you later upgrade pip, it reverts back to the normal layout.
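
For illustration, here’s the underlying zipimport mechanism this layout would lean on; the file names are made up for the demo:

```python
# A pure-Python package zipped into a single archive stays importable once
# the archive is on sys.path -- which is what the pip.pth file above would
# arrange for real installs.
import sys
import zipfile

with zipfile.ZipFile("demo_pkg.zip", "w") as zf:
    zf.writestr("demo_pkg/__init__.py", "VERSION = '1.0'\n")

sys.path.insert(0, "demo_pkg.zip")
import demo_pkg

print(demo_pkg.VERSION)  # -> 1.0
```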

For ensuremultiprocessing, it would be similar, but adding _multiprocessing.pyd.

The trick being: since we don’t have to support arbitrary packages, we can be sure that no one is playing games with __file__. And we’re just trying to reduce the number of files rather than necessarily put everything in a single file like eggs do, so it’s OK to keep extensions outside of the zip.

I guess we should actually do this right now for pip (8 seconds is a lot!), but it would become even more urgent if we start moving more stuff into site-packages.

Another possibility: add a default-site-packages directory to the standard sys.path, where all the default-installed packages live. It goes after site-packages. If you upgrade a package, the new version goes in site-packages as usual, so it shadows the version in default-site-packages. For venvs, the parent environment’s default-site-packages gets unconditionally added to the child environment’s path, not copied, and not affected by --system-site-packages. So child venvs revert back to the original version of the default packages, which seems reasonable, it’s free, and then you can upgrade them in the venv if you want.
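
A rough sketch of how that could be emulated today; the directory name is hypothetical, and site.addsitedir appends to sys.path, which gives exactly the “after site-packages” priority described above:

```python
# sitecustomize.py sketch: expose a "default-site-packages" directory at
# lower priority than site-packages, so upgrades installed normally shadow
# the default copies while the defaults stay importable.
import os
import site
import sysconfig

default_dir = os.path.join(sysconfig.get_path("data"), "default-site-packages")
if os.path.isdir(default_dir):
    site.addsitedir(default_dir)  # appended to sys.path, i.e. searched last
```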

I guess the problem with this is that if you try to uninstall in the venv, then you instead just unshadow the original version … Maybe we’d want some kind of tombstone mechanism to prevent that?

To be honest, I don’t really know if it’s about file counts. A lot of it is probably about what overheads virus scanners add (and don’t get me started on that :slight_smile:), and I wouldn’t be surprised if they introspected zipfiles.

The key point here is that one of the (hidden) advantages of core venvs over virtualenv is that venvs are extremely low footprint, because they can use interpreter support to avoid any of the copying that made virtualenv so complex and fragile. Having a multi-level stdlib (and in particular, changing the usage of site-packages to include some portions of the stdlib) will impact, and possibly negate, some of that advantage.

Don’t over-react to the details here, though. That 8 seconds is on a particularly old and overloaded laptop, so it may not be typical. Although conversely, it is easy as Python developers to forget that people do use Python on old, slow, or otherwise far-from-ideal systems, and what may seem to us a “small” impact isn’t always as small as it might be.

The pip issue is a special case, IMO, and needs to be solved at the pip level. Around 80% of pip’s footprint is vendored libraries. That’s not a core Python issue (I have some vaguely-formed ideas, but they aren’t for this thread). What is a problem for this thread is if we force pip to vendor additional “optional stdlib” stuff: that increases pip’s footprint and adds copying to venvs, potentially doubling the impact. But again, this is a completely different issue depending on whether we’re talking about a tiny little 1k-of-code module, or a huge package like tkinter (which pip doesn’t use, BTW…)

Just as a side note: multiprocessing keeps getting used as the example, but I’d be really bothered if there were an actual intention to move multiprocessing out of the stdlib, rather than just using it as a convenient example…

File count is the right metric to worry about on Windows (and most network drives, for that matter). Virus scanners will introspect ZIP files, but provided they’re read only it’ll only happen once and it’s still much quicker than individual files. There are also other per-file overheads. (I’m working with some people to try and get an effort going to improve these across the board, but as they’re all little bits of overhead added by “unrelated” teams, it’s not really anyone’s “fault”…)

I’m a huge fan of this approach to the “dead batteries” problem - having a core Python distribution that still contains all of the batteries but with the ability, like we do today with pip, to upgrade said packages from PyPI.

It would also provide an avenue for distributions to just provide the batteries-not-included version of Python. Many of them already do, and if you’ve not experienced it, it tends to be a surprise when you try to python3 -m venv and it blows up because the ensurepip package wasn’t installed.

Having an explicit CPython+batteries distribution that looks like Python does today, alongside a lighter distribution that expects pip (or equivalent) to be available, would I think be a good thing.

The only problem I can think of is that right now when Python is installed it usually has most of the batteries, so though you may not have 3rd party libraries, at least you can still do an awful lot of useful things if your bureaucracy doesn’t let you install from PyPI. Even poor developers stuck with Python 2.4 (or worse) still have an amazing amount of power available to them. It’s possible that we would be doing future devs a disservice by providing an easy path to remove those useful batteries. :man_shrugging:

To reframe this a bit: this isn’t just about removing “dead batteries”. It’s about giving us more options in general. Dead batteries are one case where this would be useful, but hardly the only one. For example, the ssl module suffers from being stuck on the cpython release cycle; being able to ship security fixes and support for new TLS features on older pythons might be really valuable. Another example: urllib is an active hazard to users, but we can’t put something better like requests in the stdlib because most network library maintainers aren’t willing to lock themselves to the cpython release cycle. Will we actually make ssl installed-by-default-but-upgradable, or add requests to the default set? I dunno, those are complex decisions that we’d have to discuss on a case by case basis. I chose multiprocessing as an example here because I thought it was less likely to derail into discussing package-specific details :slight_smile:

To me the big meta-question is, what options can we add beyond “stdlib-only” and “pypi-only”, and how smooth can we make them for users and maintainers? Once we know that, we’ll be able to make more fine-grained, sensitive tradeoffs for each specific situation.

Two PyPIs? This might, at least, let people decide for themselves how much they trust the Python ecosystem, as a function of their appetite for risk. Those with the least could elect to install only the distributed Python. More flexible users could add the PyPI that served packages for which the distributed Python had skippable tests when the module/package wasn’t present. The third level would be essentially those who trust PyPI enough to allow people to download packages from it.

@njs, I like and agree with your framing of the question. As one of those additional options, I think what you described above, using a default-site-packages directory holding vendored packages that can be shadowed by upgrades, might be a great solution for some cases. If the default version could be made always available with something like from __default__ import foo, it would allow the stdlib to safely use such a package, and third-party packages that want to rely on that version could also import it that way. Python would then only be adopting a known version, while at the same time users and the maintainers of the adopted package would be free to update it safely. For a package where the stdlib might want to accept updates, e.g. if ssl were moved out this way, upgrading the default package could be allowed, with pip checking what the recommended version of it is for the stdlib on that version of Python.
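
A very rough, purely hypothetical sketch of how that spelling could be wired up (both the __default__ name and DEFAULT_DIR are made up for illustration):

```python
# A meta path finder that exposes __default__ as a namespace-style package
# whose search path is the directory of vendored default packages.
import importlib.machinery
import sys

DEFAULT_DIR = "/path/to/default-site-packages"  # made-up location

class DefaultNamespaceFinder:
    def find_spec(self, fullname, path=None, target=None):
        if fullname != "__default__":
            return None
        spec = importlib.machinery.ModuleSpec("__default__", None, is_package=True)
        spec.submodule_search_locations = [DEFAULT_DIR]
        return spec

sys.meta_path.insert(0, DefaultNamespaceFinder())

# After this, "from __default__ import foo" resolves foo against DEFAULT_DIR
# via the regular path-based finders, regardless of what site-packages holds.
```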

I see similar ideas in this thread. Suppose we get rid of the standard library entirely: python --nonstd gets you whatever Python requires to boot. Brutally minimal.

Now distribute CPython with a default, named (virtual?) environment that includes the standard library. When you are running the Python interpreter as an application, rather than as an application runtime, you get that. Useful libraries an import statement away.

Include individually wrapped (packaged) standard library modules. Wheels, long sys.path-style folder-per-module, importlib hooks so that they are not importable by default, whatever. Since the individually wrapped modules are distributed with CPython, they can be added to a new Python environment instantly without any of that pesky internet access.

Start moving individually wrapped modules from the “importable by default” set to the “must be declared explicitly to be importable” set.

As a bonus, could it become possible to remove the “virtual” from virtualenv, because there would be no big, system-specific default environment to overcome?

In this way a library that is a former- or soon-to-be- member of the standard library can be special without being importable by default.

pip could be patched to ignore packages installed on a particular part of sys.path (its own dependencies) so that it can install those same packages into the bare environment.

I’ve started a project called nonstdlib to produce a wheel per stdlib module. Also unusual in that one sdist produces 180+ wheels instead of the usual 1.

Doug Hellmann has written a blog post documenting the dependency graph of ensurepip.
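
For a quick local approximation of that graph (runtime imports only, so it undercounts compared to a proper static analysis):

```python
# See which modules importing ensurepip drags in on this interpreter.
import sys

before = set(sys.modules)
import ensurepip  # noqa: F401

print(sorted(set(sys.modules) - before))
```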