Split tags into a separate package

mattip · August 14, 2020, 10:29am

pip, setuptools and (since yesterday) wheel all depend on packaging. All of these are PyPA projects. Both pip and setuptools vendor in a copy of packaging. Since wheel can be pip installed, the maintainer would rather not vendor it in. This creates a problem for tools who pip install wheel (like virtualenv), since wheel then goes off and installs packaging, which itself installs pyparsing and six: all told 4 network requests and ~250k of downloads. See the discussion starting here. Some possible solutions, in my personal order of preference from worst to best:

vendor packaging into wheel too
vendor only packaging.tags into wheel, since that is the only part it needs
expose the vendored packaging.tags from pip
move packaging.tags to a separate, pip-installable package
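
For context, here is a rough, stdlib-only sketch of the kind of information packaging.tags produces. It is not the module's actual algorithm: the function name `sketch_tags` is made up for illustration, and the real `packaging.tags.sys_tags()` yields a longer, carefully ordered list that also covers ABI tags and platform variants such as manylinux.

```python
import sys
import sysconfig


def sketch_tags():
    """Return a few plausible compatibility tags for the running interpreter.

    Hypothetical helper: only the "no ABI" tags are produced, in a
    most-specific-first order.
    """
    major, minor = sys.version_info[:2]
    # Platform strings such as "linux-x86_64" become "linux_x86_64" in tags.
    platform = sysconfig.get_platform().replace("-", "_").replace(".", "_")
    impl = "cp" if sys.implementation.name == "cpython" else "py"
    return [
        "{}{}{}-none-{}".format(impl, major, minor, platform),
        "py{}{}-none-any".format(major, minor),
        "py{}-none-any".format(major),
    ]


for tag in sketch_tags():
    print(tag)
```

The point is only that tag computation needs nothing beyond the standard library, which is why it is a candidate for living apart from the rest of packaging.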

Thoughts?

Matti

bernatgabor · August 14, 2020, 10:32am

For what it’s worth I vote for this.

uranusjr · August 14, 2020, 10:43am

I’d add that it may be worthwhile to make it a namespace package.

Off topic, but I’ve always been slightly bothered by the fact that packaging depends on pyparsing. It is not a common dependency (unlike six), the installation size is significant (twice as large as packaging itself), and the error messages are suboptimal. Would it be a good idea to consider re-implementing the parsing inside packaging with only the standard library?

steve.dower · August 14, 2020, 12:07pm

It also seems like six is used exactly once, which could easily be handled without the dependency.

The use of pyparsing seems more legitimate though. That code would be significantly more complex without it, and pyparsing itself doesn’t have any other dependencies. Looks easy to vendor though.

pf_moore · August 14, 2020, 12:59pm

Splitting packaging up into separately installable parts, and limiting its dependencies seems like a reasonable option to me. In essence, we’re saying that packaging is “packaging infrastructure” code, and to make it usable in tools that bootstrap the packaging experience, it can’t use non-stdlib libraries as freely as “normal” code does.

But I will say that the more we do this, the more we’re acknowledging a fundamental limitation of Python packaging, that we don’t (won’t, can’t) “eat our own dogfood” in the sense that we want to make it possible for people to use packages off PyPI, but we claim that our own code is special, and can’t do that. I was mildly uncomfortable with pip having to do that (pip is huge because of everything we vendor) but the case for pip not being able to depend on non-stdlib packages is much stronger than for other tools (the chicken and egg issue)¹. Surely any "problem for tools who pip install wheel" is also a problem for tools that pip install requests or indeed any other large package? That’s a strawman, but I genuinely would like to know what the real issue is here. Is it that virtualenv pre-installs a load of stuff and that allows people to use (say) pyparsing without remembering to explicitly install it? Or is it the download times (which pip’s cache should mitigate) or something else?

How far do we want to go down the javascript route of splitting 3rd party libraries into tiny pieces?

¹ Stronger, but not absolute - I still think we’d be better with a world where a very basic wheel-only installer came with Python and only used the stdlib, and we used that to bootstrap more capable frontends like pip, opening up the world for more competition in frontends, the way PEP 517 broke setuptools’ monopoly on backends.
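
To make the footnote concrete, a stdlib-only wheel "installer" could start out as small as the sketch below. This is an illustration under heavy assumptions, not a proposal for the real thing: it skips RECORD verification, entry-point scripts, `.data` directories and compatibility-tag checks, and the `demo` package and both function names are invented for the example.

```python
import os
import tempfile
import zipfile


def install_wheel(wheel_path, target_dir):
    """Unpack a wheel into target_dir and return the archive member names."""
    with zipfile.ZipFile(wheel_path) as archive:
        names = archive.namelist()
        # A wheel carries a <name>-<version>.dist-info/WHEEL metadata file.
        if not any(n.endswith(".dist-info/WHEEL") for n in names):
            raise ValueError("not a wheel: missing .dist-info/WHEEL")
        archive.extractall(target_dir)
    return names


def make_demo_wheel(directory):
    """Build a tiny fake wheel so the example is self-contained."""
    path = os.path.join(directory, "demo-1.0-py3-none-any.whl")
    with zipfile.ZipFile(path, "w") as archive:
        archive.writestr("demo.py", "VALUE = 42\n")
        archive.writestr("demo-1.0.dist-info/WHEEL", "Wheel-Version: 1.0\n")
        archive.writestr("demo-1.0.dist-info/METADATA", "Name: demo\nVersion: 1.0\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    wheel = make_demo_wheel(tmp)
    site = os.path.join(tmp, "site-packages")
    installed = install_wheel(wheel, site)
    demo_installed = os.path.exists(os.path.join(site, "demo.py"))
```

Fetching the wheel from an index and choosing a compatible file are the harder parts, and they are deliberately left out here.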

bernatgabor · August 14, 2020, 1:23pm

At the moment virtualenv can’t do that, so we’ll need to implement the ability to install dependencies of the embedded packages. This is less of an issue overall, but something we should have kept note of before releasing wheel with dependencies.

The more dependencies pip/setuptools/wheel take on, the higher the chances we’d provide something out of the box in virtual environments that’s not provided in system installations. So forgetting to add those dependencies explicitly is definitely a potential issue.

It’s not just download times, it’s also the fact that seed packages (and their dependencies, of which until now there were none) need to always exist in all environments. This means taking on extra resources in a multitude of domains:

network overhead (this can be mitigated by the cache, but that doesn’t help on a cold start),
installation overhead (there’s more stuff to install now, so there’s more time needed to perform that, creating environments becomes slower),
disk space overhead (now every environment needs to pay the disk space overhead, which can quickly add up) - this would be less painful if wheel were actually using pyparsing… but it seems it’s not, so we’re essentially pulling in 64K compressed for no gain.

agronholm · August 14, 2020, 1:23pm

I agree with @steve.dower that packaging could drop its “six” dependency. As for pyparsing, the ideal solution in my opinion would be the ability to install packaging without that dependency. That would require that the dependency be placed in extras, and that this extra would be installed by default. Right now we don’t have this capability.
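
As a rough sketch of the first half of that idea, the metadata could look like the fragment below. The extra name `requirements` is invented for illustration, and the dict is only a stand-in for the arguments a `setup()` call would receive. The second half, having the extra installed by default, is the part that has no equivalent today.

```python
# Hypothetical setup() arguments: pyparsing moves from a hard dependency
# to an opt-in extra. The extra name "requirements" is made up.
setup_kwargs = {
    "name": "packaging",
    "install_requires": [],  # nothing mandatory any more
    "extras_require": {
        "requirements": ["pyparsing"],
    },
}

# Users who need requirement parsing would then have to ask for it:
#   pip install "packaging[requirements]"
print(setup_kwargs["extras_require"])
```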

agronholm · August 14, 2020, 7:27pm

In the meantime, I chose to vendor the packaging.tags module in wheel, removing packaging as an install dependency. I’ve released wheel v0.35.1 with this change plus a fix for a broken FreeBSD platform tag.

brettcannon · August 14, 2020, 7:33pm

It could be solved with a hand-written recursive-descent parser for PEP 508. I don’t think the grammar is too bad; honestly the URL parsing bit is way more than what PEP 508 introduces.

Not that I’m specifically volunteering to write said parser.
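
For a sense of scale, a recursive-descent parser for a small subset of PEP 508 might look like the sketch below. It is an assumption-laden illustration, not a drop-in replacement: it handles only a name, optional extras and comma-separated version specifiers, and leaves out environment markers, URLs and parenthesized specifiers. The class and method names are invented.

```python
import re

_NAME = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?")
_SPEC = re.compile(r"(===|==|~=|!=|<=|>=|<|>)\s*([A-Za-z0-9._*+!-]+)")


class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def skip_ws(self):
        while self.pos < len(self.text) and self.text[self.pos] in " \t":
            self.pos += 1

    def match(self, pattern, what):
        self.skip_ws()
        m = pattern.match(self.text, self.pos)
        if m is None:
            raise ValueError("expected {} at position {}".format(what, self.pos))
        self.pos = m.end()
        return m

    def accept(self, char):
        """Consume a single literal character if it comes next."""
        self.skip_ws()
        if self.text.startswith(char, self.pos):
            self.pos += 1
            return True
        return False

    def name(self):
        return self.match(_NAME, "a name").group(0)

    def extras(self):
        # extras := "[" [ name ("," name)* ] "]"
        names = []
        if not self.accept("[") or self.accept("]"):
            return names
        names.append(self.name())
        while self.accept(","):
            names.append(self.name())
        if not self.accept("]"):
            raise ValueError("expected ']' at position {}".format(self.pos))
        return names

    def specifiers(self):
        # specifiers := [ spec ("," spec)* ]
        specs = []
        self.skip_ws()
        if self.pos == len(self.text):
            return specs
        m = self.match(_SPEC, "a version specifier")
        specs.append((m.group(1), m.group(2)))
        while self.accept(","):
            m = self.match(_SPEC, "a version specifier")
            specs.append((m.group(1), m.group(2)))
        return specs

    def requirement(self):
        # requirement := name extras specifiers
        result = (self.name(), self.extras(), self.specifiers())
        self.skip_ws()
        if self.pos != len(self.text):
            raise ValueError("unexpected text at position {}".format(self.pos))
        return result


print(Parser("requests[security,socks]>=2.8.1").requirement())
print(Parser("packaging >= 20.0, < 21").requirement())
```

Each grammar rule maps to one method, which also makes it straightforward to raise an error that names the position where parsing stopped.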

I think this ties into Adding a default extra_require environment.

mattip · August 16, 2020, 10:59am

Does this wheel-only installer include the ability to build from source? If pip flipped its default from building from source to not building from source, wheel’s bdist_wheel would not be needed in many cases.

pf_moore · August 16, 2020, 11:47am

Well, it doesn’t exist, so it can do whatever you want it to do. But my idea is that building from source should be viewed as the exception (needing a 3rd party package) rather than the rule. And stdlib python only needs to be able to get pre-built wheels from PyPI to bootstrap the ecosystem. The “build everything from source” people would probably argue with this POV, though.

Um, pip doesn’t have a default of building from source. Pip prefers wheels by default…

mattip · August 16, 2020, 11:55am

Sorry for not being clear. I meant that, by default, pip should fail to install anything if a binary wheel is not found. This means making --only-binary :all: the default. If a user wants, they can specify another option to get pip to build from source.

pf_moore · August 16, 2020, 12:08pm

I doubt that would ever happen - it’s too big a breakage for no real benefit. And it wouldn’t mean bdist_wheel was needed any less, just that when a pre-built wheel wasn’t available the user would get a different error…

mwichmann · August 18, 2020, 7:06pm

different error… but at least a possibly more usable one? Right now virtually every Windows user who tries to install when there’s not a suitable pre-built wheel has to sit through a long delay and then a not very enlightening failure message (since they almost certainly don’t have a compiler, or if they do it’s almost certainly not set up with necessary bits). There may well be other ways to get better here, but I’ve had to explain what this means to multiple dozens of Python newbies.

pf_moore · August 18, 2020, 8:32pm

That’s actually a very good point. The user base has probably changed a lot from the days when pip was first developed (it definitely has - pip was originally only able to build from source, binary distributions weren’t even supported). So maybe “binary only by default” is more acceptable these days.

The major problem, though, would be pure Python packages, which can be built from source by anyone. And there are definitely a non-trivial number of such projects that only distribute sdists. So I still think the breakage would be too significant.

steve.dower · August 19, 2020, 12:17am

Yep, I agree here. The vast majority of packages are still pure Python, and in that case it’s totally understandable that publishers have not released two copies of essentially identical files.

I’ve been encouraging the prefer-binary option, but have only suggested only-binary for installs from private indexes where we know everything has already been converted into a wheel (mostly as a defence-in-depth in case access to the public index leaks in through insecure settings).
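
The difference between the two options can be sketched as the command lines they produce. The helper below only builds the argument list and does not run pip; its name, the `binary_mode` parameter and the package name `example-package` are placeholders for illustration.

```python
import sys


def pip_install_command(package, binary_mode=None):
    """Build (but do not run) a pip install command line.

    binary_mode: None for pip's default behaviour, "prefer" to favour
    wheels over newer sdists, "only" to refuse to build from source.
    """
    command = [sys.executable, "-m", "pip", "install"]
    if binary_mode == "prefer":
        command.append("--prefer-binary")
    elif binary_mode == "only":
        command += ["--only-binary", ":all:"]
    command.append(package)
    return command


print(pip_install_command("example-package", "prefer"))
print(pip_install_command("example-package", "only"))
```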

It’s far easier for packagers to use any of the free, cross-platform CI systems to build and release their wheels than it is to complain to the pip team to get them to change defaults.

pradyunsg · August 19, 2020, 5:19am

I’m definitely not a fan of adding/vendoring the source file for packaging.tags into other packages where they can diverge (that’s precisely the problem that writing packaging.tags was supposed to solve!). I also don’t think the solution here is to create more packages to be maintained – we don’t have infinite maintainer capacity, so keeping the number of moving parts to a minimum is a good idea IMO.

Honestly, I think the “best” solution is to make packaging not have dependencies. This would involve dropping pyparsing and six.

Both of them are used only in packaging.requirements. pyparsing is used to generate the parser for PEP 508 requirements. six is used for getting urllib.parse.urlparse in a Py2-Py3 compatible way. six should be super easy to drop (PRs welcome!) and ideally, someone would write a hand-written parser for PEP 508 requirements that we’d use in packaging instead of depending on pyparsing – that should solve all the major-ish concerns that sparked most of this discussion.
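
One common way to get `urlparse` on both Python 2 and 3 without six is a guarded import, sketched below. This is a generic pattern and not necessarily the change that would land in packaging; the URL is a placeholder.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2 fallback

# Placeholder URL, purely for illustration.
parsed = urlparse("https://example.com/pkg-1.0.tar.gz")
print(parsed.scheme, parsed.netloc, parsed.path)
```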

hugovk · August 19, 2020, 7:00am

There you go! “Remove dependency on six to make package lighter” (pypa/packaging, Pull Request #331)

pf_moore · August 19, 2020, 7:02am

I remain uncomfortable with the idea that we’re building a package management infrastructure and yet we are actively avoiding using packages ourselves. I understand that there are bootstrapping reasons, but I think we should be looking more closely at what’s missing from the packaging toolset that means we have to take this decision.

People who want to decouple the standard library from core python should also take note here…

pradyunsg · August 19, 2020, 7:06am

Likewise myself. OTOH, I don’t want us to add more bodges in the meantime. (saying that with both my pip maintainer and packaging maintainer hats on)