Split tags into a separate package

pip, setuptools and (since yesterday) wheel all depend on packaging. All of these are PyPA projects. Both pip and setuptools vendor in a copy of packaging. Since wheel can be pip installed, the maintainer would rather not vendor it in. This creates a problem for tools that pip install wheel (like virtualenv), since wheel then goes off and installs packaging, which itself installs pyparsing and six: all told, 4 network requests and ~250k of downloads. See the discussion starting here. Some possible solutions, in my personal order of preference from worst to best:

  • vendor packaging into wheel too
  • vendor only packaging.tags into wheel, since that is the only part it needs (see the sketch after this list)
  • expose the vendored packaging.tags from pip
  • move packaging.tags to a separate, pip installable package
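For context, the part of packaging that wheel needs is just the tag computation in packaging.tags. A minimal sketch of that usage, assuming a packaging release recent enough to include tags.sys_tags():

```python
# A minimal sketch of what wheel needs from packaging: the set of tags
# the running interpreter supports, used when computing wheel filenames.
from packaging import tags

# sys_tags() yields the most-preferred tags first,
# e.g. cp38-cp38-manylinux2014_x86_64 on a Linux CPython 3.8
for tag in list(tags.sys_tags())[:3]:
    print(tag)
```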

Thoughts?

Matti

For what it’s worth I vote for this. :+1:

I’d add that it may be worthwhile to make it a namespace package.

Off topic, but I’ve always been slightly bothered by the fact that packaging depends on pyparsing. It is not a common dependency (unlike six), the installation size is significant (twice as large as packaging itself), and the error messages are suboptimal. Would it be a good idea to consider re-implementing the parsing inside packaging with only the standard library?

1 Like

It also seems like six is used exactly once, which could easily be handled without the dependency.

The use of pyparsing seems more legitimate though. That code would be significantly more complex without it, and pyparsing itself doesn’t have any other dependencies. Looks easy to vendor though.
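To give a flavour of why: pyparsing lets the grammar be written declaratively. A simplified sketch in that style (not packaging’s actual grammar, just an illustration):

```python
# A simplified, declarative grammar in the pyparsing style -- not
# packaging's actual PEP 508 grammar, just an illustration of why the
# code would be more complex without it.
from pyparsing import Group, Optional, Suppress, Word, alphanums, delimitedList, oneOf

IDENTIFIER = Word(alphanums + "._-")
VERSION_OP = oneOf("=== == != ~= <= >= < >")  # oneOf matches longest first
VERSION = Word(alphanums + ".*+!-")
VERSION_SPEC = Group(VERSION_OP + VERSION)
EXTRAS = Suppress("[") + Optional(delimitedList(IDENTIFIER)) + Suppress("]")

REQUIREMENT = (
    IDENTIFIER("name")
    + Optional(EXTRAS)("extras")
    + Optional(delimitedList(VERSION_SPEC))("specifier")
)

print(REQUIREMENT.parseString("requests[security]>=2.8.1,!=2.9.*"))
```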

Splitting packaging up into separately installable parts, and limiting its dependencies seems like a reasonable option to me. In essence, we’re saying that packaging is “packaging infrastructure” code, and to make it usable in tools that bootstrap the packaging experience, it can’t use non-stdlib libraries as freely as “normal” code does.

But I will say that the more we do this, the more we’re acknowledging a fundamental limitation of Python packaging: that we don’t (won’t, can’t) “eat our own dogfood”, in the sense that we want to make it possible for people to use packages off PyPI, but we claim that our own code is special and can’t do that. I was mildly uncomfortable with pip having to do that (pip is huge because of everything we vendor), but the case for pip not being able to depend on non-stdlib packages is much stronger than for other tools (the chicken-and-egg issue)¹. Surely any “problem for tools that pip install wheel” is also a problem for tools that pip install requests, or indeed any other large package? That’s a strawman, but I genuinely would like to know what the real issue is here. Is it that virtualenv pre-installs a load of stuff, and that allows people to use (say) pyparsing without remembering to explicitly install it? Or is it the download times (which pip’s cache should mitigate), or something else?

How far do we want to go down the JavaScript route of splitting third-party libraries into tiny pieces?

¹ Stronger, but not absolute - I still think we’d be better with a world where a very basic wheel-only installer came with Python and only used the stdlib, and we used that to bootstrap more capable frontends like pip, opening up the world for more competition in frontends, the way PEP 517 broke setuptools’ monopoly on backends.

1 Like

At the moment virtualenv can’t do that, so we’ll need to implement the ability to install the dependencies of the embedded packages. This is less of an issue overall, but something we should have taken note of before releasing wheel with dependencies.

The more dependencies pip/setuptools/wheel take on, the higher the chance that we provide something out of the box in virtual environments that isn’t provided in system installations. So forgetting to add those dependencies explicitly is definitely a potential issue.

It’s not just download times; it’s also the fact that seed packages (and their dependencies, of which until now there were none) need to always exist in all environments. This means taking on extra resources in a multitude of domains:

  • network overhead (this can be mitigated by the cache, but that doesn’t help on a cold start),
  • installation overhead (there’s more stuff to install now, so there’s more time needed to perform that, creating environments becomes slower),
  • disk space overhead (now every environment needs to pay the disk space cost, which can quickly add up) - this would be less painful if wheel actually used pyparsing… but it seems it doesn’t, so we’re essentially pulling in 64K compressed for no gain.

I agree with @steve.dower that packaging could drop its “six” dependency. As for pyparsing, the ideal solution in my opinion would be the ability to install packaging without that dependency. That would require the dependency to be placed in an extra, and that extra to be installed by default. Right now we don’t have this capability.
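To make the idea concrete, here’s a purely hypothetical metadata sketch; the default-extra mechanism (the default_extras keyword below) does not exist in setuptools or pip today:

```python
# Purely hypothetical sketch: `default_extras` is NOT a real setuptools
# keyword. It only illustrates the missing capability described above.
from setuptools import setup

setup(
    name="packaging",
    install_requires=[],  # no hard dependency on pyparsing any more
    extras_require={
        # the PEP 508 parser dependency, split out into an extra
        "pep508-parser": ["pyparsing>=2.0.2"],
    },
    # hypothetical: extras that installers would include by default,
    # unless the user explicitly opts out somehow
    # default_extras=["pep508-parser"],
)
```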

In the meantime, I chose to vendor the packaging.tags module in wheel, removing packaging as an install dependency. I’ve released wheel v0.35.1 with this change, plus a fix for a broken FreeBSD platform tag.

1 Like

It could be solved with a hand-written recursive-descent parser for PEP 508. I don’t think the grammar is too bad; honestly, the URL-parsing part is more complex than anything else PEP 508 introduces.

Not that I’m specifically volunteering to write said parser. :grin:
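For what it’s worth, here’s a minimal sketch of what such a hand-written recursive-descent parser could look like. It covers only a simplified subset of PEP 508 (name, extras, version specifiers; no URL references or environment markers), and the names are illustrative rather than packaging’s actual API:

```python
# A minimal recursive-descent parser for a simplified subset of PEP 508.
# Illustrative only: real PEP 508 also has URL references and markers.
import re


class Requirement:
    """A parsed requirement; illustrative, not packaging's actual class."""

    def __init__(self, name, extras, specifier):
        self.name = name
        self.extras = extras
        self.specifier = specifier


class _Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def _skip_ws(self):
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1

    def _match(self, pattern):
        """Consume and return a regex match at the cursor, or None."""
        self._skip_ws()
        m = re.compile(pattern).match(self.text, self.pos)
        if m is None:
            return None
        self.pos = m.end()
        return m.group(0)

    def parse(self):
        name = self._match(r"[A-Za-z0-9][A-Za-z0-9._-]*")
        if name is None:
            raise ValueError("expected a project name")
        extras = []
        if self._match(r"\["):
            while True:
                extra = self._match(r"[A-Za-z0-9][A-Za-z0-9._-]*")
                if extra:
                    extras.append(extra)
                if not self._match(r","):
                    break
            if not self._match(r"\]"):
                raise ValueError("expected ']' to close the extras list")
        specifier = []
        while True:
            # longest operators first, so ">=" wins over ">"
            op = self._match(r"===|==|!=|~=|<=|>=|<|>")
            if op is None:
                break
            version = self._match(r"[A-Za-z0-9.*+!-]+")
            if version is None:
                raise ValueError("expected a version after %r" % op)
            specifier.append((op, version))
            if not self._match(r","):
                break
        self._skip_ws()
        if self.pos != len(self.text):
            raise ValueError("trailing junk: %r" % self.text[self.pos:])
        return Requirement(name, extras, specifier)


req = _Parser("requests[security] >= 2.8.1, != 2.9.*").parse()
print(req.name, req.extras, req.specifier)
# requests ['security'] [('>=', '2.8.1'), ('!=', '2.9.*')]
```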

I think this ties into Adding a default extra_require environment.

1 Like

Does this wheel-only installer include the ability to build from source? If pip flipped its default from building from source to not building from source, wheel’s bdist_wheel would not be needed in many cases.

Well, it doesn’t exist, so it can do whatever you want it to do :slightly_smiling_face: But my idea is that building from source should be viewed as the exception (needing a 3rd party package) rather than the rule. And stdlib python only needs to be able to get pre-built wheels from PyPI to bootstrap the ecosystem. The “build everything from source” people would probably argue with this POV, though.

Um, pip doesn’t have a default of building from source. Pip prefers wheels by default…

1 Like

Sorry for not being clear. I meant that pip should, by default, fail to install anything if a binary wheel is not found. This means making --only-binary :all: the default. If a user wants, they can pass another option to get pip to build from source.

I doubt that would ever happen - it’s too big a breakage for no real benefit. And it wouldn’t mean bdist_wheel was needed any less, just that when a pre-built wheel wasn’t available the user would get a different error…

1 Like

different error… but at least a possibly more usable one? Right now, virtually every Windows user who tries to install when there’s no suitable pre-built wheel has to sit through a long delay and then a not very enlightening failure message (since they almost certainly don’t have a compiler, or if they do, it’s almost certainly not set up with the necessary bits). There may well be other ways to improve things here, but I’ve had to explain what this means to multiple dozens of Python newbies.

That’s actually a very good point. The user base has probably changed a lot from the days when pip was first developed (it definitely has - pip was originally only able to build from source, binary distributions weren’t even supported). So maybe “binary only by default” is more acceptable these days.

The major problem, though, would be pure Python packages, which can be built from source by anyone. And there are definitely a non-trivial number of such projects that only distribute sdists. So I still think the breakage would be too significant.

1 Like

Yep, I agree here. The vast majority of packages are still pure Python, and in that case it’s totally understandable that publishers have not released two copies of essentially identical files.

I’ve been encouraging the prefer-binary option, but have only suggested only-binary for installs from private indexes where we know everything has already been converted into a wheel (mostly as a defence-in-depth in case access to the public index leaks in through insecure settings).

It’s far easier for packagers to use any of the free, cross-platform CI systems to build and release their wheels than it is to complain to the pip team to get them to change defaults :wink:

I’m definitely not a fan of adding/vendoring the source file for packaging.tags into other packages where they can diverge (that’s precisely the problem that writing packaging.tags was supposed to solve!). I also don’t think the solution here is to create more packages to be maintained – we don’t have infinite maintainer capacity, so keeping the number of moving parts to a minimum is a good idea IMO.

Honestly, I think the “best” solution is to make packaging not have dependencies. This would involve dropping pyparsing and six.

Both of them are used only in packaging.requirements. pyparsing is used to generate the parser for PEP 508 requirements. six is used for getting urllib.parse.urlparse in a Py2/Py3-compatible way. six should be super easy to drop (PRs welcome!) and ideally, someone would write a hand-written parser for PEP 508 requirements that we’d use in packaging instead of depending on pyparsing – that should solve all the major-ish concerns that sparked most of this discussion.
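For reference, the six removal really is tiny; a sketch of the usual try/except import fallback:

```python
# A minimal sketch of dropping six.moves for urlparse: the usual
# try/except import fallback works on both Python 2 and 3 with no
# third-party dependency.
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2
```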

2 Likes

There you go! Remove dependency on six to make package lighter by hugovk · Pull Request #331 · pypa/packaging · GitHub

1 Like

I remain uncomfortable with the idea that we’re building a package management infrastructure and yet we are actively avoiding using packages ourselves. I understand that there are bootstrapping reasons, but I think we should be looking more closely at what’s missing from the packaging toolset that means we have to take this decision.

People who want to decouple the standard library from core Python should also take note here…

2 Likes

Likewise myself. OTOH, I don’t want us to add more bodges in the meantime (saying that with both my pip maintainer and packaging maintainer hats on). :slight_smile: