Drawing a line to the scope of Python packaging

This was my point: having a need for things to be “carefully built” implies room for error and confusion when people are either careless or just not knowledgeable. The fact that it doesn’t really bite people is more a testament to the expertise of the people building the most common packages than proof that it just isn’t a problem.

So, should pip then also recognize and use apt/rpm/nix/apk/homebrew/… where appropriate? There’s special metadata that makes conda envs pretty recognizable. Is the same true for all the others? If conda is directly installing wheels, then is it recursively calling pip to figure out dependencies?

I said it was not reasonable for pip to understand native metadata, but rather that pip should have an extension point where native package managers would register their own helpers for keeping their native metadata in line with what pip has changed.

You’re assuming that this has to be part of the actual solve. I don’t think it does (at least not in the early phases of the solve). Use tools like “whatprovides” to match file names to package names.
Resolve package names, versions, and dependency relationships next. Maybe add in readily identifiable things like the C++ ABI and perhaps better-understood things like a glibc baseline requirement (which platform tags sort of capture right now, but which I’m arguing would be better specified directly). Examining all symbols might be necessary to totally ensure compatibility or explain incompatibility, but it’s probably not.
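
As a rough sketch of that “whatprovides” step (assuming an RPM-based system; the library name below is only an example):

    import subprocess

    def rpm_whatprovides(capability):
        # Ask rpm which installed package provides a capability,
        # e.g. the soname capability "libz.so.1()(64bit)".
        result = subprocess.run(
            ["rpm", "-q", "--whatprovides", capability],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            return []  # nothing installed provides it
        return result.stdout.split()

dnf, zypper, and apt-file offer similar lookups; the point is only that the file-to-package mapping can be answered by the native tooling rather than inside the solver itself.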

While I appreciate the sentiment of getting wheels that work better with conda, the real goal should be getting wheels that work better with arbitrary external package managers. Can we do so in a way that does not explode the number of wheels necessary to cover the space? How much can we minimize additional work for package maintainers (and/or CI/CD services)? People want to solve this problem; the TensorFlow and PyTorch teams, for example, have really chafed at manylinux1 as a platform tag. Perhaps figuring this out is a better use of effort than coming up with new platform tags, which seems to be the current thrust of effort from a few sides (including your own). Would platform tags still be necessary at all? Perhaps only for different operating systems (macOS, Linux, Windows) and CPU types. I understand that you personally do not want to go down this rabbit hole, but why push back against other people exploring the space?

I don’t think I understand what this is suggesting, and perhaps there is some confusion? There is nothing inherent in wheels that requires static linking or the like. That’s just the only way we currently have to satisfy non-Python package dependencies. Generally I think pip is not going to become a general package manager, so those sorts of things are out of scope. It’s possible it could get some sort of plugin system for system-level package managers to provide dependencies that aren’t Python packages. It’s also possible we can just define more platforms for people; e.g. a conda platform could exist where wheels can depend on stuff that is available in conda, or a Debian, Ubuntu, RHEL, etc. platform.

What this is suggesting is that in order to build a plugin system where external package providers could be used, there should be a unifying description of what dependency is needed. Defining more platforms is certainly a way to do that, but I don’t think it’s a very good one. For conda in particular, you’d be assuming that all conda libraries belong to the same platform. On Linux, we have at least two: the old “free” channel and the new “main” channel. Channels generally represent collections that are built with the same toolchain and are compatible.

A better, more explicit description of which external dependency is needed is what I’m getting at. A platform tag rolls too much information up into one value. Instead of having many wheels that express dependencies on specific package systems, I propose having far fewer wheels that express dependencies on specific external libraries. It is then up to external plugins to determine how, and whether, they can satisfy the need for those libraries.

So, instead of external dependencies that look like:

conda:main:xyz>=1.2.11
conda:conda-forge:xyz>=1.2.11
rhel:6:libxyz
rhel:7:xyz
...

where undefined platforms are probably not supported at all, we could instead have:

libxyz.so.1, from reference project xyz, version 1.2.3, requiring a minimum glibc of 2.12, with C++ ABI 9

and then it’s up to conda, or yum, or apt, or whatever to say “oh, I have that in this package. Let me install it and help make sure that your python extension can find it” or “that’s not compatible with my glibc, I need to tell the user to try to find another source for this package.”

What’s the minimum amount of metadata that we can use to completely specify a library? I don’t know exactly; at minimum, anything that imposes a version constraint on some library provided by the external provider. I think what I posted above is a decent start on Linux. For macOS, it may really be that platform tags as they stand are a good enough representation of the binary compatibility and macOS version required at runtime. On Windows, Python has been matched completely to particular VS versions. The new VS is much, much better in this regard, but it would still be nice to have libraries represent their minimal runtime requirements.
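
To make that concrete, here is a minimal sketch of what such a descriptor and a provider hook could look like. Every name in it is hypothetical, not an existing pip or conda API:

    from dataclasses import dataclass
    from typing import Optional, Protocol

    @dataclass
    class ExternalLibrary:
        # Roughly the fields proposed above for Linux.
        soname: str                        # e.g. "libxyz.so.1"
        project: str                       # reference project, e.g. "xyz"
        min_version: str                   # e.g. "1.2.3"
        min_glibc: Optional[str] = None    # e.g. "2.12"
        cxx_abi: Optional[int] = None      # e.g. 9

    class ExternalProvider(Protocol):
        # What a conda/apt/yum/... plugin registered with the installer
        # might implement.
        def can_satisfy(self, dep: ExternalLibrary) -> bool: ...
        def install(self, dep: ExternalLibrary) -> None: ...

    dep = ExternalLibrary("libxyz.so.1", "xyz", "1.2.3",
                          min_glibc="2.12", cxx_abi=9)

The installer would hand each ExternalLibrary to whichever registered provider claims it can satisfy it, and fail with a clear message if none can.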

My suggestion is: if conda is able to get feature parity with pip install/pip uninstall, then it would make sense for pip to simply refer people to conda instead of going around and mangling data. In this approach, the actual change to pip would just be to (a) recognize conda environments, and (b) add a few lines of code to print “hey, this is a conda env, you should run conda <whatever>, it works better”.
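
Detecting a conda environment is cheap. A sketch of check (a), relying on the conda-meta directory that conda creates in every environment (and the CONDA_PREFIX variable set on activation):

    import os
    import sys

    def is_conda_environment() -> bool:
        # conda keeps per-environment metadata in <prefix>/conda-meta;
        # an activated environment also exports CONDA_PREFIX.
        return (
            os.path.isdir(os.path.join(sys.prefix, "conda-meta"))
            or os.environ.get("CONDA_PREFIX") == sys.prefix
        )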

Pip’s dependency management is all based on documented standards, and large parts of it are already available as standalone Python libraries. So in my vision, you’re not calling out to pip to resolve dependencies; you’re natively parsing the wheel metadata and then feeding it into your existing dependency resolver.
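
For example, a sketch of reading a wheel’s dependency metadata without involving pip at all, using the standard library plus the packaging project:

    import zipfile
    from email.parser import Parser
    from packaging.requirements import Requirement

    def wheel_requirements(wheel_path):
        # A wheel is a zip archive; its METADATA file lives in *.dist-info/.
        with zipfile.ZipFile(wheel_path) as whl:
            name = next(n for n in whl.namelist()
                        if n.endswith(".dist-info/METADATA"))
            metadata = Parser().parsestr(whl.read(name).decode("utf-8"))
        return [Requirement(r) for r in metadata.get_all("Requires-Dist") or []]

The resulting Requirement objects carry the names, version specifiers, and environment markers a native resolver would need.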

I don’t really know what “keeping their native metadata in line with what pip has changed” means, concretely. For all the package managers I know, this would just be “uh… this system is irrecoverably screwed, I guess we can make a note of that?”. Conda cares about interoperating with pip about two orders of magnitude more than any of these other systems do.

Why? What problems is this trying to solve, and why do you believe that it will solve them?

Like, I get it. The dream that we could just feed the right metadata into the right magical package management system and have everything work together seamlessly sounds amazing!

But… the reason manylinux doesn’t let you rely on the system OpenSSL is that popular Linux vendors have genuine disagreements about the OpenSSL ABI, and C++ ABIs, and all this stuff. That’s the hard fact that puts some pretty strict limits on how close that dream can come to reality. The only ways to make binaries that work across systems are (a) shipping your own libraries or (b) building lots of vendor-specific wheels. It seems like the best you can hope for from this super complex research project is that instead of needing 8 different vendor-specific wheels, you manage to merge some together and get 5 different vendor-specific wheels instead. Or… you could use manylinux, ship one wheel, and be done with it.

Conda-specific wheels are interesting because there’s the additional possibility that conda could seamlessly track mixed wheel/conda installs, upgrade both together, etc., and because there’s a lot of demand from conda users for this. Other vendors could do this too, of course, but in practice I don’t think anyone else is interested.

I just don’t want you all to go off down the rabbit hole chasing a dream that sounds fabulous but is ultimately impossible. It’s really easy to waste a lot of time and resources on that kind of thing.

I don’t really like special-casing conda this way, but if that’s the best approach here, I certainly think conda can and should add the PyPI metadata handling that you describe. The old issues of arbitrary code execution with setup.py from source-only distributions are still there, but wheels should be great (barring any incorrect dependency expressions).

For us on the conda side, the issue is really the bypass of the solver, which causes problems down the line. I think your proposed solution is fine for us.

For others who feel they must publish wheels in order to provide their software, there’s a lot of frustration with manylinux and its lack of support for modern C++. One of my main hopes is to get metadata that directly represents the ways in which a given wheel is not compatible with a particular system. manylinux is a great one-stop shop when people actually follow the standard. There are many use cases (those around CUDA especially come to mind) that absolutely can’t follow the current manylinux options, either for technical reasons or for legal ones, but that isn’t stopping them from presenting their software on PyPI as manylinux. For the technical ones (generally, support for newer C++ standards), maybe the answer is newer manylinux images. Your “perennial manylinux” idea comes closer to lowering the bar for new manylinux tags, but matching only on glibc seems to ignore the C++ concerns (or at least not address them directly), and it may still be a problem for more demanding software that needs a newer glibc than the manylinux team is prepared to standardize.
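
As one example of compatibility metadata that can be checked directly rather than inferred from a tag, the glibc baseline a system actually provides is easy to query. A sketch (glibc-only; I believe pip’s own manylinux check does something along these lines):

    import ctypes

    def glibc_version() -> str:
        # Ask the process's C library for its version string, e.g. "2.12".
        # Raises AttributeError on non-glibc systems (musl, etc.).
        libc = ctypes.CDLL(None)
        libc.gnu_get_libc_version.restype = ctypes.c_char_p
        return libc.gnu_get_libc_version().decode("ascii")

    def satisfies_min_glibc(required: str) -> bool:
        have = tuple(int(p) for p in glibc_version().split("."))
        need = tuple(int(p) for p in required.split("."))
        return have >= need

A similarly direct check for the C++ ABI in use is exactly the part that is missing today, which is why the C++ concern keeps coming up.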

The thing I don’t like about this idea is that it imposes more work on the packager and bifurcates the build process. What fraction of the scientific wheel ecosystem uses static linking to satisfy its compiled-library needs? The analogous conda-using packages would need to be built in a totally separate way to use conda-provided shared libraries instead. Is that worth people’s time? Maybe from the package consumer’s standpoint, but I don’t want to add that to the already often onerous task of packaging.

The detail I missed here is that you would probably use completely different toolchains to compile for conda than you would for manylinux. You might use very different compiler flags, too. It might make more sense to take an existing conda package and turn it into a wheel than to try to impose new builds on current wheel builders. That opens questions about who does the build, but those are social questions, not technical ones for this discussion.

Regarding the publishing of “not really manylinux” wheels as manylinux on PyPI: when the tag was initially rolled out, it was assumed that folks would be good citizens and not lie.

Since that’s not the case, there’s a feature for PyPI under discussion on the issue tracker to disallow uploading such wheels. To be clear, such wheels are definitely something we want to disallow. IIUC, the current blocker is someone implementing this and figuring out the timelines.

https://github.com/pypa/warehouse/issues/5420

People discussed external dependencies during the packaging minisummit at PyCon North America in May 2019. Moving notes by @btskinn and @crwilcox here for easier searchability.

@msarahan championed this discussion:

Expression of dependencies on externally provided software (things that pip does not/should not provide). Metadata encompassing binary compatibility that may be required for these expressions.

[note from @tgamblin: "FWIW, Spack supports external dependencies, and has a fairly lightweight way for users to specify them (spec + path)

https://spack.readthedocs.io/en/latest/build_settings.html#external-packages

We do not auto-detect them (yet). We’ll likely tackle auto-detecting build dependencies (binaries) before we tackle libraries, where ABI issues come into play."]

What metadata can we add to increase understanding of compatibility?

a. Make metadata more direct
b. Check for libs rather than just provide them
c. manylinux1 does some things that move toward this
d. We don’t really consider different workflows separately. Could improve docs, guiding users to other tools, rather than defaulting to one tool that isn’t the right fit.
e. Can we design standards/interchange formats for interoperating between the tools? Should PyPA provide a standard?
f. A key goal is to avoid lock-in to PyPA tools
g. Tools for people who don’t want to rely on an external package manager for provisioning Pythons

  • I.e., yum, apt, brew, conda

h. Need to ensure appropriate bounds are placed on pip’s scope

Draft PEP @ncoghlan

  • Challenge is expressing dependencies in a way that exposes actual runtime dependencies

    • Particular dependencies of interest are (1) commands on PATH and (2) dynamic libraries on, e.g., LD_LIBRARY_PATH
  • For practical usage, automatic generation is likely needed → usually not human-readable

  • Developers don’t explicitly know these.

  • auditwheel may generate these?

pip as user of the metadata spec

  • Once the metadata spec is written, pip could likely check for these dependencies and fail if they’re not met (see the sketch after this list)

  • It’s unlikely pip could be made to satisfy (search for and install) any missing dependencies

  • Where do we get the mapping(s) from dependency names ↔ platform-localized package names? → step 1: a spec language
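
A sketch of the “check and fail clearly” idea above, for the two dependency kinds mentioned earlier (commands on PATH and shared libraries), using only the standard library:

    import ctypes.util
    import shutil

    def missing_external_deps(commands=(), libraries=()):
        # Report which required commands and shared libraries are absent,
        # so an installer can fail with a clear message up front instead
        # of a cryptic ImportError at runtime.
        missing = []
        for cmd in commands:
            if shutil.which(cmd) is None:
                missing.append(f"command not found on PATH: {cmd}")
        for lib in libraries:
            if ctypes.util.find_library(lib) is None:
                missing.append(f"shared library not found: {lib}")
        return missing

    # e.g. missing_external_deps(commands=["git"], libraries=["z"])

This is only the detection half; mapping a missing library back to an installable platform package is the part that needs the spec language.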

How hard would it be to have conda-forge provide its metadata?

  • @scopatz: not too hard; it could be done. We can make this simpler by providing it for tools. PyPI can provide metadata. Conda even has this already, and it may be repurposable

Are there other sources of inspiration/shared functionality?

  • Ansible
  • Chef (resources and providers)
  • End users are able to extend the list of providers should they be using a distribution or environment where the default providers do not match the running system

What about platforms other than Linux?

  • (Nick) Windows and macOS are likely to be somewhat easier – much stricter ABIs, with bundling → less variation to cope with

Is there any way to shrink the scope of the problem?

  • Advanced devs can usually figure out how to fix a dependency problem from a typical error message (‘command not found’ or ‘libxyz.so not found’).
  • The key thing is(?) to make it clearer to inexperienced users what the problem is: capture the ‘exception’ and provide a better message

Should there be a focus on one type of dependency?

  • I.e., missing commands vs missing libraries
  • Probably not: it seems likely that solving one type will provide machinery that makes solving the other relatively easy

Actions:

  • Draft a PEP (or update existing) to define a spec
  • First crack at integration between PyPI/pip + conda-forge
  • Start a thread for this topic to form a working group?

Tennessee Leeuwenburg’s draft PEP mentioned in the above notes on external dependencies: https://github.com/pypa/interoperability-peps/pull/30

Great summary, Sumanah. Would it make sense to pin this post or turn it into a wiki?

Adding my 2 cents: I started this mini-spec with several other folks at https://github.com/package-url to provide a platform-neutral way to reference a package.

It has been adopted by a few tools, and I also gave a presentation about it at FOSDEM: https://archive.fosdem.org/2018/schedule/event/purl/

I could see Python package metadata referencing external, non-provided dependencies without specifying how these would be provisioned/installed/etc., such as a dependency on an npm package, a RubyGem, an RPM, or a Debian package, e.g. something along these lines:

External-Requires =
    pkg:npm/foobar@12.3.1
    pkg:npm/baz@12.3.1
    pkg:rpm/curl@7.50
    pkg:deb/libcurl3-gnutls

… where each Package URL references a package in another ecosystem… and we could then:

  1. have tools display that information if they cannot handle it

  2. have smart tools that know how to install an npm package, an RPM, a Cargo crate, or a Debian package, and would be able to do something with these

  3. have smart tools that know how to check whether such a dependency is present (even if they may not know how to install it)
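
For anyone wanting to experiment with this, Package URLs are simple enough to parse. A minimal sketch that deliberately ignores qualifiers, subpaths, and the encoding rules the full spec covers (the package-url project also maintains real parsers, including a Python one, which would be the right thing to use in practice):

    def parse_purl(purl):
        # Minimal split of "pkg:type/namespace/name@version" into parts.
        if not purl.startswith("pkg:"):
            raise ValueError(f"not a package URL: {purl}")
        rest = purl[len("pkg:"):]
        ptype, _, path = rest.partition("/")
        path, _, version = path.partition("@")
        namespace, _, name = path.rpartition("/")
        return {
            "type": ptype,
            "namespace": namespace or None,
            "name": name,
            "version": version or None,
        }

    # parse_purl("pkg:rpm/curl@7.50")
    # -> {'type': 'rpm', 'namespace': None, 'name': 'curl', 'version': '7.50'}

A tool that cannot act on an External-Requires entry could at least parse it this far and display it, per option 1 above.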