Drawing a line to the scope of Python packaging

I think it’s more that conda knows it’s broken and stops you, while pip has no idea because the specific build isn’t pinned, only a version range. (But every time this comes up it gets split into its own thread, so I’m not going to say any more - go read one of the other threads.)

You make a lot of good points. I could see that it would help if python.org and packaging.python.org made it clear that people are choosing a particular approach to installing Python when they download from python.org and use only pip.

I’m not sure it would convince all the package authors to not just tell people to “pip install” but perhaps what I should do is spend time convincing the “other ways to download and install Python” to also override “pip install” — given how prominent the notion of “pip install” is in every instruction set.

Although, even as I write it, I remember why we didn’t do that with conda (i.e. override pip): since pip can be used for so many workflows, it would be a recipe for maintenance nightmares. Of course, I suppose the replaced pip could just override the “install” command.

But how do people feel about the idea I’ve heard Nick and others promote of using “python -m pip install” as the proper spelling of “install this package”?

“… because it’s better than pip install”? Or “and instead use ‘install this package’ to not over-promote pip”?

Please don’t. Ubuntu/Debian make some small tweaks to their version of pip compared to upstream, and it causes substantial confusion and problems, because users don’t know which version they’re using and when things go wrong no-one knows what’s going on or how to help. Replacing pip install entirely would be 10x worse.

Maybe it would be viable to have pip refuse to install into an environment that it knows is managed by some external package manager (like /usr on Linux, or a conda environment) unless some explicit override flag is passed?
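A minimal sketch of what such a check could look like, assuming a conda env is recognized by its conda-meta directory and a distro-managed prefix by some agreed-upon marker file (the marker name and the override flag are purely illustrative; pip implements nothing like this today):

import os
import sys

def external_manager(prefix=sys.prefix):
    # conda environments contain a conda-meta/ directory
    if os.path.isdir(os.path.join(prefix, "conda-meta")):
        return "conda"
    # hypothetical marker a distro could drop into its managed prefix
    if os.path.isfile(os.path.join(prefix, "MANAGED-BY-DISTRO")):
        return "the system package manager"
    return None

def check_install_allowed(override=False):
    manager = external_manager()
    if manager and not override:
        raise SystemExit(
            f"This environment appears to be managed by {manager}; "
            "pass --allow-external-env (hypothetical flag) to install anyway."
        )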

Convincing package authors to change their instructions is a somewhat separate issue – the folks in this thread don’t have any direct control over what package authors put in their install instructions. And package authors are one of the audiences for whom pip has some substantial advantages over conda. Users who use pip install get exactly the package that the package author uploaded. As soon as they make a release, pip users immediately have access to their package. If users have problems, the package author can help them. If you want the users to try out a pre-release to see if it fixes their problem, that’s a heck of a lot easier when the user is using pip. If you add conda as an intermediary, this will in many cases make things better for the user, no question, but the benefits to the authors are much smaller and often negative. So… if you want to convince them to change their instructions, you might need to figure out how to fix that.


I think we do. If we specify guidelines on https://packaging.python.org/ with a detailed explanation of why we suggest a given way, I think package maintainers will adopt and follow them.

Very dangerous, because it would be backwards incompatible. For example, installing into such an environment from within Docker is fine.

I would champion pip having a mode to install some things the way pipx does (virtualenv/isolated/user level). This way all Python tools (black/flake8/cookiecutter/etc.) should be recommended to be installed in this mode, preserving the sanity of the global site-packages.
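Roughly the kind of isolation meant here, along the lines of what pipx does; the tool name and directory layout below are just placeholders:

import subprocess
import venv
from pathlib import Path

def install_tool_isolated(tool, home=Path.home() / ".local" / "tool-envs"):
    # one virtual environment per tool, so its dependencies never touch
    # the global site-packages
    env_dir = home / tool
    venv.EnvBuilder(with_pip=True).create(env_dir)
    pip = env_dir / "bin" / "pip"  # Scripts\pip.exe on Windows
    subprocess.check_call([str(pip), "install", tool])
    # a real tool would also expose the entry points on PATH,
    # e.g. by symlinking them into ~/.local/bin, as pipx does

install_tool_isolated("black")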


How does pip track compatibility with software such as scikit-learn that is built against the numpy C API? Scikit-learn is nominally compatible back to numpy 1.8 (which is amazing), but in practice the baseline works out to be whatever numpy version was used at compile time. The compatibility of a package’s dependencies is often more complicated than just the Python side of the story. Conda’s constraints work the same as pip’s, in that they are generally a name plus a version range, but I think the (data science/scientific) community is more used to considering binary compatibility when expressing its constraints. We have been guided especially by the excellent site at ABI Tracker: Tested libraries.
This consideration is critically important for the data science community, where compiled code tends to be more common. The scikit-learn developers hide this complexity from users by being careful to always compile against old numpy versions, but a new package contributor could easily miss this subtlety and claim compatibility where there isn’t actually compatibility in practice.

It is inaccurate to say that only pip “knows” something is broken. Conda can read in pip-installed metadata and act on it. This was added in conda 4.6 (January 2019). It can’t directly read metadata from PyPI (yet?). Both conda and pip (and probably other package managers) know that some existing env is broken based on the same metadata, and conda has a bit more metadata for the lower-level packages that pip doesn’t currently express. I trust that pip’s solver, when implemented, will greatly improve how pip recognizes, prevents, and otherwise deals with brokenness.

I think that conda packages of python packages include enough standard metadata for pip to understand them natively, but that doesn’t include the conda-only metadata. It would be nice (but really not reasonable) for pip to help manage conda’s metadata in the same way that conda manages pip’s metadata. I say unreasonable because it’s definitely out of scope for pip, and not scalable to generalize to all other potential external sources of metadata. Pip operates with conda in the same way that pip operates within an operating system. Perhaps there should be a way that package managers can provide plugins for pip, such that pip could just call some hook, and any registered package managers for a given env/space could proceed to adjust their own metadata accordingly to match pip’s changes.
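Nothing like this hook exists in pip today, but the shape of such a plugin interface could be as simple as the following sketch (all names hypothetical):

# hypothetical plugin interface -- pip has no such extension point today
_registered_managers = []

class ExternalManagerPlugin:
    name = "conda"

    def on_environment_changed(self, installed, removed):
        # rewrite this manager's own metadata (e.g. conda-meta/) so it
        # stays consistent with what pip just installed or removed
        ...

def register(plugin):
    _registered_managers.append(plugin)

def notify_external_managers(installed, removed):
    # pip would call this once per install/uninstall transaction
    for plugin in _registered_managers:
        plugin.on_environment_changed(installed, removed)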

As much as possible, pip should not do things that make it impossible for other package managers controlling the same space to be consistent/correct. In other words, introducing packages that have conflicting constraints imposes an impossible problem on the external package managers. Once an environment is inconsistent, things start getting really strange and broken. This isn’t news to anyone, but if pip knows about creating inconsistencies, there really should be a way to make preventing inconsistencies the default behavior. People still need to be able to force inconsistencies, because sometimes dependency metadata is bad. There needs to be ways to fix it. We “hotfix” our index. I don’t know what the right answer might be for PyPI. Once bad metadata (e.g. an overly broad constraint) is available to a solver, it can be very hard to get sane answers without either altering metadata or removing problem packages.

I really don’t want to get into “conda this, pip that.” Metadata is key to all of us. The conda and pip (and spack and yum and apt and…) communities would all benefit from sharing better dependency data. I think this might be part of what Tidelift is trying to do. The metadata that I hope we can discuss at PyCon specifically is metadata that fully expresses binary dependencies. Conda does so only indirectly right now (standardizing on toolchains and effectively encoding this information into version constraints). I see platform tags as another indirect way to lay out compatibility. Any notion of external library dependencies in PyPI packages needs a reliable way to know what package provides the necessary dependency (yum whatprovides), and also a way to know that the necessary dependency is compatible with a specific compiled binary. Can we get to a finer-grained view of metadata that lets us understand that a pip package’s compiled extension needs xyz 1.2.3, which can be satisfied by a package on CentOS 6 or with conda, but not on CentOS 5 because a glibc symbol is missing, and not with Ubuntu 16.04 or Fedora 19 system libraries because a specific C++ ABI was used?


I would expect the resulting wheel to have a requirement on >=$NUMPY_I_BUILT_AGAINST. There’s no requirement that the build depends and install depends have the same version range.

Yes. Is there an expression in setup.py or requirements.txt for that? Does pip know about this when creating wheels? Should it? What would a solver do when presented with a build requirement of >=1.8? Your post from ages ago on “setup.py vs requirements.txt” (setup.py vs requirements.txt · caremad) was and is excellent, but an awful lot of people still only provide requirements.txt anyway. On top of that, it appears that only pyproject.toml supports build requirements (No way to specify build-time dependencies of setup.py · Issue #2381 · pypa/pip · GitHub) - I suppose several projects have found hacks to achieve it with setup.py. The problem is not one of “are there ways to do this right?” but one of “how easy is it to do it the right way?”, where the “right way” means the packager doesn’t give up in frustration before the package is built, the package “just works” wherever possible, and it provides meaningful feedback about why it won’t work otherwise.

Are there established community guidelines for how to understand and correctly handle situations where binary compatibility come into play? On the build side? On the user install side?

These are all things that can be done with any tool in any ecosystem, and I don’t really mean to say “conda’s better because this” - I mean to point out particular workflows that require extra attention, and which would benefit from additional metadata.

For the numpy case specifically, the correct thing to do is for people who use the binary ABI to put something like this in their setup.py:

import numpy as np  # the numpy actually present at build time

install_requires=[
    # require at least the major.minor version we are compiling against
    "numpy >= " + ".".join(np.__version__.split(".")[:2]),
    ...
]

This is sort of silly – numpy ought to provide a helper, so you could write install_requires=[np.get_abi_version_constraint(), ...] or something. And I don’t think anyone actually does this correctly right now.
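Such a helper would be tiny. A sketch of what it might look like, just to be clear that numpy does not actually ship this function:

import numpy as np

def get_abi_version_constraint():
    # hypothetical helper: require at least the numpy ABI present at build time
    major, minor = np.__version__.split(".")[:2]
    return "numpy >= {}.{}".format(major, minor)

# in setup.py: install_requires=[get_abi_version_constraint(), ...]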

But the actual constraints are totally specifiable in the package metadata – the issues are just that we don’t make it easy to generate the correct metadata. So it’s a build system problem, not a package format problem.

In practice this almost never bites people, because either they’re using a pre-compiled wheel from PyPI that was carefully built to have the correct install-requires metadata, or else they’re building locally against whatever version of numpy they’re using, and people rarely downgrade numpy. It’d be nice to fix but it’s not currently a big pain point AFAICT.

Now, part of the reason why this works, is because numpy has intentionally designed its binary compatibility guarantees to make it work. If packaging metadata was more flexible/powerful, then numpy would have more flexibility about how to evolve its ABI. At some point I came up with an idea to handle this, by allowing what Debian calls “virtual packages” – so e.g. numpy 1.16 might say “I can also fulfill requirements for the packages ‘numpy[abi-1]’ and ‘numpy[abi-2]’”, and then packages built against numpy would declare that they required ‘numpy[abi-2]’ to be available, and numpy could potentially drop support for old ABIs over time. There are probably other ways too.
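As a toy illustration of that virtual-package idea (not how pip or Debian actually resolve anything), each candidate would advertise the ABI names it can stand in for:

# toy model: each numpy release lists the virtual ABI names it provides
provides = {
    ("numpy", "1.8.0"): {"numpy", "numpy[abi-1]"},
    ("numpy", "1.16.0"): {"numpy", "numpy[abi-1]", "numpy[abi-2]"},
}

def satisfies(requirement_name, candidate):
    # a wheel built against the newer ABI would require "numpy[abi-2]",
    # which only the newer numpy release provides in this toy example
    return requirement_name in provides[candidate]

assert satisfies("numpy[abi-2]", ("numpy", "1.16.0"))
assert not satisfies("numpy[abi-2]", ("numpy", "1.8.0"))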

Practically speaking, I suspect it will be easier to teach conda how to download and install wheels from PyPI natively, and make pip say “hey, this looks like a conda env, you should install wheels using conda”, than it will be for pip to figure out how to natively manipulate metadata for (conda, apt, rpm, nix, apk, homebrew, …).

This sounds like some really important experience, and I’d love to hear more about it. Ideally before pip’s resolver is implemented and we get to rediscover the issues from scratch :-). If you’re up for it, maybe you could start a new thread to share some war stories?

I’ve always pushed back against this, because inventing a tool that can compute whether two arbitrary binaries are ABI-compatible on the fly is a major research project. And even if you had it, it’s not at all clear to me that it would help anyway – even if you know that there might be some library somewhere in your package repo that has the symbols this wheel wants, how can you find that binary? The end goal is to compile these abstract ABI constraints into some expression in the underlying system’s package language, that rpm or conda or whatever can understand… but they don’t work like that; they want package names and version constraints.

So IMO the simpler and more useful approach is to make it possible for a wheel to say: “here’s a list of conda packages and version constraints that this wheel needs”. Of course now that wheel only works on conda, but on the other hand… it will work on conda! No open-ended research project required :-)

This idea is also being discussed in more detail at Support for build-and-run-time dependencies - #37 by jdemeyer (@njs: no I’m not trying to hijack this thread, this is a good-faith attempt to move the ABI-compatibility discussion to one place).

This was my point: having a need for things to be “carefully built” implies room for error and confusion when people are either careless or just not knowledgeable. The fact that it doesn’t really bite people is more a testament to the expertise of the people building the most common packages than proof that it just isn’t a problem.

So, should pip then also recognize and use apt/rpm/nix/apk/homebrew/… where appropriate? There’s special metadata that makes conda envs pretty recognizable. Is the same true for all the others? If conda is directly installing wheels, then is it recursively calling pip to figure out dependencies?

I said it was not reasonable for pip to understand native metadata, but rather that pip should have an extension point where native package managers would register their own helpers for keeping their native metadata in line with what pip has changed.

You’re assuming that it has to be part of the actual solve. I don’t think it really does (at least not in early phases of the solve). Use tools like “whatprovides” to match file names to package names. Resolve package names, versions, and dependency relationships next. Maybe add in readily identifiable things like the C++ ABI and perhaps more well-understood things like a glibc baseline requirement (which platform tags sort of capture right now, but which I’m arguing would be better specified directly). Examining all symbols might be necessary to totally ensure compatibility or explain incompatibility, but it’s probably not.
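For example, on an RPM-based system the file-name-to-package-name mapping is already queryable; a thin wrapper might look like this (the library name is just an example):

import subprocess

def rpm_whatprovides(capability):
    # ask the RPM database which installed package provides a capability;
    # on 64-bit systems shared-library capabilities are spelled like
    # "libxyz.so.1()(64bit)"
    result = subprocess.run(
        ["rpm", "-q", "--whatprovides", capability],
        capture_output=True, text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None

print(rpm_whatprovides("libxyz.so.1()(64bit)"))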

While I appreciate the sentiment of getting wheels that work better with conda, the real goal should be getting wheels that work better with arbitrary external package managers. Can we do so in a way that does not explode the number of wheels necessary to cover the space? How much can we minimize additional work for package maintainers (and/or CI/CD services)? People want to solve this problem - the Tensorflow and Pytorch teams, for example, have really chafed on manylinux1 as a platform tag. Perhaps figuring this out is a better use of effort than coming up with new platform tags, which seems to be the current thrust of effort from a few sides (including your own). Would platform tags still be necessary at all? Perhaps only for different OS (MacOS, Linux, Win) and CPU type. I understand that you personally do not want to go down this rabbit hole, but why push back against other people exploring the space?


I don’t think I understand what this is suggesting, and I think there is some confusion maybe? There is nothing inherent in wheels that requires static linking or whatever. That’s just the only way we currently have to satisfy non-Python dependencies. Generally I think pip is not going to become a general package manager, so those sorts of things are out of scope. It’s possible it could get some sort of plugin system for system-level package managers to provide dependencies that aren’t Python packages. It’s also possible we can just define more platforms for people, e.g. a conda platform could exist where wheels can depend on stuff that is available in conda, or a debian, ubuntu, rhel, etc.

What this is suggesting is that in order to make a plugin system where external package providers could be used, there should be a unifying description of what dependency is needed. Defining more platforms is certainly a way to do that. I think it’s not a very good way. For conda in particular, you’re assuming that all conda libraries belong to the same platform. On Linux, we have at least two: the old “free” channel and the new “main” channel. Channels generally represent collections that are built with the same toolchain and are compatible.

A better, more explicit description of what external dependency is needed is what I’m getting at. A platform tag rolls up too much information in one value. Instead of having many wheels that express dependencies on specific package systems, I propose that there be many fewer wheels that express dependency on specific external libraries, instead. It is up to external plugins to then determine how/if they can satisfy the need for those libraries.

So, instead of external dependencies that look like:

conda:main:xyz>=1.2.11
conda:conda-forge:xyz>=1.2.11
rhel:6:libxyz
rhel:7:xyz
...

where undefined platforms are probably not supported at all, we could instead have:

libxyz.so.1, from reference project xyz, version 1.2.3, requiring a minimum glibc of 2.12, with C++ ABI 9

and then it’s up to conda, or yum, or apt, or whatever to say “oh, I have that in this package. Let me install it and help make sure that your python extension can find it” or “that’s not compatible with my glibc, I need to tell the user to try to find another source for this package.”
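Spelled out as data, the spec being proposed might look roughly like this sketch (all field names are illustrative; no such format exists today):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ExternalLibSpec:
    soname: str              # "libxyz.so.1"
    project: str             # upstream reference project, "xyz"
    min_version: str         # "1.2.3"
    min_glibc: str           # "2.12"
    cxx_abi: Optional[int]   # e.g. 9, or None for a C-only library

class Provider:
    """Interface a conda/yum/apt plugin would implement."""
    def satisfy(self, spec: ExternalLibSpec) -> bool:
        # return True after locating or installing a compatible library,
        # False if this provider cannot satisfy it on this system
        raise NotImplementedError

spec = ExternalLibSpec("libxyz.so.1", "xyz", "1.2.3", "2.12", 9)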

What’s the minimum amount of metadata that we can use to completely specify a library? I don’t know exactly. Anything that imposes a version constraint on some library provided by the external provider. I think what I posted above is a decent start on Linux. For MacOS, it may really be that platform tags as they are are good enough representations of the binary compatibility and MacOS version required at runtime. For Windows, Python has been completely matched to particular VS versions. The new VS is much, much better in this regard, but it would still be nice to have libraries represent their minimum runtime requirements.

My suggestion is: if conda is able to get feature parity with pip install/pip uninstall, then it would make sense for pip to simply refer people to conda instead of going around and mangling data. In this approach, the actual change to pip would just be to (a) recognize conda environments, and (b) add a few lines of code to print “hey this is a conda env, you should run conda <whatever>, it works better”.

Pip’s dependency management is all based on documented standards, and large parts of it are already available as standalone python libraries. So in my vision, you’re not calling out to pip to resolve dependencies, you’re natively parsing the wheel metadata and then feeding it into your existing dependency resolver.
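For instance, reading a wheel’s Requires-Dist entries needs nothing pip-specific: just the zip layout and the packaging library that pip itself vendors (the wheel filename below is a placeholder):

import zipfile
from email.parser import Parser
from packaging.requirements import Requirement

def wheel_requirements(wheel_path):
    with zipfile.ZipFile(wheel_path) as whl:
        # every wheel carries an RFC 822-style *.dist-info/METADATA file
        name = next(n for n in whl.namelist() if n.endswith(".dist-info/METADATA"))
        metadata = Parser().parsestr(whl.read(name).decode("utf-8"))
    return [Requirement(r) for r in metadata.get_all("Requires-Dist") or []]

for req in wheel_requirements("example-1.0-py3-none-any.whl"):
    print(req.name, req.specifier, req.marker)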

I don’t really know what “keeping their native metadata in line with what pip has changed” means, concretely. For all the package managers I know, this would just be “uh… this system is irrecoverably screwed, I guess we can make a note of that?”. Conda cares about interoperating with pip about two orders of magnitude more than any of these other systems do.

Why? What problems is this trying to solve, and why do you believe that it will solve them?

Like, I get it. The dream of feeding the right metadata into the right magical package management system and having everything work together seamlessly – that sounds amazing!

But… the reason manylinux doesn’t let you rely on the system openssl is because popular linux vendors have genuine disagreements about the openssl ABI, and C++ ABIs, and all this stuff. That’s the hard fact that puts some pretty strict limits on how close that dream can come to reality. The only ways to make binaries that work across systems are (a) shipping your own libraries, (b) building lots of vendor-specific wheels. It seems like the best you can hope for from this super complex research project is that instead of needing 8 different vendor-specific wheels, you manage to merge some together and get 5 different vendor-specific wheels instead. Or… you could use manylinux and ship one wheel and be done with it.

Conda-specific wheels are interesting because there’s the additional possibility that conda could seamlessly track mixed wheel/conda installs, upgrade both together, etc. And because there’s a lot of demand from conda users for this. Other vendors could do this too of course, but in practice I don’t think anyone else is interested.

I just don’t want you all to go off down the rabbit hole chasing a dream that sounds fabulous but is ultimately impossible. It’s really easy to waste a lot of time and resources on that kind of thing.

I don’t really like special-casing conda this way, but if that’s the best way here, I certainly think conda can and should add the PyPI metadata handling that you describe. The old issues of arbitrary code execution with setup.py from source-only distributions are still there, but wheels should be great (barring any incorrect dependency expressions).

For us on the conda side, the problems are just the bypass of the solver, which causes problems down the line. I think your proposed solution is fine for us.

For others who feel like they must publish wheels in order to provide their software, there’s a lot of frustration with manylinux and its lack of support for modern C++. One of my main hopes is to get metadata that directly represents the ways in which a given wheel is not compatible with a particular system. manylinux is a great one-stop shop when people actually follow the standard. There are many use cases (those around CUDA especially come to mind) that absolutely can’t follow the current manylinux options, either for technical reasons or for legal ones, but that isn’t stopping them from presenting their software on PyPI as manylinux. For the technical ones (generally, new C++ standard support), maybe the answer is newer manylinux images. Your perennial manylinux idea gets closer to something that lowers the bar to have new manylinux tags, but matching only glibc seems to ignore the C++ concerns (or at least not address them directly), and it may still have problems for more demanding software that needs a newer glibc than the manylinux team is prepared to provide a standard for.

The thing I don’t like about this idea is that it imposes more work on the packager and bifurcates the build process. What fraction of the scientific wheel ecosystem uses static linking to satisfy its compiled-library needs? The analogous conda-using packages would need to be built in a totally separate way to use conda-provided shared libraries instead. Is that worth people’s time? Maybe from the package consumer’s standpoint, but I don’t want to add that to the already often onerous task of packaging.


The detail I missed here is that you would probably use completely different toolchains to compile for conda than you would for manylinux. You may use very different compiler flags, too. It might make more sense to take an existing conda package and turn it into a wheel than trying to impose new builds on current wheel builders. That opens questions about who is doing the build, but those are social questions, not technical ones for this discussion.

Regarding the publishing of “not really manylinux” wheels as manylinux on PyPI: when manylinux was rolled out initially, it was assumed that folks would be good citizens and not lie.

Since that’s not the case, there’s a feature for PyPI under discussion on the issue tracker to disallow uploading such wheels. To be clear, such wheels are definitely something we do not want to allow. IIUC, the current blocker is someone implementing this and figuring out the timelines.

https://github.com/pypa/warehouse/issues/5420


People discussed external dependencies during the packaging minisummit at PyCon North America in May 2019. Moving notes by @btskinn and @crwilcox here for easier searchability.

@msarahan championed this discussion:

Expression of dependencies on externally provided software (things that pip does not/should not provide). Metadata encompassing binary compatibility that may be required for these expressions.

[note from @tgamblin : "FWIW, Spack supports external dependencies, and has a fairly lightweight way for users to specify them (spec + path)

https://spack.readthedocs.io/en/latest/build_settings.html#external-packages

We do not auto-detect them (yet). We’ll likely tackle auto-detecting build dependencies (binaries) before we tackle libraries, where ABI issues come into play."]

What metadata can we add to increase understanding of compatibility?

a. Make metadata more direct
b. Check for libs rather than just provide them
c. manylinux1 does some things that move toward this
d. We don’t really consider different workflows separately. Could improve docs, guiding users to other tools, rather than defaulting to one tool that isn’t the right fit.
e. Can we design standards/interchange formats for interoperating between the tools? Should PyPA provide a standard?
f. A key goal is to avoid lock-in to PyPA tools
g. Tools for people who don’t want to rely on an external package manager for provisioning Pythons

  • I.e., yum, apt, brew, conda

h. Need to ensure appropriate bounds are placed on pip’s scope

Draft PEP (@ncoghlan)

  • Challenge is expressing dependencies in a way to expose actual runtime dependencies

    • Particular dependencies of interest are (1) commands on PATH and (2) dynamic libraries on, e.g., LD_LIBRARY_PATH
  • For practical usage, automatic generation is likely needed → usually not human-readable

  • Developers don’t explicitly know these.

  • auditwheel may generate these?

pip as user of the metadata spec

  • Once the metadata spec is written, pip could likely check for these dependencies, and fail if they’re not met

  • Unlikely pip could be made to meet (search-and-find) any missing dependencies

  • Where do we get the mapping(s) from dependency names ←→ platform-localized package names? → step 1: spec language

How hard would it be to have conda-forge provide its metadata?

  • @scopatz: not too hard and could be done. We can make this simpler by providing this for tools. PyPI can provide metadata. Conda even has this already, and it may be repurposable.

Are there other sources of inspiration/shared functionality?

  • Ansible
  • Chef (resources and providers)
  • End users are able to extend the list of providers should they be using a distribution or environment where the default providers do not match the running system

What about platforms other than Linux?

  • (Nick) Windows and MacOS are likely to be somewhat easier – much stricter ABIs, with bundling → less variation to cope with

Is there any way to shrink the scope of the problem?

  • Advanced devs usually can know what to do to fix a dependency problem based on a typical error message (‘command not found’ or ‘libxyz.so not found’).
  • The key thing is(?) to make it clearer to inexperienced users what the problem is -- capture the ‘exception’ and provide a better message

Should there be a focus on one type of dependency?

  • I.e., missing commands vs missing libraries
  • Probably not: Seems likely that solving one type will provide machinery to make solving the other relatively easy

Actions:

  • Draft a PEP (or update existing) to define a spec
  • First crack at integration between PyPI/pip + conda-forge
  • Start a thread for this topic to form a working group?

Tennessee Leeuwenburg’s draft PEP mentioned in the above notes on external dependencies: https://github.com/pypa/interoperability-peps/pull/30


Great summary, Sumanah. Would it make sense to pin this post or turn it into a wiki?
