Drawing a line to the scope of Python packaging

I think it’s more that conda knows it’s broken and stops you, while pip has no idea because the specific build isn’t pinned, only a version range. (But every time this comes up it gets split into its own thread, so I’m not going to say any more - go read one of the other threads.)

You make a lot of good points. I could see that it would help if python.org and packaging.python.org made it clear that people are choosing a particular approach to installing Python when they download from python.org and use only pip.

I’m not sure it would convince all the package authors to not just tell people to “pip install” but perhaps what I should do is spend time convincing the “other ways to download and install Python” to also override “pip install” — given how prominent the notion of “pip install” is in every instruction set.

Although, even as I write it, I remember why we didn’t do that with conda (i.e. override pip): since pip can be used for so many workflows, it would be a recipe for maintenance nightmares. Of course, I suppose the replaced pip could just override the “install” command.

But how do people feel about the idea I’ve heard Nick and others promote of using “python -m pip install” as the proper spelling of “install this package”?

“… because it’s better than pip install”? Or “and instead use ‘install this package’ to not over-promote pip”?

Please don’t. Ubuntu/Debian make some small tweaks to their version of pip compared to upstream, and it causes substantial confusion and problems, because users don’t know which version they’re using and when things go wrong no-one knows what’s going on or how to help. Replacing pip install entirely would be 10x worse.

Maybe it would be viable to have pip refuse to install into an environment that it knows is managed by some external package manager (like /usr on Linux, or a conda environment) unless some explicit override flag is passed?
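A minimal sketch of what such a check could look like, assuming a conda env is recognized by its conda-meta directory and a distro-managed prefix by some agreed-upon marker file (the marker name and the override flag are purely illustrative; pip implements nothing like this today):

import os
import sys

def external_manager(prefix=sys.prefix):
    # conda environments contain a conda-meta/ directory
    if os.path.isdir(os.path.join(prefix, "conda-meta")):
        return "conda"
    # hypothetical marker a distro could drop into its managed prefix
    if os.path.isfile(os.path.join(prefix, "MANAGED-BY-DISTRO")):
        return "the system package manager"
    return None

def check_install_allowed(override=False):
    manager = external_manager()
    if manager and not override:
        raise SystemExit(
            f"This environment appears to be managed by {manager}; "
            "pass --allow-external-env (hypothetical flag) to install anyway."
        )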

Convincing package authors to change their instructions is a somewhat separate issue – the folks in this thread don’t have any direct control over what package authors put in their install instructions. And package authors are one of the audiences for whom pip has some substantial advantages over conda. Users who use pip install get exactly the package that the package author uploaded. As soon as they make a release, pip users immediately have access to their package. If users have problems, the package author can help them. If you want the users to try out a pre-release to see if it fixes their problem, that’s a heck of a lot easier when the user is using pip. If you add conda as an intermediary, this will in many cases make things better for the user, no question, but the benefits to the authors are much smaller and often negative. So… if you want to convince them to change their instructions, you might need to figure out how to fix that.


I think we do. If we specify guidelines on https://packaging.python.org/ with a detailed explanation of why we suggest a given way, I think package maintainers will adopt and follow them.

Very dangerous, because it would be backwards incompatible. For example, installing into such an environment from within Docker is fine.

I would champion pip having a mode to install some things the way pipx does (virtualenv/isolated/user level). This way all Python tools (black/flake8/cookiecutter/etc.) should be recommended to be installed in this mode, preserving the sanity of the global site-packages.
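Roughly the kind of isolation meant here, along the lines of what pipx does; the tool name and directory layout below are just placeholders:

import subprocess
import venv
from pathlib import Path

def install_tool_isolated(tool, home=Path.home() / ".local" / "tool-envs"):
    # one virtual environment per tool, so its dependencies never touch
    # the global site-packages
    env_dir = home / tool
    venv.EnvBuilder(with_pip=True).create(env_dir)
    pip = env_dir / "bin" / "pip"  # Scripts\pip.exe on Windows
    subprocess.check_call([str(pip), "install", tool])
    # a real tool would also expose the entry points on PATH,
    # e.g. by symlinking them into ~/.local/bin, as pipx does

install_tool_isolated("black")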


How does pip track compatibility with software such as scikit-learn that is built against the numpy C API? Scikit-learn is nominally compatible back to numpy 1.8 (which is amazing), but in practice the baseline works out to be whatever numpy version was used at compile time. The compatibility of a package’s dependencies is often more complicated than just the Python side of the story. Conda’s constraints work the same as pip’s, in that they are generally a name plus a version range, but I think the (data science/scientific) community is more used to considering binary compatibility when expressing its constraints. We have been guided especially by the excellent site at ABI Tracker: Tested libraries.
This consideration is critically important for the data science community, where compiled code tends to be more common. The scikit-learn developers hide this complexity from users by being careful to always compile against old numpy versions, but a new package contributor could easily miss this subtlety and claim compatibility where there isn’t actually compatibility in practice.

It is inaccurate to say that only pip “knows” something is broken. Conda can read in pip-installed metadata and act on it. This was added in conda 4.6 (January 2019). It can’t directly read metadata from PyPI (yet?). Both conda and pip (and probably other package managers) know that some existing env is broken based on the same metadata, and conda has a bit more metadata for the lower-level packages that pip doesn’t currently express. I trust that pip’s solver, when implemented, will greatly improve how pip recognizes, prevents, and otherwise deals with brokenness.

I think that conda packages of python packages include enough standard metadata for pip to understand them natively, but that doesn’t include the conda-only metadata. It would be nice (but really not reasonable) for pip to help manage conda’s metadata in the same way that conda manages pip’s metadata. I say unreasonable because it’s definitely out of scope for pip, and not scalable to generalize to all other potential external sources of metadata. Pip operates with conda in the same way that pip operates within an operating system. Perhaps there should be a way that package managers can provide plugins for pip, such that pip could just call some hook, and any registered package managers for a given env/space could proceed to adjust their own metadata accordingly to match pip’s changes.
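Nothing like this hook exists in pip today, but the shape of such a plugin interface could be as simple as the following sketch (all names hypothetical):

# hypothetical plugin interface -- pip has no such extension point today
_registered_managers = []

class ExternalManagerPlugin:
    name = "conda"

    def on_environment_changed(self, installed, removed):
        # rewrite this manager's own metadata (e.g. conda-meta/) so it
        # stays consistent with what pip just installed or removed
        ...

def register(plugin):
    _registered_managers.append(plugin)

def notify_external_managers(installed, removed):
    # pip would call this once per install/uninstall transaction
    for plugin in _registered_managers:
        plugin.on_environment_changed(installed, removed)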

As much as possible, pip should not do things that make it impossible for other package managers controlling the same space to be consistent/correct. In other words, introducing packages that have conflicting constraints imposes an impossible problem on the external package managers. Once an environment is inconsistent, things start getting really strange and broken. This isn’t news to anyone, but if pip knows about creating inconsistencies, there really should be a way to make preventing inconsistencies the default behavior. People still need to be able to force inconsistencies, because sometimes dependency metadata is bad. There needs to be ways to fix it. We “hotfix” our index. I don’t know what the right answer might be for PyPI. Once bad metadata (e.g. an overly broad constraint) is available to a solver, it can be very hard to get sane answers without either altering metadata or removing problem packages.

I really don’t want to get into “conda this, pip that.” Metadata is key to all of us. The conda and pip (and spack and yum and apt and…) communities would all benefit from sharing better dependency data. I think this might be part of what Tidelift is trying to do. The metadata that I hope we can discuss at PyCon specifically is metadata that fully expresses binary dependencies. Conda does so only indirectly right now (standardizing on toolchains and effectively encoding this information into version constraints). I see platform tags as another indirect way to lay out compatibility. Any notion of external library dependencies in PyPI packages needs a reliable way to know what package provides the necessary dependency (yum whatprovides), and also a way to know that the necessary dependency is compatible with a specific compiled binary. Can we get to a finer-grained view of metadata that lets us understand that a pip package’s compiled extension needs xyz 1.2.3, which can be satisfied by a package on CentOS 6 or with conda, but not on CentOS 5 because a glibc symbol is missing, and not with Ubuntu 16.04 or Fedora 19 system libraries because a specific C++ ABI was used?


I would expect the resulting wheel to have a requirement on >=$NUMPY_I_BUILT_AGAINST. There’s no requirement that the build depends and install depends have the same version range.

Yes. Is there an expression in setup.py or requirements.txt for that? Does pip know about this when creating wheels? Should it? What would a solver do when presented with a build requirement of >=1.8? Your post from ages ago on “setup.py vs requirements.txt” (setup.py vs requirements.txt · caremad) was and is excellent, but an awful lot of people still only provide requirements.txt anyway. On top of that, it appears that only pyproject.toml supports build requirements (No way to specify build-time dependencies of setup.py · Issue #2381 · pypa/pip · GitHub) - I suppose several projects have found hacks to achieve it with setup.py. The problem is not one of “are there ways to do this right?” but one of “how easy is it to do it the right way?”, where the “right way” means the packager doesn’t give up in frustration before the package is built, the package “just works” wherever possible, and it provides meaningful feedback about why it won’t work otherwise.

Are there established community guidelines for how to understand and correctly handle situations where binary compatibility come into play? On the build side? On the user install side?

These are all things that can be done with any tool in any ecosystem, and I don’t really mean to say “conda’s better because this” - I mean to point out particular workflows that require extra attention, and which would benefit from additional metadata.

For the numpy case specifically, the correct thing to do is for people who use the binary ABI to put something like this in their setup.py:

import numpy as np  # the numpy actually present at build time

install_requires=[
    # require at least the major.minor version we are compiling against
    "numpy >= " + ".".join(np.__version__.split(".")[:2]),
    ...
]

This is sort of silly – numpy ought to provide a helper, so you could write install_requires=[np.get_abi_version_constraint(), ...] or something. And I don’t think anyone actually does this correctly right now.
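Such a helper would be tiny. A sketch of what it might look like, just to be clear that numpy does not actually ship this function:

import numpy as np

def get_abi_version_constraint():
    # hypothetical helper: require at least the numpy ABI present at build time
    major, minor = np.__version__.split(".")[:2]
    return "numpy >= {}.{}".format(major, minor)

# in setup.py: install_requires=[get_abi_version_constraint(), ...]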

But the actual constraints are totally specifiable in the package metadata – the issues are just that we don’t make it easy to generate the correct metadata. So it’s a build system problem, not a package format problem.

In practice this almost never bites people, because either they’re using a pre-compiled wheel from PyPI that was carefully built to have the correct install-requires metadata, or else they’re building locally against whatever version of numpy they’re using, and people rarely downgrade numpy. It’d be nice to fix but it’s not currently a big pain point AFAICT.

Now, part of the reason why this works, is because numpy has intentionally designed its binary compatibility guarantees to make it work. If packaging metadata was more flexible/powerful, then numpy would have more flexibility about how to evolve its ABI. At some point I came up with an idea to handle this, by allowing what Debian calls “virtual packages” – so e.g. numpy 1.16 might say “I can also fulfill requirements for the packages ‘numpy[abi-1]’ and ‘numpy[abi-2]’”, and then packages built against numpy would declare that they required ‘numpy[abi-2]’ to be available, and numpy could potentially drop support for old ABIs over time. There are probably other ways too.
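As a toy illustration of that virtual-package idea (not how pip or Debian actually resolve anything), each candidate would advertise the ABI names it can stand in for:

# toy model: each numpy release lists the virtual ABI names it provides
provides = {
    ("numpy", "1.8.0"): {"numpy", "numpy[abi-1]"},
    ("numpy", "1.16.0"): {"numpy", "numpy[abi-1]", "numpy[abi-2]"},
}

def satisfies(requirement_name, candidate):
    # a wheel built against the newer ABI would require "numpy[abi-2]",
    # which only the newer numpy release provides in this toy example
    return requirement_name in provides[candidate]

assert satisfies("numpy[abi-2]", ("numpy", "1.16.0"))
assert not satisfies("numpy[abi-2]", ("numpy", "1.8.0"))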

Practically speaking, I suspect it will be easier to teach conda how to download and install wheels from PyPI natively, and make pip say “hey, this looks like a conda env, you should install wheels using conda”, than it will be for pip to figure out how to natively manipulate metadata for (conda, apt, rpm, nix, apk, homebrew, …).

This sounds like some really important experience, and I’d love to hear more about it. Ideally before pip’s resolver is implemented and we get to rediscover the issues from scratch :-). If you’re up for it, maybe you could start a new thread to share some war stories?

I’ve always pushed back against this, because inventing a tool that can compute whether two arbitrary binaries are ABI-compatible on the fly is a major research project. And even if you had it, it’s not at all clear to me that it would help anyway – even if you know that there might be some library somewhere in your package repo that has the symbols this wheel wants, how can you find that binary? The end goal is to compile these abstract ABI constraints into some expression in the underlying system’s package language, that rpm or conda or whatever can understand… but they don’t work like that; they want package names and version constraints.

So IMO the simpler and more useful approach is to make it possible for a wheel to say: “here’s a list of conda packages and version constraints that this wheel needs”. Of course now that wheel only works on conda, but on the other hand… it will work on conda! No open-ended research project required :-)

This idea is also being discussed in more detail at Support for build-and-run-time dependencies - #37 by jdemeyer (@njs: no I’m not trying to hijack this thread, this is a good-faith attempt to move the ABI-compatibility discussion to one place).

This was my point: having a need for things to be “carefully built” implies room for error and confusion when people are either careless or just not knowledgeable. The fact that it doesn’t really bite people is more a testament to the expertise of the people building the most common packages than proof that it just isn’t a problem.

So, should pip then also recognize and use apt/rpm/nix/apk/homebrew/… where appropriate? There’s special metadata that makes conda envs pretty recognizable. Is the same true for all the others? If conda is directly installing wheels, then is it recursively calling pip to figure out dependencies?

I said it was not reasonable for pip to understand native metadata, but rather that pip should have an extension point where native package managers would register their own helpers for keeping their native metadata in line with what pip has changed.

You’re assuming that it has to be part of the actual solve. I don’t think it really does (at least not in early phases of the solve). Use tools like “whatprovides” to match file names to package names. Resolve package names, versions, and dependency relationships next. Maybe add in readily identifiable things like the C++ ABI and perhaps more well-understood things like a glibc baseline requirement (which platform tags sort of capture right now, but which I’m arguing would be better specified directly). Examining all symbols might be necessary to totally ensure compatibility or explain incompatibility, but it’s probably not.
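For example, on an RPM-based system the file-name-to-package-name mapping is already queryable; a thin wrapper might look like this (the library name is just an example):

import subprocess

def rpm_whatprovides(capability):
    # ask the RPM database which installed package provides a capability;
    # on 64-bit systems shared-library capabilities are spelled like
    # "libxyz.so.1()(64bit)"
    result = subprocess.run(
        ["rpm", "-q", "--whatprovides", capability],
        capture_output=True, text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None

print(rpm_whatprovides("libxyz.so.1()(64bit)"))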

While I appreciate the sentiment of getting wheels that work better with conda, the real goal should be getting wheels that work better with arbitrary external package managers. Can we do so in a way that does not explode the number of wheels necessary to cover the space? How much can we minimize additional work for package maintainers (and/or CI/CD services)? People want to solve this problem - the Tensorflow and Pytorch teams, for example, have really chafed on manylinux1 as a platform tag. Perhaps figuring this out is a better use of effort than coming up with new platform tags, which seems to be the current thrust of effort from a few sides (including your own). Would platform tags still be necessary at all? Perhaps only for different OS (MacOS, Linux, Win) and CPU type. I understand that you personally do not want to go down this rabbit hole, but why push back against other people exploring the space?


I don’t think I understand what this is suggesting, and I think there is some confusion maybe? There is nothing inherent in wheels that requires static linking or whatever. That’s just the only way we currently have to satisfy non-Python dependencies. Generally I think pip is not going to become a general package manager, so those sorts of things are out of scope. It’s possible it could get some sort of plugin system for system-level package managers to provide dependencies that aren’t Python packages. It’s also possible we can just define more platforms for people, e.g. a conda platform could exist where wheels can depend on stuff that is available in conda, or a debian, ubuntu, rhel, etc.

What this is suggesting is that in order to make a plugin system where external package providers could be used, there should be a unifying description of what dependency is needed. Defining more platforms is certainly a way to do that. I think it’s not a very good way. For conda in particular, you’re assuming that all conda libraries belong to the same platform. On Linux, we have at least two: the old “free” channel and the new “main” channel. Channels generally represent collections that are built with the same toolchain and are compatible.

A better, more explicit description of what external dependency is needed is what I’m getting at. A platform tag rolls up too much information in one value. Instead of having many wheels that express dependencies on specific package systems, I propose that there be many fewer wheels that express dependency on specific external libraries, instead. It is up to external plugins to then determine how/if they can satisfy the need for those libraries.

So, instead of external dependencies that look like:

conda:main:xyz>=1.2.11
conda:conda-forge:xyz>=1.2.11
rhel:6:libxyz
rhel:7:xyz
...

where undefined platforms are probably not supported at all, we could instead have:

libxyz.so.1, from reference project xyz, version 1.2.3, requiring a minimum glibc of 2.12, with C++ ABI 9

and then it’s up to conda, or yum, or apt, or whatever to say “oh, I have that in this package. Let me install it and help make sure that your python extension can find it” or “that’s not compatible with my glibc, I need to tell the user to try to find another source for this package.”
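Spelled out as data, the spec being proposed might look roughly like this sketch (all field names are illustrative; no such format exists today):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ExternalLibSpec:
    soname: str              # "libxyz.so.1"
    project: str             # upstream reference project, "xyz"
    min_version: str         # "1.2.3"
    min_glibc: str           # "2.12"
    cxx_abi: Optional[int]   # e.g. 9, or None for a C-only library

class Provider:
    """Interface a conda/yum/apt plugin would implement."""
    def satisfy(self, spec: ExternalLibSpec) -> bool:
        # return True after locating or installing a compatible library,
        # False if this provider cannot satisfy it on this system
        raise NotImplementedError

spec = ExternalLibSpec("libxyz.so.1", "xyz", "1.2.3", "2.12", 9)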

What’s the minimum amount of metadata that we can use to completely specify a library? I don’t know exactly. Anything that imposes a version constraint on some library provided by the external provider. I think what I posted above is a decent start on Linux. For MacOS, it may really be that platform tags as they are are good enough representations of the binary compatibility and MacOS version required at runtime. For Windows, Python has been completely matched to particular VS versions. The new VS is much, much better in this regard, but it would still be nice to have libraries represent their minimum runtime requirements.

My suggestion is: if conda is able to get feature parity with pip install/pip uninstall, then it would make sense for pip to simply refer people to conda instead of going around and mangling data. In this approach, the actual change to pip would just be to (a) recognize conda environments, and (b) add a few lines of code to print “hey this is a conda env, you should run conda <whatever>, it works better”.

Pip’s dependency management is all based on documented standards, and large parts of it are already available as standalone python libraries. So in my vision, you’re not calling out to pip to resolve dependencies, you’re natively parsing the wheel metadata and then feeding it into your existing dependency resolver.
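For instance, reading a wheel’s Requires-Dist entries needs nothing pip-specific: just the zip layout and the packaging library that pip itself vendors (the wheel filename below is a placeholder):

import zipfile
from email.parser import Parser
from packaging.requirements import Requirement

def wheel_requirements(wheel_path):
    with zipfile.ZipFile(wheel_path) as whl:
        # every wheel carries an RFC 822-style *.dist-info/METADATA file
        name = next(n for n in whl.namelist() if n.endswith(".dist-info/METADATA"))
        metadata = Parser().parsestr(whl.read(name).decode("utf-8"))
    return [Requirement(r) for r in metadata.get_all("Requires-Dist") or []]

for req in wheel_requirements("example-1.0-py3-none-any.whl"):
    print(req.name, req.specifier, req.marker)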

I don’t really know what “keeping their native metadata in line with what pip has changed” means, concretely. For all the package managers I know, this would just be “uh… this system is irrecoverably screwed, I guess we can make a note of that?”. Conda cares about interoperating with pip about two orders of magnitude more than any of these other systems do.

Why? What problems is this trying to solve, and why do you believe that it will solve them?

Like, I get it. The dream of feeding the right metadata into the right magical package management system and having everything work together seamlessly – that sounds amazing!

But… the reason manylinux doesn’t let you rely on the system openssl is because popular linux vendors have genuine disagreements about the openssl ABI, and C++ ABIs, and all this stuff. That’s the hard fact that puts some pretty strict limits on how close that dream can come to reality. The only ways to make binaries that work across systems are (a) shipping your own libraries, (b) building lots of vendor-specific wheels. It seems like the best you can hope for from this super complex research project is that instead of needing 8 different vendor-specific wheels, you manage to merge some together and get 5 different vendor-specific wheels instead. Or… you could use manylinux and ship one wheel and be done with it.

Conda-specific wheels are interesting because there’s the additional possibility that conda could seamlessly track mixed wheel/conda installs, upgrade both together, etc. And because there’s a lot of demand from conda users for this. Other vendors could do this too of course, but in practice I don’t think anyone else is interested.

I just don’t want you all to go off down the rabbit hole chasing a dream that sounds fabulous but is ultimately impossible. It’s really easy to waste a lot of time and resources on that kind of thing.

I don’t really like special-casing conda this way, but if that’s the best way here, I certainly think conda can and should add the PyPI metadata handling that you describe. The old issues of arbitrary code execution with setup.py from source-only distributions are still there, but wheels should be great (barring any incorrect dependency expressions).

For us on the conda side, the problems are just the bypass of the solver, which causes problems down the line. I think your proposed solution is fine for us.

For others who feel like they must publish wheels in order to provide their software, there’s a lot of frustration with manylinux and its lack of support for modern C++. One of my main hopes is to get metadata that directly represents the ways in which a given wheel is not compatible with a particular system. manylinux is a great one-stop shop when people actually follow the standard. There are many use cases (those around CUDA especially come to mind) that absolutely can’t follow the current manylinux options, either for technical reasons or for legal ones, but that isn’t stopping them from presenting their software on PyPI as manylinux. For the technical ones (generally, new C++ standard support), maybe the answer is newer manylinux images. Your perennial manylinux idea gets closer to something that lowers the bar to have new manylinux tags, but matching only glibc seems to ignore the C++ concerns (or at least not address them directly), and it may still have problems for more demanding software that needs a newer glibc than the manylinux team is prepared to provide a standard for.

The thing I don’t like about this idea is that it imposes more work on the packager and bifurcates the build process. What fraction of the scientific wheel ecosystem uses static linking to satisfy its compiled-library needs? The analogous conda-using packages would need to be built in a totally separate way to use conda-provided shared libraries instead. Is that worth people’s time? Maybe from the package consumer’s standpoint, but I don’t want to add that to the already often onerous task of packaging.


The detail I missed here is that you would probably use completely different toolchains to compile for conda than you would for manylinux. You may use very different compiler flags, too. It might make more sense to take an existing conda package and turn it into a wheel than trying to impose new builds on current wheel builders. That opens questions about who is doing the build, but those are social questions, not technical ones for this discussion.

Regarding the publishing of “not really manylinux” wheels as manylinux on PyPI: when manylinux was rolled out initially, it was assumed that folks would be good citizens and not lie.

Since that’s not the case, there’s a feature for PyPI under discussion on the issue tracker to disallow uploading such wheels. To be clear, such wheels are definitely something we do not want to allow. IIUC, the current blocker is someone implementing this and figuring out the timelines.

https://github.com/pypa/warehouse/issues/5420


People discussed external dependencies during the packaging minisummit at PyCon North America in May 2019. Moving notes by @btskinn and @crwilcox here for easier searchability.

@msarahan championed this discussion:

Expression of dependencies on externally provided software (things that pip does not/should not provide). Metadata encompassing binary compatibility that may be required for these expressions.

[note from @tgamblin : "FWIW, Spack supports external dependencies, and has a fairly lightweight way for users to specify them (spec + path)

https://spack.readthedocs.io/en/latest/build_settings.html#external-packages

We do not auto-detect them (yet). We’ll likely tackle auto-detecting build dependencies (binaries) before we tackle libraries, where ABI issues come into play."]

What metadata can we add to increase understanding of compatibility?

a. Make metadata more direct
b. Check for libs rather than just provide them
c. manylinux1 does some things that move toward this
d. We don’t really consider different workflows separately. Could improve docs, guiding users to other tools, rather than defaulting to one tool that isn’t the right fit.
e. Can we design standards/interchange formats for interoperating between the tools? Should PyPA provide a standard?
f. A key goal is to avoid lock-in to PyPA tools
g. Tools for people who don’t want to rely on an external package manager for provisioning Pythons

  • I.e., yum, apt, brew, conda

h. Need to ensure appropriate bounds are placed on pip’s scope

Draft PEP (@ncoghlan)

  • Challenge is expressing dependencies in a way to expose actual runtime dependencies

    • Particular dependencies of interest are (1) commands on PATH and (2) dynamic libraries on, e.g., LD_LIBRARY_PATH
  • For practical usage, automatic generation is likely needed → usually not human-readable

  • Developers don’t explicitly know these.

  • auditwheel may generate these?

pip as user of the metadata spec

  • Once the metadata spec is written, pip could likely check for these dependencies, and fail if they’re not met

  • Unlikely pip could be made to meet (search-and-find) any missing dependencies

  • Where do we get the mapping(s) from dependency names ←→ platform-localized package names? → step 1: spec language

How hard would it be to have conda-forge provide its metadata?

  • @scopatz: not too hard and could be done. We can make this simpler by providing this for tools. PyPI can provide metadata. Conda even has this already, and it may be repurposable.

Are there other sources of inspiration/shared functionality?

  • Ansible
  • Chef (resources and providers)
  • End users are able to extend the list of providers should they be using a distribution or environment where the default providers do not match the running system

What about platforms other than Linux?

  • (Nick) Windows and MacOS are likely to be somewhat easier – much stricter ABIs, with bundling → less variation to cope with

Is there any way to shrink the scope of the problem?

  • Advanced devs usually can know what to do to fix a dependency problem based on a typical error message (‘command not found’ or ‘libxyz.so not found’).
  • The key thing is(?) to make it clearer to inexperienced users what the problem is -- capture the ‘exception’ and provide a better message

Should there be a focus on one type of dependency?

  • I.e., missing commands vs missing libraries
  • Probably not: Seems likely that solving one type will provide machinery to make solving the other relatively easy

Actions:

  • Draft a PEP (or update existing) to define a spec
  • First crack at integration between PyPI/pip + conda-forge
  • Start a thread for this topic to form a working group?

Tennessee Leeuwenburg’s draft PEP mentioned in the above notes on external dependencies: https://github.com/pypa/interoperability-peps/pull/30


Great summary, Sumanah. Would it make sense to pin this post or turn it into a wiki?
