Drawing a line to the scope of Python packaging

pf_moore · April 10, 2019, 6:51pm

That sounds right to me. It’s quite possible that one of the initial guides on packaging.python.org could give a more complete and even discussion of the various options available. However:

Users in my experience don’t want unbiased lists of options, they want opinionated guides.
Nobody reads guides when they start off anyway, so you’ll probably still be getting people who’ve made an initial decision.
We’ll always need to ship some packaging tool with Python (if only to avoid the “how to install the packaging tool” problem), so that tool (whether it’s pip or some successor) will always have an advantage as the easiest option to start with.

Nevertheless, better explanation of the available options and their costs and benefits would definitely be good.

teoliphant · April 10, 2019, 9:25pm

It is very unlikely for Anaconda to do this as this would confuse its customers and users. However, conda-forge could create a Python-installer that had conda-forge as the default. That would be cool to see.

teoliphant · April 10, 2019, 9:28pm

I completely agree with this framing as well. @msarahan and @pf_moore describe what is needed well.

teoliphant · April 10, 2019, 10:02pm

Thank you @steve.dower. This is interesting feedback and in line with what I have seen as well. The fact that “pip” is the easiest thing for the new person to reach for means that all they think about is using pip for install. This invariably means that pip will be pressured to be a general-purpose packaging solution (it will slouch towards it based on what appears to be just following what users want).

The problem is that there is a space for a user-level, cross-language package manager like conda and there always will be. This channel is about drawing a line to the scope of Python packaging. Will pip be used to package and install Python itself? Will pip be used to package and install Java? Julia? I believe the answer should be no.

Then, if that is the case, because Python is used to “glue together” so many other languages there must be language at packaging.python.org that helps people understand that you should not expect ‘pip install’ to be the only way to install every Python package. Perhaps it can be used to install the ‘python-parts’ of the package but some of the things that must be installed for the solution to work should be installed by other package managers.

If we can agree on that framing (or something similar and better articulated), then we can have a conversation. Right now, what I see is that people are making “pip installable” things that make it much more difficult to actually provide a working and reproducible environment using tools that were built for that purpose. I’m not sure why people are doing that rather than build packages using tools that let them install them — other than the branding of packaging.python.org and its apparent message that everything should be “pip installable”

That won’t be able to provide what a user will expect until pip install can also install every other run-time that Python solutions glue together. For example, think about the pip install pyspark that happens right now. What does it do? Does it install Java (which is necessary for it to work)? It doesn’t as far as I can tell. Other Python packages are like this too and should be like this (they need previously installed things in order to work).

Is there a mechanism for pip to check for these previously installed things and raise an error or warning if they aren’t there? Perhaps that is a feature that could be added which would also implicitly help people understand the scope of “pip install”. All I’m suggesting is to do that plus a bit of modification to packaging.python.org in order to point to the efforts of other communities like NumFOCUS and conda-forge that are solving the general-purpose install problem.
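
As a rough illustration of the kind of check being asked about, here is a minimal sketch. It is not something pip does today: the function name `missing_executables` and the idea that a package would declare its required executables as a plain list are assumptions made up for this example. Only `shutil.which` is a real standard-library call.

```python
import shutil
from typing import Iterable, List


def missing_executables(required: Iterable[str]) -> List[str]:
    """Return the names in `required` that cannot be found on PATH."""
    # shutil.which returns None when no matching executable is found.
    return [name for name in required if shutil.which(name) is None]
```

For the pyspark case, an installer could call `missing_executables(["java"])` and warn if the result is non-empty. A real mechanism would also need version checks and a way to declare the requirement in package metadata, which this sketch leaves out.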

Thanks for the feedback and help understanding other points of view. And just in case it’s not clear, I’m incredibly impressed and grateful for all the hard-work that goes into the open-source and community-centric solutions that you are all providing. Please just take my recommendations as a particular point of view from the trenches.

Thanks,

-Travis

brettcannon · April 10, 2019, 10:44pm

Or to tie in the analogy that @steve.dower brought up, conda works as a view on top of package versions to guarantee they all work together. So if you installed stuff with pip before that conda install tensorflow then you would break assumptions conda makes about controlling all dependency versions to make sure they work with tensorflow (e.g. conda’s tensorflow might be tied to a different version of numpy than the one you installed, or than any version available on PyPI).

teoliphant · April 11, 2019, 5:01am

I had to wait until I could post a response to this because of the 3-replies limit that the forum puts in place.

Yeah, I think that’s right. More complete and holistic messaging on packaging.python.org would be best. It doesn’t even have to be conda specific, but it could be one of the package managers mentioned.

And then, I’d like to see the conda-community and PyPA talk more in general. But that is mostly on the conda community at this point.

Thanks for pointing this out. I’m sorry if it came off more strongly than I meant it. Certainly it was not intended to be confrontational. It is true that I was challenging some assumptions people have (but I also welcome people challenging my assumptions). To be clear, it’s not just conda-forge that people can contribute to (brew, nixOS, chocolatey, apt-get, rpm all have packaging communities that would help).

All of my comments, though, are meant in a spirit of conversation and collaboration. I really apologize if it didn’t come off that way. I also recognize people will ultimately have different use-cases and needs and therefore different results. This can be the beauty and robustness of community.

I also believe the PyPA has done an incredible job of improving things in the Python community. I only emphasize that it should continue to be very careful about defining standards and limiting the scope of its standards — especially when there are already other solutions to the problems being solved.

I recognize this is hard for a volunteer community because it is easier to recruit volunteers to do things they like to do (like write code that solves problems they specifically have or someone they know has). Intentionally gathering feedback from people you aren’t hearing from and integrating roadmaps based on personas and as many stake-holders as you can as well as existing technology is what product managers typically try to do. I’m supportive of efforts to fund product managers for open-source communities.

brettcannon:

How can we move forward?

The problem I’m hearing is this thread has become a “conda versus pip” discussion and that has never turned out well no matter how many times I’ve heard it. Both tools have benefits and drawbacks and neither solves everyone’s problems perfectly (and this is speaking as someone who manages a team who has to support both tools in a code editor and so I see issues both beginners and advanced users have with both tools).

I personally think that the only way we will ever move past this issue is to get the stakeholders in a room and have a discussion about how we would want packaging to work if we were to all start from scratch based on what we know now. How do we layer it, handle external dependencies, etc.? Then we can talk about how to move the community towards that idealized goal (and I believe there are plans to have such a discussion at PyCon US this year).

But from my view, arguing either side should move to the other isn’t going to get us anywhere.

I definitely agree that this is not about an “us vs them” and if I sound like that, I am sorry. Both conda and pip have their uses and while there is overlap they cannot replace each other. In fact, I don’t think they should, but hope that the PyPA understands that little by little pip will need to become a general purpose package manager (thereby enabling people to replace conda entirely) unless it limits its scope.

My repeated suggestion is that some people currently using pip because that is what they are told to use by the PyPA would be better served by using a general-purpose packaging solution like conda (or spack or brew or yum or apt-get or …) and that is a useful thing for the PyPA to acknowledge on packaging.python.org.

Thanks for the feedback.

pf_moore · April 11, 2019, 8:41am

I think the answer to that is clearly and self-evidently “no”. I don’t think anyone imagines otherwise - although it’s possible that not everyone draws the same conclusion from that inference that you’re suggesting.

I think that’s a fair suggestion in isolation. But it still avoids the question of how tools work together. If I’ve installed Python using the system package manager (on Windows, the python.org installer) and then used pipenv to set up my application development workflow, and as a consequence used pip to install several packages, and I now want to install something that “should be installed by another package manager” (which may be conda, but could be something else that sits in the same space as conda), we’re currently in a position where that’s not possible, and the developer has to unwind all the way back to installing Python with conda, and then looking for a conda-compatible equivalent to pipenv for their workflow. My contention is that doing so isn’t a practical option for the majority of people, and so we have 2 options:

Tell those people that they can’t use the package that they are trying to use.
Offer an option to use that package via pip (or any other toolchain that does work without that “rewind”) - probably with caveats that there may be integration issues, and those will be down to the user to address for themselves as their chosen toolchain doesn’t have the means to manage the issues automatically.

The problem I have is that at the moment we’re talking about helping people to make informed decisions right at the start of their projects, but glossing over the fact that the technical limitations of the tools mean that we’re expecting them to make difficult-to-change decisions before they have the information needed (specifically, what packages will they be using) to actually make those decisions.

IMO, advising new users to start with pip is only a problem if we don’t have a gradual migration path to tools that address more advanced issues. And the struggle with conda is that it doesn’t have that gradual migration path - so it looks like there’s a bias against conda, when in actual fact all there is, is an acknowledgement that new users may not need something as powerful as conda yet.

I’m afraid I think this is an issue that the conda community really need to solve themselves - how to provide a more gradual migration path for existing pip/pipenv/poetry/etc users. Once that path exists, I think that documenting and promoting it would be very easy to integrate into packaging.python.org.

brettcannon · April 11, 2019, 8:49pm

3 posts were split to a new topic: How to help people migrating from pip to conda?

jdemeyer · April 12, 2019, 1:17pm

Isn’t that the same with pip? It can also happen (without conda) that pip install --upgrade numpy upgrades numpy to a version that is no longer compatible with the version of scipy you had.

uranusjr · April 12, 2019, 1:34pm

The difference is pip actually knows something is broken, only chooses not to stop you (only emits a warning). I believe backwards compatibility is one of the motivations behind this behaviour. Conda cannot do the same due to lack of metadata compatibility.

steve.dower · April 12, 2019, 2:43pm

I think it’s more that conda knows it’s broken and stops you, while pip has no idea because the specific build isn’t pinned, only a version range. (But every time this comes up it gets split into its own thread, so I’m not going to say any more - go read one of the other threads.)

teoliphant · April 12, 2019, 11:03pm

You make a lot of good points. I could see that if python.org and packaging.python.org made it clear that people were choosing a particular approach to installing python when they download python.org and use only pip, that would be helpful.

I’m not sure it would convince all the package authors to not just tell people to “pip install” but perhaps what I should do is spend time convincing the “other ways to download and install Python” to also override “pip install” — given how prominent the notion of “pip install” is in every instruction set.

Although, even as I write it, I remember why we didn’t do that with conda (i.e. override pip): given that pip can be used for many workflows, I know it would be a recipe for maintenance nightmares. Of course, I suppose the replaced pip could just override the “install” command.

But, how do people feel about the idea I’ve heard Nick and others promote of using “python -m pip install” as the proper spelling of “install this package”?
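
For anyone unfamiliar with that spelling, a minimal sketch of what it buys you when scripted: the command is tied to a specific interpreter instead of whichever `pip` happens to be first on PATH. The helper name `pip_install_command` is made up for this example.

```python
import sys
from typing import List


def pip_install_command(package: str) -> List[str]:
    """Build an install command bound to the running interpreter."""
    # sys.executable is the interpreter running this code, so "-m pip"
    # resolves to the pip installed for that same interpreter.
    return [sys.executable, "-m", "pip", "install", package]
```

The resulting list can be passed to `subprocess.check_call`; the sketch only builds the command and does not install anything.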

brettcannon · April 12, 2019, 11:34pm

“… because it’s better than pip install”? Or “and instead use ‘install this package’ to not over-promote pip”?

njs · April 12, 2019, 11:59pm

Please don’t. Ubuntu/Debian make some small tweaks to their version of pip compared to upstream, and it causes substantial confusion and problems, because users don’t know which version they’re using and when things go wrong no-one knows what’s going on or how to help. Replacing pip install entirely would be 10x worse.

Maybe it would be viable to have pip refuse to install into an environment that it knows is managed by some external package manager (like /usr on Linux, or a conda environment) unless some explicit override flag is passed?
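
A minimal sketch of how such a guard could work for the conda case, assuming the presence of a `conda-meta` directory under the environment prefix is a good enough signal. The function names and the `override` parameter are hypothetical and do not correspond to any existing pip option.

```python
import os


def looks_like_conda_env(prefix: str) -> bool:
    """Heuristic: conda environments contain a `conda-meta` directory."""
    return os.path.isdir(os.path.join(prefix, "conda-meta"))


def should_refuse_install(prefix: str, override: bool = False) -> bool:
    """Refuse to install unless the user explicitly overrides (hypothetical policy)."""
    return looks_like_conda_env(prefix) and not override
```

An installer would call `should_refuse_install(sys.prefix)` before touching the environment. Detecting a system-managed prefix like /usr would need a different signal and is not covered here.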

Convincing package authors to change their instructions is a somewhat separate issue – the folks in this thread don’t have any direct control over what package authors put in their install instructions. And package authors are one of the audiences for whom pip has some substantial advantages over conda. Users who use pip install get exactly the package that the package author uploaded. As soon as they make a release, pip users immediately have access to their package. If users have problems, the package author can help them. If you want the users to try out a pre-release to see if it fixes their problem, that’s a heck of a lot easier when the user is using pip. If you add conda as an intermediary, this will in many cases make things better for the user, no question, but the benefits to the authors are much smaller and often negative. So… if you want to convince them to change their instructions, you might need to figure out how to fix that.

bernatgabor · April 15, 2019, 8:54am

I think we do. If we specify guidelines on https://packaging.python.org/ with a detailed explanation of why we suggest a given way I think package maintainers will adopt and follow.

Very dangerous because it would be backwards incompatible. For example, installing under this from within docker is fine.

I would champion pip having a mode to install some stuff as pipx does (virtualenv/isolated/user level). This way all Python tools (black/flake8/cookiecutter/etc) should be recommended to be installed under this mode, preserving the sanity of the global site-packages.

msarahan · April 24, 2019, 1:06am

How does pip track compatibility with software such as scikit-learn that is built against the numpy C API? Scikit-learn is nominally compatible back to numpy 1.8 (which is amazing), but in actuality, it works out to be whatever numpy version is used at compile time as a baseline. The compatibility of a package’s dependencies is often more complicated than just the python side of the story. Conda’s constraints work the same as pip’s, in terms of being a name/version range generally, but I think the (data science/scientific) community is more used to considering binary compatibility in expressing their constraints. We have been guided especially by the excellent site at ABI Tracker: Tested libraries.
This consideration is critically important for the data science community, where compiled code tends to be more common. The scikit-learn developers hide this complexity from users by being careful to always compile against old numpy versions, but a new package contributor could easily miss this subtlety and claim compatibility where there isn’t actually compatibility in practice.

It is inaccurate to say that only pip “knows” something is broken. Conda can read in pip-installed metadata and act on it. This was added in conda 4.6 (January 2019). It can’t directly read metadata from PyPI (yet?). Both conda and pip (and probably other package managers) know that some existing env is broken based on the same metadata, and conda has a bit more metadata for the lower-level packages that pip doesn’t currently express. I trust that pip’s solver, when implemented, will greatly improve how pip recognizes, prevents, and otherwise deals with brokenness.

I think that conda packages of python packages include enough standard metadata for pip to understand them natively, but that doesn’t include the conda-only metadata. It would be nice (but really not reasonable) for pip to help manage conda’s metadata in the same way that conda manages pip’s metadata. I say unreasonable because it’s definitely out of scope for pip, and not scalable to generalize to all other potential external sources of metadata. Pip operates with conda in the same way that pip operates within an operating system. Perhaps there should be a way that package managers can provide plugins for pip, such that pip could just call some hook, and any registered package managers for a given env/space could proceed to adjust their own metadata accordingly to match pip’s changes.
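
To make the plugin idea a bit more concrete, here is a minimal sketch of a hook registry. Everything in it is hypothetical: pip has no such hook API, and the names `register_hook` and `notify_managers` and the hook signature are invented for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical hook signature: receives the environment prefix and the
# names of the distributions that were just changed.
Hook = Callable[[str, List[str]], None]

_hooks: Dict[str, Hook] = {}


def register_hook(manager: str, hook: Hook) -> None:
    """Register a callback for an external package manager."""
    _hooks[manager] = hook


def notify_managers(prefix: str, changed: List[str]) -> List[str]:
    """Call every registered hook; return the managers that were notified."""
    notified = []
    for manager, hook in _hooks.items():
        hook(prefix, changed)
        notified.append(manager)
    return notified
```

The point of the design is that the installer only knows about the hook signature, while each external manager stays responsible for updating its own metadata.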

As much as possible, pip should not do things that make it impossible for other package managers controlling the same space to be consistent/correct. In other words, introducing packages that have conflicting constraints imposes an impossible problem on the external package managers. Once an environment is inconsistent, things start getting really strange and broken. This isn’t news to anyone, but if pip knows about creating inconsistencies, there really should be a way to make preventing inconsistencies the default behavior. People still need to be able to force inconsistencies, because sometimes dependency metadata is bad. There needs to be ways to fix it. We “hotfix” our index. I don’t know what the right answer might be for PyPI. Once bad metadata (e.g. an overly broad constraint) is available to a solver, it can be very hard to get sane answers without either altering metadata or removing problem packages.

I really don’t want to get into “conda this, pip that.” Metadata is key to all of us. The conda and pip (and spack and yum and apt and…) communities would both benefit from sharing better dependency data. I think this might be part of what Tidelift is trying to do. The metadata that I hope we can discuss at PyCon specifically is metadata that fully expresses binary dependencies. Conda does so only indirectly right now (standardizing on toolchains and effectively encoding this information into version constraints). I see platform tags as another indirect way to lay out compatibility in the same way. Any notion of external library dependencies in PyPI packages needs a reliable way to know what package provides the necessary dependency (yum whatprovides), and also a way to know that the necessary dependency is compatible with a specific compiled binary. Can we get to a finer-grained view of metadata that lets us understand that a pip package’s compiled extension needs xyz 1.2.3, which can be satisfied by a package on CentOS 6 or with conda, but not on CentOS 5 because a glibc symbol is missing, and not with Ubuntu 16.04 or Fedora 19 system libraries because a specific C++ ABI was used?

dstufft · April 24, 2019, 2:24am

I would expect the resulting wheel to have a requirement on >=$NUMPY_I_BUILT_AGAINST. There’s no requirement that the build depends and install depends have the same version range.

msarahan · April 24, 2019, 3:40am

Yes. Is there an expression in setup.py or requirements.txt for that? Does pip know about this in creating wheels? Should it? What would a solver do when presented with a build requirement of >=1.8? Your post from ages ago on “setup.py vs requirements.txt” (setup.py vs requirements.txt · caremad) was and is excellent, but an awful lot of people still only provide requirements.txt anyway. On top of that, it appears that only pyproject.toml supports build requirements, too? No way to specify build-time dependencies of setup.py · Issue #2381 · pypa/pip · GitHub - I suppose several projects have found hacks to achieve it with setup.py. The problem is not one of “are there ways to do this right?” but one of “how easy is it to do it the right way?” where the “right way” is defined as packager doesn’t give up in frustration before the package is built, the package “just works” wherever possible, and provides meaningful feedback about why it won’t work otherwise.

Are there established community guidelines for how to understand and correctly handle situations where binary compatibility comes into play? On the build side? On the user install side?

These are all things that can be done with any tool in any ecosystem, and I don’t really mean to say “conda’s better because this” - I mean to point out particular workflows that require extra attention, and which would benefit from additional metadata.

njs · April 24, 2019, 4:39am

For the numpy case specifically, the correct thing to do is for people who use the binary ABI to put something like this in their setup.py:

import numpy as np

# Lower bound: the major.minor of the numpy this extension is built against.
install_requires=[
    "numpy >= " + ".".join(np.__version__.split(".")[:2]),
    ...
]

This is sort of silly – numpy ought to provide a helper, so you could write install_requires=[np.get_abi_version_constraint(), ...] or something. And I don’t think anyone actually does this correctly right now.
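
As a sketch of what such a helper might compute, here is the same string manipulation pulled into a standalone function. The name `numpy_abi_floor` is made up for this example and is not a numpy API; it takes the version string as an argument so it can be tried without numpy installed.

```python
def numpy_abi_floor(build_version: str) -> str:
    """Turn the numpy version used at build time into an install requirement."""
    # Keep only major.minor, mirroring the setup.py snippet above.
    major_minor = ".".join(build_version.split(".")[:2])
    return "numpy >= " + major_minor
```

In a setup.py this would be called as `numpy_abi_floor(np.__version__)`, so a build against numpy 1.16.2 yields the requirement `numpy >= 1.16`.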

But the actual constraints are totally specifiable in the package metadata – the issues are just that we don’t make it easy to generate the correct metadata. So it’s a build system problem, not a package format problem.

In practice this almost never bites people, because either they’re using a pre-compiled wheel from PyPI that was carefully built to have the correct install-requires metadata, or else they’re building locally against whatever version of numpy they’re using, and people rarely downgrade numpy. It’d be nice to fix but it’s not currently a big pain point AFAICT.

Now, part of the reason why this works, is because numpy has intentionally designed its binary compatibility guarantees to make it work. If packaging metadata was more flexible/powerful, then numpy would have more flexibility about how to evolve its ABI. At some point I came up with an idea to handle this, by allowing what Debian calls “virtual packages” – so e.g. numpy 1.16 might say “I can also fulfill requirements for the packages ‘numpy[abi-1]’ and ‘numpy[abi-2]’”, and then packages built against numpy would declare that they required ‘numpy[abi-2]’ to be available, and numpy could potentially drop support for old ABIs over time. There are probably other ways too.
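
A minimal sketch of the “virtual packages” lookup, assuming a made-up metadata table in which each distribution lists the extra names it can fulfill. Nothing like this exists in current Python packaging metadata; the `PROVIDES` table and the `providers` function are purely illustrative.

```python
from typing import Dict, List, Set

# Hypothetical metadata: each distribution lists the virtual names it
# can stand in for, in addition to its own name.
PROVIDES: Dict[str, Set[str]] = {
    "numpy 1.16": {"numpy[abi-1]", "numpy[abi-2]"},
}


def providers(requirement: str, provides: Dict[str, Set[str]]) -> List[str]:
    """Return the distributions whose provided names include `requirement`."""
    return sorted(dist for dist, names in provides.items() if requirement in names)
```

A package built against the second ABI would require `numpy[abi-2]`, and a later numpy release could drop `numpy[abi-1]` from its list without breaking that package.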

Practically speaking, I suspect it will be easier to teach conda how to download and install wheels from PyPI natively, and make pip say “hey, this looks like a conda env, you should install wheels using conda”, than it will be for pip to figure out how to natively manipulate metadata for (conda, apt, rpm, nix, apk, homebrew, …).

This sounds like some really important experience, and I’d love to hear more about it. Ideally before pip’s resolver is implemented and we get to rediscover the issues from scratch :-). If you’re up for it, maybe you could start a new thread to share some war stories?

I’ve always pushed back against this, because inventing a tool that can compute whether two arbitrary binaries are ABI-compatible on the fly is a major research project. And even if you had it, it’s not at all clear to me that it would help anyway – even if you know that there might be some library somewhere in your package repo that has the symbols this wheel wants, how can you find that binary? The end goal is to compile these abstract ABI constraints into some expression in the underlying system’s package language, that rpm or conda or whatever can understand… but they don’t work like that; they want package names and version constraints.

So IMO the simpler and more useful approach is to make it possible for a wheel to say: “here’s a list of conda packages and version constraints that this wheel needs”. Of course now that wheel only works on conda, but on the other hand… it will work on conda! No open-ended research project required.

jdemeyer · April 24, 2019, 7:38am

This idea is also being discussed in more detail at Support for build-and-run-time dependencies - #37 by jdemeyer (@njs: no I’m not trying to hijack this thread, this is a good-faith attempt to move the ABI-compatibility discussion into one place).