Wanting a singular packaging tool/vision

Plenty of GNU/Linux distributions preinstall packages of Python libraries, and many have even written some of their required system tools in Python, so they can't realistically uninstall the Python libraries those tools rely on. This is a big part of the reasoning behind recent features to mark environments as externally managed, or to require pip to install into a venv, in order to avoid such conflicts.
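For the record, the "externally managed" mechanism here is PEP 668: a marker file named EXTERNALLY-MANAGED placed in the interpreter's stdlib sysconfig directory, which installers like pip check before touching the environment. A minimal sketch of that check (the printed messages are just illustrative):

    # Minimal sketch of the PEP 668 check: an environment is "externally
    # managed" if an EXTERNALLY-MANAGED file exists in the stdlib directory.
    import os
    import sysconfig

    marker = os.path.join(sysconfig.get_path("stdlib"), "EXTERNALLY-MANAGED")
    if os.path.exists(marker):
        print("Externally managed environment; pip will refuse to install here")
    else:
        print("No marker; installs into this environment are allowed")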

Yes, users often have unrealistic expectations, and the best thing anyone can do for them is to provide clear error messages rejecting their attempts to act on them.

But that’s the key point - there are two tools managing site-packages, just like for a Linux distro. Which is what @pzwang said, and what I agreed with. But @rgommers disagreed…

Oh, and I really hope this is of use to people other than me! If everyone else is clear and I’m just being dense, please just tell me :slightly_smiling_face:

1 Like

Yeah, I didn’t want to go too deep into this part, since it really is the topic and not the background. But now this is a separate post…

I think this is where the issue just becomes worse for Conda compared to distros, in part because of the scenarios, but also because the Linux distros have partially mitigated things.

Basically, it starts off worse because Conda’s scenarios are “packages that are hard to compile in isolation”, while Linux distros tend to be “packages that we need for other tools in our repo”. Notably, the latter implies that users shouldn’t know or care about those packages,[1] whereas Conda is providing packages specifically to satisfy a user’s request - it has a stronger obligation to provide everything. If Conda’s users were more “just give me a working environment and I don’t care what’s in there” then they’d be less demanding, and would run into fewer issues trying to force things into a particular shape.

The other aspect is that Linux distros already separate out their own packages from user-managed ones, whether with user site packages, dist-packages, or some other way. They then have ways to ignore the ones the user provided (e.g. python -s to ignore user site packages) when running their apps that rely solely on packages provided by their repository. (Personally I’d have set this up with a dedicated directory for these packages that is not on sys.path by default, and manually add it in when running a script, e.g. with PYTHONPATH, or some directory relative to the script that is automatically detected by the runtime.)
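A rough sketch of that last idea (the directory path is made up purely for illustration):

    # Hypothetical launcher along the lines suggested above: the distro's own
    # packages live in a directory that is NOT on sys.path by default, and the
    # launcher adds it explicitly so user-installed packages can't interfere.
    import sys

    DISTRO_PKG_DIR = "/usr/lib/distro-python"  # made-up path for illustration

    if DISTRO_PKG_DIR not in sys.path:
        # Prepend so the distro-pinned versions always win for this tool.
        sys.path.insert(0, DISTRO_PKG_DIR)

    # ...then import and run the distro-provided tool as normal.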

Conda is not this scenario at all. All of its packages are for the user to use, so there’s no sensible separation of concerns. Anything installed by Conda somehow needs to merge with anything installed by pip to form a single environment providing everything the user needs. (Maybe this is going back too far, but if you recall bdist_wininst, then anything installed by one of those needed to merge with anything added by easy_install in much the same way.)

Right now, to my knowledge, there’s no equivalent scenario out there. Conda+pip is quite unique, because nobody else is really trying to automatically merge two different package repositories like this. The standard is very much to use a single tool for a single repository for each “project”/“environment”, because this is way more manageable. Conda-Forge makes this way more feasible for many Conda users, because now they have a repository that is way more likely to satisfy the packages they need without having to resort to merging two different repositories.[2]

What I’ve not seen written down before is an actual spec from the Conda side of how they would like pip to behave in the presence of conflicts when merging repositories. I’m fairly certain such a spec could be implemented without any actual special cases for Conda,[3] but without spelling out what behaviour they want, none of us have a chance of providing it. (And it gives something concrete for devs to argue over and push back on.)

Even more importantly, without that spec, none of us have a chance of communicating to users what to expect. Some users expect to “conda create” an environment and have pip handle everything, while others expect conda to handle everything. There’s no clear line, which means there will always be users whose expectations cannot be met, and so there’ll be complaints.

tl;dr: pip/Conda is the hardest interop problem we - and likely anyone - faces in packaging, in large part because nobody has defined how they should interact.


  1. Unless the user is a developer who’s building a package that could go into the repo, in which case they’ve already opted into using the dependencies that are available. Hence, they won’t ever complain. ↩︎

  2. Conda knows how to merge its own repositories, because that’s built into the system. Which is why I refer to it as a single repository. Though if you’re careless (or more likely, if the publishers are), you can get into the same issues here! ↩︎

  3. e.g. “pip should not uninstall anything Conda installed” is implemented by not providing a RECORD file. “pip’s dependency resolution should be constrained by any existing packages that it can’t uninstall” might need a change on pip’s side, but it’s not specific to Conda. ↩︎
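For the curious, the RECORD/INSTALLER mechanism mentioned in footnote 3 is easy to inspect from the standard library; a minimal sketch:

    # Minimal sketch of the mechanism in footnote 3: list installed
    # distributions and show whether they ship a RECORD file (which pip needs
    # in order to uninstall them) and what their INSTALLER file claims.
    from importlib.metadata import distributions

    for dist in distributions():
        name = dist.metadata["Name"]
        has_record = dist.read_text("RECORD") is not None
        installer = (dist.read_text("INSTALLER") or "unknown").strip()
        print(f"{name}: installer={installer!r}, RECORD present={has_record}")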

9 Likes

I don’t think Conda is unique here; it only comes up more because it’s the most popular tool with this set of capabilities. Spack is very similar to Conda in these respects, and is getting more popular even outside of HPC. And you can also get this from Nix (more “foreign”, but there’s for example flox, which tries to make Nix more approachable - and they just raised a lot of venture funding, so it has at least the potential to become more prominent).

Steve’s reply was hopefully helpful. I’m not sure it can get more useful than either such a reply or my analogies like pyenv-like and pip-like.

Great point - this would be super helpful to have. I indeed don’t think it exists right now. I’ll try to get something in written form from the Conda community. It may take a while, because it’s not an easy question, and I’m sure that there will be different opinions to take into account.

6 Likes

I’m quoting you here but just as a jumping-off point for some thoughts, so don’t take this as directed at you specifically. :slight_smile:

It’s a great question and hopefully we can get some answers from the conda devs. However, I’d like to back up a bit and ask the more general question: in our envisioned future utopia, why would someone use one package manager to install packages in an environment managed by a different manager? In other words I’m less interested in details of interop between the tools we have now, and more interested in what kind of tools we would want to have and how they would interop.

Right now, this question is rather ill-posed, because the problem is that conda is an environment manager and pip is not. So the two are not even the same sort of beast, which is part of what leads to confusion for users.

If we set this aside, right now as far as I can tell, the main (only?) reason people would use Tool X to install packages in Tool Y’s environment is that Tool X has access to a larger repository of packages, and in particular it has a package that can’t be gotten with Tool Y. In our current world this means “I used pip because the package I wanted wasn’t available with conda”. But this has nothing to do with the technical features of the tools; it’s just that PyPI has the widest name recognition and the most well-worn paths and most common tutorials and whatnot explaining how to use it.

As far as I can tell, there are no in-principle reasons why every PyPI package could not be made conda-installable (or at least why every “reasonably sane” PyPI package couldn’t be).[1] There are some practical reasons, a big one being that different names for the same package need to be reconciled. (Some discussion of the question can be found on this old conda issue.) But if there were a team of dogged laborers who just manually created conda packages for every PyPI package, then there would no longer be any need for anyone to use pip to install packages.

The reverse is not true, because pip is not equipped to deal with non-Python dependencies, nor can it manage the version of Python itself as part of the environment.

Pushing this forward into the utopian future, what this means to me is that in that future there would not really be any need for interop between a pip-like tool and a conda-like tool, because there is no need for a pip-like tool to exist. If every package anyone wants can be installed with an environment manager that can also manage non-Python dependencies, there would be no need to use a tool that cannot do that.

I’m not saying this to advocate for conda in our current world. Rather, my point is that what “a singular packaging tool/vision” means to me is, well, a vision — a vision of what we want packaging tools to be like rather than just a discussion of the specific issues we face with our current tools.

So on this specific matter, to me the lesson is: we should not double down on making the default, official, most-publicized tool (with the largest, most official, most-publicized repository) be one that is not also the most full-featured one. I am not sure that in the grand unified future there even needs to be a tool that does the particular subset of things pip currently does. And if that is the case then we don’t need to worry at all about how something like pip interoperates with anything.


  1. By “reasonably sane” I mean to exclude things like packages with a setup.py that runs extensive code that does weird stuff that shouldn’t be done in an installer, or packages that refer to nonexistent dependencies, are otherwise irredeemably broken, etc. ↩︎

2 Likes

Indeed.

By far the biggest part is the integration work of ensuring things conform to the rules and standards of conda-forge (e.g. only depending on packages already in conda-forge, no vendored code, etc.), which go a long way toward keeping things tractable and stable.

In contrast, there’s no curation on PyPI, so people can push whatever they want, and the gaps between that and said rules in the conda world can be large.

To a degree, exactly this is happening at conda-forge/staged-recipes on GitHub (a place to submit conda recipes before they become fully fledged conda-forge feedstocks), where any PyPI package can be transformed in this way. It’s just a slow process because everything is volunteer-run, and it mostly depends on people converting the packages they’re interested in.

1 Like

Yes, and I think that makes conda-forge better for users. It is only a limited form of “curation”, because no one is manually picking and choosing packages that are or aren’t allowed in. It’s just that it’s gated on actually working, which... well, to me that seems like an advantage. :slight_smile:

Yes, which is great! But the “problem” is that everyone keeps pushing their stuff to PyPI as the default. I mean, as I said before, it’s not really a problem in our current world, but I think it is an obstacle to the utopian future, because again it means most energy is going towards pushing packages to a place where there are fewer capabilities and fewer guardrails to ensure that what end users get will work.

There’s also grayskull to automatically create Conda recipes from PyPI packages (which typically need little to no tweaking for most pure Python packages that aren’t doing anything crazy), and it even has an online version. And once the recipe is created and merged, the autotick bot keeps it updated whenever a new version is released on PyPI, with relatively minimal maintainer intervention unless something breaks.

That’s the ideal, but the autotick bot cannot detect dependency changes (even just of the version bounds), because we don’t have structured dependency metadata across python packages (another big topic, and unfortunately separate from the lockfile effort), so it generally needs – or at least strongly benefits from – human attention.

Yup, right: a central mapping of PyPI to Conda-Forge distribution package names (and to defaults/anaconda, Homebrew, MacPorts, Linux distros, etc.; adding import package names to that would also allow for automatic dependency installation), as @rgommers has proposed, would mostly solve this, if it existed. If some funding, attention or other community resources could be directed toward that as part of the whole packaging strategy discussions, it would certainly make a lot of things easier, both for users and for unified tools trying to navigate between different ecosystems.
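To make that concrete, here’s a rough, hypothetical sketch of what such a registry might look like, keyed by the PyPI name (the opencv-python -> opencv rename on conda-forge is a real, commonly cited example; the other entries are illustrative only):

    # Hypothetical shape of a cross-ecosystem name-mapping registry, keyed by
    # the upstream (PyPI) distribution name. Entries are illustrative.
    NAME_MAP = {
        "opencv-python": {
            "conda-forge": "opencv",      # real, commonly cited rename
            "import-names": ["cv2"],      # import package(s) the dist provides
        },
        "pillow": {
            "conda-forge": "pillow",
            "import-names": ["PIL"],
        },
    }

    def to_conda_forge(pypi_name):
        """Map a PyPI name to a conda-forge name, falling back to the same name."""
        return NAME_MAP.get(pypi_name.lower(), {}).get("conda-forge", pypi_name.lower())

    print(to_conda_forge("opencv-python"))  # -> opencv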

Assuming something like this gets set up, what would be the ongoing workflow? If I, as a package developer, want to create a new package my-lib, do I still just register the name on PyPI? Who decides on (and adds) the names for the other systems?

I assume it would be as now, with the packagers for those systems choosing their names and linking them to my name in the registry. Is that correct?

1 Like

It’s not just the name mapping (though that plays a role as well), but the fact that Python projects may specify their dependencies in a setup.py that’s completely opaque from the POV of any artefact (e.g. wheel metadata), unless and until that setup code is actually executed…

I can’t speak for Ralf and I couldn’t find his more detailed posts on the topic, but yup, IIRC, with the upstream (i.e. PyPI) distribution package name being the canonical key - I don’t really see another practical way it could work. Presumably, instead of everything being manually updated in the registry by individuals, it would scrape the existing downstream config files (Conda recipes, Fedora spec files, Homebrew formulae, etc.), with the remaining gaps being filled either by enhanced downstream metadata (ideally) or, as a stopgap, by overrides on the registry side.

EDIT: Rewrote some parts here for clarity, better organization and refined language

Yes, for source trees still using setup.py instead of more modern formats, but the situation for source and built artifacts (i.e. what autotick is primarily concerned with) is better, and would be a lot better still except that multiple accepted standards have unfortunately been waiting years for implementation in PyPI/Warehouse (which would seem to me a far more positively impactful use of limited funds than paying someone to spend a year or more telling us what everyone already knew: that users want a singular packaging tool/vision).

For wheels, the dependencies for a given set of wheel tags can be determined declaratively by reading the METADATA file from the wheel, via the PyPI JSON API (albeit only that of the first uploaded wheel, AFAIK), or, per PEP 658, by downloading the METADATA file directly for each wheel. Unfortunately, the latter still hasn’t been implemented on PyPI, despite a PR being open for years due to lacking maintainer review.
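As a quick illustration of the JSON API route (with the same caveat that this reflects the metadata of one uploaded file, not every wheel tag):

    # Minimal sketch: read a project's declared dependencies from the PyPI
    # JSON API. As noted above, this reflects a single uploaded file's
    # metadata, so treat it as a hint rather than per-wheel-tag ground truth.
    import json
    from urllib.request import urlopen

    def pypi_requires_dist(project, version=None):
        path = f"{project}/{version}" if version else project
        with urlopen(f"https://pypi.org/pypi/{path}/json") as resp:
            info = json.load(resp)["info"]
        return info.get("requires_dist") or []

    print(pypi_requires_dist("requests"))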

For sdists, once Metadata 2.2 support is implemented, sdist dependency metadata could also potentially be marked as trustworthy (though AFAIK, the requirements are listed outside of PKG-INFO, so I’m not sure whether that will work with all backends, e.g. Setuptools; maybe Paul or the Setuptools folks would know more). Once again, the ecosystem has been largely blocked for years by it not being implemented on PyPI either.
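For reference, checking what an sdist actually declares is straightforward; a minimal sketch (the filename is illustrative only):

    # Minimal sketch: read PKG-INFO from an sdist and check its
    # Metadata-Version. With 2.2+, any field not listed under "Dynamic" can
    # be treated as static.
    import tarfile
    from email.parser import HeaderParser

    def sdist_metadata(sdist_path):
        with tarfile.open(sdist_path) as tar:
            member = next(m for m in tar.getmembers()
                          if m.name.count("/") == 1 and m.name.endswith("/PKG-INFO"))
            raw = tar.extractfile(member).read().decode("utf-8")
        return HeaderParser().parsestr(raw)

    meta = sdist_metadata("example-1.0.tar.gz")  # illustrative path
    print(meta["Metadata-Version"], meta.get_all("Dynamic"), meta.get_all("Requires-Dist"))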

And for source trees, pyproject metadata (PEP 621) is now supported by Setuptools, along with a conversion tool from setup.cfg, so the only remaining holdout is Poetry, but apparently that’s finally in the works for the next major version. If that’s not used, the backend-specific formats (setup.cfg, [tool.*], etc.) are pretty much all static/declarative. If the project still has its metadata in a dynamic setup.py, or you don’t want to worry about the details of the specific format, you can at least call the prepare_metadata_for_build_wheel hook of the pyproject build backend (PEP 517) to get the metadata (which could involve code execution of course, though only if required by the backend, and generally only the amount needed to extract metadata).
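A rough sketch of that combination, reading the static [project] table first and only falling back to the PEP 517 hook when needed (this assumes Python 3.11+ for tomllib and, if I recall its API correctly, the pypa/build package’s project_wheel_metadata helper for the fallback):

    # Minimal sketch: get a source tree's dependencies. Use the static PEP 621
    # [project] table when present; otherwise ask the build backend via the
    # PEP 517 metadata hook (here driven by pypa/build's helper, per my
    # recollection of its API). The latter may execute backend code.
    import tomllib  # Python 3.11+

    def source_tree_dependencies(src_dir):
        with open(f"{src_dir}/pyproject.toml", "rb") as f:
            project = tomllib.load(f).get("project", {})
        if "dependencies" in project:
            return project["dependencies"]
        from build.util import project_wheel_metadata  # pip install build
        return project_wheel_metadata(src_dir).get_all("Requires-Dist") or []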

5 Likes

We’re talking about a vision for the future here, yes? Not hard to say that your package can’t take part if you use a legacy system :slight_smile:

As far as names – I’ve always thought it was unfortunate that conda started out mirroring PyPI names for packages – it sure seems handy, but as conda handles non-Python packages, it wasn’t too long before conflicts / confusion began. This has been talked about for years, but I don’t think there have ever been any clear conclusions. But it’s not too hard to imagine that there could be an (optional?) prefix for PyPI packages:

conda install pypi_pac_name

Then you’d know what the package name is on PyPI.

2 Likes

I"m not sure that’s that much of a problem – conda-forge uses PyPi as the source for most Python packages – it’s nice to have one place to do it.

The lack of curation is unfortunate, though – there’s a lot of “cruft” on PyPI. Maybe at some point it will make sense to have a more curated PyPI – and the free-for-all version could remain as a place to publish stuff for the world to find – and for other package managers to use – but not the default place for, e.g., pip to search for stuff to install.

I think yes – what we have now isn’t too bad – a package author can decide to submit a recipe to conda-forge if they want, or anyone else can do it. It would be nice if that process were more streamlined, but the main reason it isn’t is not technical (as others have noted, with grayskull the process can be pretty easy) – it’s the curation that takes time. But the curation is worth a lot, too. And of course, the curation is a conda-forge policy – anyone’s free to make their own channel without curation if they want.

2 Likes

Yes, that is exactly the problem, and I think “at some point” is now (or rather, several years ago). This issue was also mentioned in the background discussion on the pypackaging-native website. It’s fine to have a free-for-all repository where people can upload random stuff; it’s also fine to have a place where packagers can go to get source to redistribute it as part of some collection. I don’t think it’s at all a good idea to have that be the same repository that is searched when someone takes the standard built-in package-installing tool and says “install this package for me”.

Whether that amounts to “curation” is a judgement call. I think there is a certain minimal level of curation without which any repository will become effectively useless (if, for instance, no action is taken to remove malicious packages). It’s just a matter of deciding what level is going to be required for the repository referenced by the default package installer that comes with Python.

I agree – I was trying to be polite :slight_smile:

Well, there is an infinite number of levels of “curation” possible, and it’s hard to draw a line – but anything other than “anyone can upload anything at any time” is a good start.

I don’t know how carefully conda-forge’s curation policies are spelled out, but it’s pretty much focused on “this package should work well with the conda-forge ecosystem”, without any judgement as to whether it’s a good or useful package.

As for malicious code – I don’t think there’s any careful screening for that, either – that’s a hard one, but at least it gets human eyes on it, so obvious stuff will get caught.

If someone were to build such a “more curated PyPI”, and it was a success, then it would definitely make sense for pip to use it as a default (assuming organisational questions such as ownership and governance could be sorted out). But it’s not going to happen if we simply hope that the existing two[1] PyPI maintainers will somehow gain the bandwidth to do this.


  1. I think it’s two - I might be over-estimating. And no, I’m not joking. ↩︎

But they did, and it is, and it’s called conda-forge. That’s the observation that spurred this whole line of discussion.

OK. You said “the same repository that is searched when someone takes the standard built-in package-installing tool” and I assumed you were talking about pip, rather than just reiterating your position (that you’ve already made very clear) that you think everyone should switch to conda.