Wanting a singular packaging tool/vision

pf_moore · November 20, 2022, 3:57pm

Sorry. As usual, it appears we’ve ended up talking past each other - in this case, over the fact that we don’t even have a shared understanding of the meaning of the term “name”.

To be honest, at this point I don’t have the energy or interest to try to resolve the misunderstanding. I was only really asking to see if I could dig out some facts on the “how much of PyPI is on conda-forge” question, and the take-away seems to be that “it’s hard to work that out”, which is fine, I guess.

I really do wish we didn’t always end up at this sort of failure to communicate. But I don’t honestly know how to fix it. I think the “conda” and “distutils” communities parted ways so long ago that it won’t be something that happens overnight. And the problem is exacerbated by the fact that, with everyone involved being volunteers, no-one really wants to spend ages debating terminology. I know I don’t.

I was referring to Metadata 1.0 (2001). Maybe conda even predates that. But that standard simply defines a (Python) project’s name - and I was assuming that conda would expose that name somewhere for Python projects (because someone who needed to install “numpy”, or “scikit-learn” would need to start from the Python name). Whereas I think you were assuming I was talking about mapping conda package names to Python names by some sort of normalisation process?

Anyway, as I said above, I don’t think it’s worth setting off down this rabbit hole right now, so I’ll drop the subject.

layday · November 20, 2022, 4:17pm

I think the misunderstanding arose from referring to it as the “project name”. If you had said “distribution name”, I think it would have been clear you weren’t referring to the normalised name on PyPI. A simple misunderstanding is no reason to get disheartened, especially one as easy to resolve as this one was.

pitrou · November 20, 2022, 4:29pm

Well, assuming the answer to the question is important (which I don’t think it is), we can make a quick guesstimate:

There are approximately 400k packages on PyPI
There are approximately 20k packages on conda-forge
conda-forge is not only made of Python packages, but still, it’s being used mainly by Python developers in my experience. So, as a rough guesstimate, let’s say 50% of conda-forge packages are Python packages.
There are therefore approximately 10k Python packages on conda-forge. Assuming the majority of those also have a PyPI entry, we get that 2.5% of PyPI packages (including packages with no binaries on PyPI) may be present on conda-forge.

So you can accept 2.5% as a ballpark number, or you can be reasonably sure that the actual number is somewhere between 1% and 5%.
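The back-of-the-envelope estimate above can be written out in a few lines of Python. This is only a sketch: the inputs are the rough figures quoted in this post, and the 20% and 100% Python shares are illustrative assumptions chosen to show where the 1% to 5% range could come from, not measured values.

```python
# Rough figures quoted in the post above (not measured values).
pypi_packages = 400_000
conda_forge_packages = 20_000


def pypi_share_on_conda_forge(python_fraction):
    """Fraction of PyPI packages on conda-forge, assuming every Python
    package on conda-forge also has a PyPI entry."""
    python_on_conda_forge = conda_forge_packages * python_fraction
    return python_on_conda_forge / pypi_packages


# 50% is the guess made in the post; 20% and 100% are illustrative bounds.
for fraction in (0.2, 0.5, 1.0):
    share = pypi_share_on_conda_forge(fraction)
    print(f"{fraction:.0%} Python -> {share:.1%} of PyPI")
```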

That said, as I alluded to above, I think this is a rather pointless statistic, because of the immense majority of PyPI packages nobody will ever get in contact with. Many PyPI entries are stale or unmaintained, others are simple tests or rough experiments that were abandoned.

conda-forge is much younger than PyPI, but more importantly it is curated. So it’s likely that most of these 20k packages are still getting maintenance from conda-forge packagers, and they are getting actual usage as well.

A more interesting statistic would be, out of the 5k most popular packages on PyPI (using a metric of popularity that only takes into account the last 5 years, to avoid selecting past glories such as the entire Zope ecosystem), how many are not present on conda-forge?

bryevdv · November 20, 2022, 5:42pm

I guess some history could be helpful here. It’s worth noting that at the time conda was conceived:

The PyPA had been around less than a year
Wheels did not exist yet

We had been advised explicitly by Python core developers to “go our own way” with packaging, if existing Python tools did not support scipy/pydata community needs. So we did.

As someone else noted, the “normalization” affected distribution names (i.e., the filename), not package names, per se. And the reason for that was in order to include more detailed platform descriptor data in the filename itself. And I’m sure that tradeoff was in order to make something somewhere else simpler, or whatever (my memory is a bit hazy here).

Would I make the same decision today, with the current (much better) state of python packaging tools? Definitely not. I alluded to that much above. If I could wave a wand, what I would wish for today is:

conda-style environments (because a link farm is more general)
wheel packages for most/all Python packages (because they are sufficient)
“conda packages” (or something like them) for anything else, e.g. non-python requirements

But history is just history, and all projects have some. At some point things just become old decisions that are hard to undo (though it sounds like this specific one largely has been, by and large ^[1]).

These days I am just a conda user, I haven’t been personally involved in conda development for many years ↩︎

pf_moore · November 20, 2022, 6:02pm

Thank you. My recollection was unclear, because I wasn’t really aware of the original development of conda. My feeling was that it was even earlier (my own involvement in packaging goes back before wheels, to the creation of distutils) but that’s after-the-fact reasoning, and I hadn’t bothered looking up details.

Many of the changes PyPA is working on have extremely long timescales (for the same reason, getting people to move off “legacy” approaches is an incredibly slow process, if you don’t want to just alienate everyone). So as a serious question - why not work towards that goal now? It may not happen for a long time, but we can make incremental changes, establish standards for new projects, etc? That’s very much the norm, so I don’t see why it couldn’t work for something like this.

Of course, there’s no guarantee that everyone shares your view on the ideal solution (and if you’re looking to standardise on conda-style environments, that will include the core devs, as venv is a stdlib facility) but I’d hope that negotiation and compromise isn’t out of the question here.

rgommers · November 20, 2022, 8:08pm

I think this is very much a desired solution direction. Every other packaging system is busy repackaging lots and lots of pure Python projects, which is a time-consuming and fairly pointless exercise (from first principles, it just has to be done now because dependency solvers). On the other hand, for non-pure projects it’s essential to rebuild them - and those are the ones for which PyPI is problematic by design anyway.

Something like this does not have to be, and imho should not be, specific to Conda. It will have to involve some name mapping mechanism anyway for when package names don’t match. For Conda there’s a ~90% match so the mapping is easier, but qualitatively it’s the same thing as mapping to python-pkgname in Debian or py-pkgname in Spack + similar minor name variations for <10% of packages.
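A name mapping mechanism of the kind described here might look roughly like the sketch below. Everything in it is hypothetical: the rule table only encodes the prefixes mentioned in this post, the override entry is a made-up placeholder, and real distributions have many more exceptions.

```python
# Hypothetical per-ecosystem naming rules, based only on the conventions
# mentioned in the post; real distributions have many exceptions.
DEFAULT_RULES = {
    "conda": "{name}",
    "debian": "python-{name}",
    "spack": "py-{name}",
}

# Hypothetical explicit overrides for names that do not follow the rule.
OVERRIDES = {
    ("conda", "example-pypi-name"): "example-conda-name",
}


def map_name(pypi_name, ecosystem):
    """Map a PyPI name to the name used by another packaging ecosystem."""
    override = OVERRIDES.get((ecosystem, pypi_name))
    if override is not None:
        return override
    return DEFAULT_RULES[ecosystem].format(name=pypi_name)


print(map_name("numpy", "debian"))
print(map_name("numpy", "spack"))
print(map_name("example-pypi-name", "conda"))
```

The point of the sketch is that a default rule handles the matching majority, while the explicit override table carries the minority of names that differ.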

pf_moore · November 20, 2022, 9:14pm

Excellent news

I do still fear that we might be talking past each other here. @bryevdv said

and I took “non-python requirements” here to mean libraries that Python extensions might need, but with wheels for the Python packages like numpy. I’m still unclear on how (for example) installing numpy under a python.org Python would pick up such a “conda package” of whatever non-python dependencies numpy might have, but I assume that’s a detail that would need to be sorted out, and that’s fine.

But when you say PyPI is “problematic by design” for non-pure Python packages, I wonder if you are suggesting that a non-pure package like numpy would simply not be available on PyPI, and if that’s the case I don’t see how, in your view, a user of the python.org release of Python would get access to numpy.

Again, I’m sure these questions can be resolved, and I don’t expect anything to happen quickly, but I do think it’s important we start the way we mean to go on, by being very clear with each other so we don’t build misunderstandings into the debate at the very core.

To reiterate, for the sake of clarity - my key goal is to ensure that users of the python.org release of Python, the Windows Store python, and any other builds that aren’t “part of a distribution”, still have access to tools like numpy, pandas, scikit-learn, etc. And to minimise the effort required from the developers of those packages to make that happen.

rgommers · November 20, 2022, 9:53pm

I don’t think that’s feasible to be honest. That is basically “unvendor now-bundled dependencies from wheels into a different package manager”, which I don’t think can work. numpy wheels must remain working with only things from PyPI, which means vendoring some dependencies.

Despite the topic of “singular packaging tool” that this thread started with, I don’t think proposing a massive break like “no numpy package on PyPI” has any chance of succeeding (nor am I keen to deal with the fallout of that as a numpy maintainer).

That said, yes you are right that numpy contains C/C++ code, and that is problematic - best to get that from elsewhere. I am interested in evolution in a (mostly) backwards-compatible way, so we enable pure Python packages from PyPI + everything not-pure-Python from elsewhere.

pf_moore · November 20, 2022, 10:21pm

So to be clear, when you say “everything not-pure-Python from elsewhere”, how would you see someone using the Windows Store Python getting numpy in that situation?

Edit: I did see your comment “I don’t think proposing a massive break like “no numpy package on PyPI” has any chance of succeeding”, I just can’t reconcile it with the idea of “everything not-pure-Python from elsewhere”…

h-vetinari · November 20, 2022, 10:56pm

I don’t think it’s impossible, just not very easy. Pip would have to allow plugging in other package managers, which brings a whole lot of complexity^[1]. However that approach might still be more attractive to some people rather than turning pip into a fullstack installer like conda.

It’s even conceivable (as long as we’re brainstorming), that pip could fall back to a “fat” wheel that includes vendored binaries if it doesn’t find another package manager.

especially since not all package managers have the same conventions, i.e. naming of lib-artefacts, which the python-part of the package would rely on… ↩︎

CAM-Gerlach · November 21, 2022, 12:30am

I’m sorry, my fault for getting too focused on arguing one particular point and getting away from helping you answer your question, and for initially mischaracterizing the severity of the name-mismatch issue with respect to what you’re looking for specifically. I really should have known better than to respond right away after staying up most of the night working on updating PEP 639 again. I’ve been trying to work on improving my communication in this regard, but I obviously still have quite some work to do.

In any case, distilling down my long ramblings above, you should be able to get a pretty good lower bound of P(conda | PyPI top N) (with maybe up to ≈5-10% error) by just running the following (untested) snippet I’ve cooked up (assuming PYPI_TOP_N_NAMES is a set of PEP 503-normalized PyPI distribution names for the top N packages):

import requests

url = "https://conda.anaconda.org/conda-forge/{}/current_repodata.json"
arches = ["noarch", "linux-64", "win-64", "osx-64"]
conda_names = set()
for arch in arches:
    repodata = requests.get(url.format(arch)).json()
    # Keys are filenames; the package name is in each record's "name" field
    for key in ("packages", "packages.conda"):
        conda_names |= {rec["name"] for rec in repodata.get(key, {}).values()}
pypi_names_on_conda = PYPI_TOP_N_NAMES & conda_names
p_conda_given_pypi_top_n = len(pypi_names_on_conda) / len(PYPI_TOP_N_NAMES)
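For reference, the PEP 503 normalization that PYPI_TOP_N_NAMES is assumed to follow can be sketched as below. The regular expression is the one given in PEP 503; the three input names are placeholders, since where the real top-N list comes from is not specified in this thread.

```python
import re


def normalize(name):
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


# Placeholder input; a real list of top-N names would come from elsewhere.
raw_top_names = ["NumPy", "scikit_learn", "ruamel.yaml"]
PYPI_TOP_N_NAMES = {normalize(name) for name in raw_top_names}
print(sorted(PYPI_TOP_N_NAMES))
```

conda package names are not guaranteed to follow this normalization, so a plain set intersection can miss packages whose names differ only in separators. That is one reason to treat the result as a lower bound.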

njs · November 21, 2022, 12:52am

I think the simplest way to make conda/pip play well together would be for conda to add first-class support for the upstream python packaging formats – wheels, .dist-info directories, etc. Then conda could see a complete picture of everything that’s installed, whether from conda packages or wheels, handle conflicts between them, etc.

This seems a lot easier than pip growing to support conda, because pip is responsible for supporting all python environments – venv, distro, whatever – while conda is free to specialize. Also the python packaging formats are much better documented than the conda equivalents.

And if we had a single package manager that was clever enough to understand mixed conda/wheel environments, then we could potentially define new wheel targets like “condalinux” or similar, that can be uploaded to pypi alongside more general wheels like manylinux, and can declare dependencies on a mix of conda and wheel packages.

But I’ve pitched this to the conda folks every so often for years now and they’ve never followed up, so

idk maybe mamba would be interested?

CAM-Gerlach · November 21, 2022, 1:18am

I’m sure the Conda folks can address this more, but as I understand it from talking with them at SciPy about this topic, and from past discussions and statements, substantial improvements (much less “first class” support) to the interoperability conda currently offers would require new metadata and standards (some form of Wheel 2.0 was discussed, given @dholth had recently joined Anaconda) on the PyPA side.

Conda already nominally supports quite a bit of that, but the issue is it can’t do so safely or reliably without a significant chance of things breaking over time, because some of the concepts for how packages are managed (e.g. extras vs. -base/variant packages, vendored vs shared binary deps, every package building their own wheels vs. a central set of managed compilers, etc) just don’t currently map 1:1 between the two ecosystems.

My impression at SciPy was that there was quite a bit of interest on the Conda side in helping propose the standards that would allow that, and I was interested in helping that along, but I’m not sure what progress has been made on that.

njs · November 21, 2022, 3:19am

I don’t understand how any of those are problems – they don’t have to map 1:1 as long as a tool understands both sides. pip can install working packages into a conda python right now with extras+vendored deps+maintainer-produced wheels and everything works fine at runtime – it’s only when you want to add/remove packages afterwards that you get problems, because conda and pip can’t see each other and stomp on each other’s state.
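As a rough illustration of the "can't see each other" problem, the sketch below lists what is visible through Python packaging metadata in the current environment, grouped by the tool recorded in each distribution's INSTALLER file. It uses only the standard library's importlib.metadata. Which tools write an INSTALLER file, and what they write there, varies, so the grouping is an assumption about the environment rather than a guarantee.

```python
from collections import defaultdict
from importlib.metadata import distributions


def packages_by_installer():
    """Group installed distributions by the tool named in their INSTALLER file."""
    groups = defaultdict(list)
    for dist in distributions():
        # read_text returns None when the file is missing.
        installer = (dist.read_text("INSTALLER") or "").strip() or "unknown"
        groups[installer].append(dist.metadata["Name"])
    return dict(groups)


for installer, names in sorted(packages_by_installer().items()):
    print(f"{installer}: {len(names)} distributions")
```

Anything installed without .dist-info or .egg-info metadata, such as a non-Python library from a conda package, will not show up here at all, which is the blind spot in one direction.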

But sure, I could be missing something – if there’s some important blockers that we don’t know about then I guess the next step is for them to speak up.

ofek · November 21, 2022, 3:43am

Neither have I, as a Windows user

pradyunsg · November 21, 2022, 5:58am

Is there some discussion/documentation where one can read more about the link farm being discussed/mentioned here? I’m mostly looking to get a better understanding of what you mean by this, so even an explanation inline would be helpful.

(Sorry if I missed it in this thread — it’s evolved a bit quicker than I can keep up)

barry-scott · November 21, 2022, 9:49am

And I use it all the time - I have many versions of python installed at the same time and py.exe makes it easy to work with them all without PATH editing or using long paths to the python.exe.

And py.exe is recommended on the users list all the time to run python on windows.

jezdez · November 21, 2022, 10:44am

Hey all,

I’ve been following the thread a little and thought it would be useful to give a quick heads-up that I’ve been working at Anaconda for a bit over a year as a tech lead on conda, and we’ve been heads down on catching up with technical and organizational debt of the past ~10 years.

A while ago, I co-founded PyPA and maintained pip/virtualenv in the early days (literally taking them off Ian’s hands), as some maybe still remember. I took the job at Anaconda in particular because I think conda has played an important, parallel role in enabling a huge amount of Python users to solve their tricky packaging problems. The elephant in the room is the continued growth of the Python community and the need for a diverse set of tools (based on standards/specs) to cater to it. I hope to build bridges between conda and the PyPA stack as much as possible to improve the user experience of both ecosystems. I’m very excited to see the current generation of packaging tools like Hatch, PDM and Poetry.

Here are some broad stroke comments/ideas:

conda is now a multi-stakeholder OSS project, which is being recognized in recent updates to its governance policy. If anyone from the early days of conda reads this, I hope you enjoy seeing that. Anaconda is still invested (and increasingly so) but it’s not the only stakeholder anymore.
The conda maintainers have made the same painful mistakes of over-optimizing for a particular subset of users in the scientific community, like PyPA has done for web/infra users.
The problems described in this thread about covering the “full packaging stack, with non-Python dependencies” are real and largely solved by conda (among others), and the main focus is now on keeping up with 3rd party ecosystems like Python/PyPA and catering to the community growth.
The growth of conda-forge (and to a lesser extent Anaconda Distribution) continues to require investment into scaling the build and distribution infrastructure, improving build tooling and catering to user needs.
conda/PyPA compatibility is not perfect; there is a badly named pip_interop_enabled config option which makes conda take Python packaging metadata into account for some subcommands. This will hopefully be improved and become the default in the future.
Conda specifications, schemas and enhancement proposals are steadily being written by contributors and voted on by the Conda Steering Council.
The discussions I had at SciPy this year with @CAM-Gerlach and @henryiii about trying to align the conda packaging format with the wheels format were mostly about getting the temperature in the room, and I hope to be working with @dholth in the future on it.

Full disclosure, I don’t love engaging with these catch-all threads, since they tend to miss the nuance of the topic and lack actionable results, but I’d be interested in being proven wrong.

So if you have any specific questions about conda, do not hesitate to ask!

pf_moore · November 21, 2022, 11:08am

Two specific questions, if I may.

Do you ever see conda being usable for python builds that are not themselves managed by conda?
Do you see conda builds of python packages being used (directly or via some sort of repackaging mechanism) by non-conda tools like pip?

For me, those are the two key factors that will determine whether we should be thinking in terms of a single unified ecosystem, or multiple independent ones.

h-vetinari · November 21, 2022, 11:20am

Not speaking for Jannis obviously, but just to note that w.r.t. repackaging, there’s been an effort to do that (unfortunately stalled, even though I just saw it’s listed as “fundable” by the PSF) with conda-press (GitHub: conda-incubator/conda-press), a tool to press conda packages into wheels. Not sure if @scopatz is reading here, haven’t seen him around in quite a while…

But at least this shows that (possibly minus some corner-cases), it should be technically feasible.