Pip/conda compatibility

+1 from me FWIW, looking back AFAIK nearly every reply after that comment was related to that topic

Thanks for the thread split, @brettcannon :slight_smile:

I made time today to draft a PR that adjusts the way PyPUG covers the installation environment PEPs based on the issues I filed last week: Clarify specs for external installer interactions by ncoghlan · Pull Request #1215 · pypa/packaging.python.org · GitHub

Gist of the changes:

  • actually link to PEP 405 (virtual environments) and PEP 668 (EXTERNALLY-MANAGED) from the spec page
  • new “Package Installation Environment Metadata” subsection listing those two PEPs, as well as the existing pages for the installation database, entry points, and direct URL recording
  • add the new installation database subsection that’s explicit on how to combine omitting RECORD and adding INSTALLER to declare an installed package as off-limits to other tools
  • clarify in the new virtual environments page how virtual environments should be detected at runtime, and how the PEP 405 mechanism can be used to establish that state in non-venv use cases
  • improve the internal cross-referencing between the installation database specs
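
As a rough sketch of how another tool might check the RECORD/INSTALLER convention described above (the helper name is mine, not from the spec), using only the standard library:

```python
from importlib import metadata

def is_off_limits(dist: metadata.Distribution) -> bool:
    # Per the installation database spec: omitting RECORD while still
    # writing INSTALLER declares "managed externally, hands off".
    record = dist.read_text("RECORD")        # None if the file is absent
    installer = dist.read_text("INSTALLER")  # e.g. "pip" or "conda"
    return record is None and installer is not None
```

For an installed project this could be called as `is_off_limits(metadata.distribution("some-package"))`.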

Nothing I’ve covered there is new, but it wasn’t as clear as it could be to folks that weren’t involved in creating the referenced specs (hence the title of the PR).

The two roles I identified in The Python Packaging Ecosystem | Curious Efficiency still hold: pip is a plugin manager for Python runtimes, conda is a cross-platform environment manager (where making data analysis with Python easier was the original motivating use case).

There’s just significant demand for binary management approaches that still work regardless of how the underlying Python runtime was obtained, and in some cases it’s feasible to provide those capabilities.

In the long run, redistributors benefit from those changes too, as it can help take a lot of the manual rework out of the redistribution process (while there are tools to auto-generate Linux packaging specs and conda recipes from PyPI packages, they fall down at the same place that pip install does: when there are build or runtime dependencies that the Python packaging metadata isn’t sufficiently expressive to convey).

5 Likes

Nice definition – though there are two complications / confusions:

  1. pip is widely used with environment managers – to the point where it’s unclear to many (or at least me :slight_smile: ) where the distinction is between the “package installer” and the “environment manager”. And recent PEPs have certainly reinforced the “blending” of those roles (PEP 668, 704, 582). In my comments about pip / conda interaction, this is most of my issue – if you want pip to work well with conda (or any other package / environment manager), pip shouldn’t assume anything about environments.

  2. conda is an environment manager, but it is also a package manager – in theory, you could have conda packages without the isolated environment features (e.g. homebrew) – it would be much less useful, but not useless – providing a variety of binary packages would still be useful. Related to the previous point – perhaps it’s problematic to try to make package and environment management orthogonal.

I think conda is about a lot more than simply managing the Python runtime. I don’t think the conflicts that arise have to do with that point, particularly.

Well, maybe – I’m not sure it’s expressiveness exactly. E.g. I suppose a Python package could have metadata that says “you need libjpeg version x.y to build this”, but the challenge is that exactly what a library is called, and what its versioning scheme might be, is not universal – yum has its naming scheme, conda-forge has its, apt-get has its – so it would require a lot of careful coordination and mapping of naming schemes to make this useful. Which is what I was getting at with this point:

PS:

I can’t argue with that – it did come from Continuum Analytics, but I don’t think it’s helpful to keep bringing up that point, for a few reasons:

  • “data analysis” is a pretty unclear (and new) term – I’d prefer “scientific computing” or maybe “numeric computing” – but those aren’t good either – maybe “anything that relies on numpy” ?

  • Even if that was the motivation originally, it was VERY quickly expanded – the “real” motivation was making it easier to work with non-Python libraries (usually C libs) in a Python context. And this was helpful to a lot of things we don’t think of as “data analytics”. In fact, I discovered conda AFTER starting the MacPython GitHub org as a place to build binaries for the Mac (GitHub - MacPython/mac-builds: Gattai recipes for building various pyton packages for the Mac – not touched since 2014 – since I left it for conda, the MacPython org has been used by others for some great work building wheels instead), which I started because it was really painful to get things like netCDF4, GDAL, libgd, even Pillow (PIL), which all relied on other common C libs like libpng, xlib, libjpeg, …

    I also noticed that I often used literally 4-5 Python packages in the same project that all relied on some of the same C libs (wxPython, GDAL, Pillow, and Matplotlib all use libpng, libjpeg, etc.). Maybe there are no practical problems with each of these statically linking, but it sure feels like a kludge :slight_smile: Anyway, when I found conda, it seemed to be a much better solution to these problems. And that was not just because I could get many of the packages I wanted without having to build them myself, but because it provided the required C libs as well, so I could create my own conda packages that required them without rebuilding them. And this has very little to do with “data analytics”.

2 Likes

But that’s not pip’s doing, that’s other tools that are most likely calling pip. And I would argue the PEPs you listed just tell pip what to do when faced with certain scenarios involving a virtual environment, not managing them (i.e. pip never creates a virtual environment).

Not until you ditch the arbitrary shell script execution that conda environments support on behalf of conda packages that use that mechanism.

Worth noting that PEP 668 specifically made it so pip didn’t have to assume anything about the environment (it could check, without assuming), and PEPs 704 and 582 are both stalled, largely on the basis that pip should not become an environment manager.

The only real blending that’s going on is user assumptions about how things should work. Which is valid, but leaves us the option to resist the blending, rather than showing that it’s actually happening.

You can do data analysis without Numpy, using e.g. Arrow-based tools. And Numpy is relatively easy to build and package, compared to other libraries such as PyArrow or anything CUDA-related.

Some libraries, though, don’t support multiple static copies very well. For example, it turns out that having several copies of protobuf (or gRPC, I suppose) loaded in a process can be a PITA.

There might be ways around those issues but they are often painful to make work and maintain, especially in a cross-platform way.

I guess I misspoke – I meant “something much like conda” – I wasn’t talking about the details of conda implementation.

Which is a good thing, yes – it provides a way for the environment (loosely defined) to communicate to pip, rather than the other way around. It seems to be quite focused on the Linux-y use case, but I hope to spend a bit of time seeing how conda might be able to take advantage – maybe it’s more flexible than I think it is – I’ll have to try it out.

yes, exactly :slight_smile:

Well, numpy is used as an interchange protocol between python and libs written in C, etc. And compiled code that relies on the numpy API is a big issue that conda set out to solve, which is what I meant. And at least on conda-forge, PyArrow depends on numpy, and I’m pretty sure at least some of the CUDA libs do too.

But your point is well taken – what is “data analysis” as it relates to Python? What are the use-cases for a conda-like system? It’s all very hard to talk about.

We discussed this one in PEP 704 - Require virtual environments by default for package installers. The greatest value is that Conda should be able to not provide the file, and allow pip to continue working normally (and the problem with the main proposal in that thread is that pip would start looking for other signals to make essentially the same decision). But Conda’s own environment (the one with Conda installed) could include the marker to discourage people from modifying it directly.

yup – I’m going to play with that a bit, and it might make sense. But if conda does decide to make the “base” environment special and protected (which I’d love to see), it would have to do that for both pip and conda itself – and, in theory, any other package installer that might be run inside conda.

Anyway, what I’d like to see is perhaps to set that flag for all conda environments so that you can’t pip install something into a conda environment without a warning that you should make sure there isn’t a conda package that could do the job. And in that case, the flag name “--break-system-packages” is less than ideal.

But I’m also going to explore setting pip configuration with different defaults and see how that works.

Experiment results:

I created an EXTERNALLY-MANAGED file and copied it into the Python installation in a conda environment. This resulted in:

$ pip install something
error: externally-managed-environment

× This environment is externally managed
╰─> While you can use pip to install packages into a conda environment,
    it is more reliable to use conda packages, if they are available.
    
    To install Python packages into a conda environment,
    try:
    
    conda install -c conda-forge pkg_name
    
    Note that sometimes conda packages have slightly different names than
    the python package.
    
    conda search -c conda-forge pkg_name
    
    may help you find it.
    
    If you are sure you want to use pip to install the package, then
    you can override this message with:
    
    pip install --break-system-packages pkg_name

note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages.
hint: See PEP 668 for the detailed specification.

So not bad, but the message provided by pip could be pretty confusing – that’s what I mean by the assumption of a Linux distro.
Maybe that message can be overridden as well.

Protecting users from breaking “the” conda environment has been a feature request for a long time, and I know there have been various thoughts about different solutions (Feature Request: Option for default (base) environment to be seperate from "conda" environment · Issue #10077 · conda/conda · GitHub). It would be great to see if this helps.

But as you say, it’s not just pip users that break conda environments, conda users can usually manage to do it all by themselves.

Take a closer look at PEP 668… you can customize the message quite
easily for the affected environment. If the EXTERNALLY-MANAGED flag
file has content in the expected format, installers will return text
provided there instead of the default error (Debian’s
libpython3.11-stdlib package is doing exactly that, for example).

“The EXTERNALLY-MANAGED file is an INI-style metadata file intended
to be parsable by the standard library configparser module. If the
file can be parsed by configparser.ConfigParser(interpolation=None)
using the UTF-8 encoding, and it contains a section
[externally-managed], then the installer should look for an error
message specified in the file and output it as part of its error.”

2 Likes

Did you take a look at my experiment? – I provided a customized message, but pip wraps that with some additional text:

note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages.
hint: See PEP 668 for the detailed specification.

I imagine it’s there so that pip / CPython don’t get as many complaints – but that text is pretty oriented to the original use case.

As mentioned previously in this thread, there’s an active Conda issue to implement EXTERNALLY-MANAGED in the Conda base environment that was specifically prompted by the initial PEP 704 discussion, and is actively planned for implementation in a forthcoming release, just pending a final decision on UX and how to handle existing base environments:

2 Likes

Got it, you’re not complaining about the default error message,
you’re complaining about the note_stmt pip adds to its
ExternallyManagedEnvironment exception (in
src/pip/_internal/exceptions.py). It’s probably worth raising an
issue for pip in order to work out improved wording which both
retains the property of deflecting user blame away from pip and
still applies to environment managers you feel neither qualify as
Python installations nor OS distributions.

Will do – thanks.

Thanks – I’ll go join that discussion.

Being an environment manager doesn’t imply the ability to manage multiple environments at once. Some environment managers do have that (e.g., conda, Spack, Linux distros via containers), but for this purpose, the only necessary qualification is to be a way of obtaining a Python runtime. While an environment manager may also provide other non-Python things (and most of them do outside the Python-runtime-only Windows installers), the critical piece that makes an environment manager relevant to PyPA is that it provides a Python runtime. No Python runtime means no Python packages, and hence nothing that PyPA might need to care about.

Don’t try to think of the distinction as making environment management and package management orthogonal. Think of them as making the actual runtime environment layered. You have a base layer provided by the underlying environment manager (think homebrew, Spack, dnf/yum/rpm, apt/dpkg, python.org’s Python installer, Python from the Windows Store, ActiveState’s Python installer, and yes, conda), and then optionally a custom layer of Python packages installed via the Python-specific packaging tools (e.g. pip, ActiveState’s PyPM).

The way that the runtime environment communicates with Python packaging tools is primarily via the Python runtime itself (e.g. sysconfig paths, platform information, venv detection via the sys module), but there are also options for explicit overrides (e.g. the EXTERNALLY-MANAGED file, the importable _manylinux override module) that let an environment communicate its expectations and limitations to Python-specific tools.
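
For instance, the runtime-provided signals mentioned above boil down to a few standard library lookups (the function name here is mine):

```python
import sys
import sysconfig

def in_virtual_environment() -> bool:
    # PEP 405: a venv's pyvenv.cfg makes sys.prefix point at the venv,
    # while sys.base_prefix still points at the base interpreter; this
    # comparison is the runtime signal installers rely on.
    return sys.prefix != sys.base_prefix

# Installation layout and platform details the environment exposes:
site_packages = sysconfig.get_paths()["purelib"]  # where pure Python packages go
platform_tag = sysconfig.get_platform()           # e.g. "linux-x86_64"
```

The explicit override files then layer on top of these implicit signals.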

And in that layered context, what makes packages with binary extensions and other external dependencies challenging is that they break that neat layering abstraction, since they depend on things that the Python-specific tooling has no way to describe. To date, the only solutions that anyone has found to avoid that layering violation are to either:

  1. Pull the external dependencies into the package (e.g. static linking, bundling dynamic libs and other commands). This often works (especially in simple cases), but it’s wasteful in many ways (including runtime memory, where multiple versions of the same library may get loaded), and sometimes outright doesn’t work (when a library doesn’t support having multiple copies loaded in the same process)
  2. Pull both the external dependencies and the things that depend on them down into the environment manager layer, which does have the ability to describe those dependencies (this is the missing information that repackaging a Python project for Linux or conda or any other platform adds)

For Python specifically, only the first option lives within PyPA’s area of responsibility. The second falls into the area of the environment managers, including conda.

Attempting to avoid the duplication of effort across the different environment managers by saying “What if we came up with a meta-description for external dependency declarations that could be automatically mapped to the environment manager specific providers of those requirements?” turns out not to work: there are a variety of common techniques that environment managers use for dependency management which mean the dependency declarations can’t readily be generalised:

  • arbitrary renaming. Just because you know the name of the project that you depend on doesn’t mean that you know what a given environment manager calls either the library or the package that provides it.
  • version adjustment. Environment managers may apply patches to rebuilt packages and make changes to the package versioning information in ways that packages for that environment know how to handle, but more generally portable packages may not
  • splitting projects up. To minimise deployments, optional things (like C header files) are often separated out into supporting packages, so dependencies on those need to be declared separately

For folks on Linux, the system package manager is built to handle those problems. They have their limitations, which is one of the contributing factors to the popularity of Linux containers, but they do solve the problems they set out to solve.

homebrew essentially functions as a developer-focused equivalent to a Linux distro package manager for Mac OS X users.

And conda slots in as a comparable system that not only works on Windows, but can also be used on Linux and Mac OS X. It’s genuinely good at what it does, but what it does isn’t the same thing as what pip and other Python-specific tools do.

Hence my comment the other day, that it might be worth someone taking a run at this problem from the perspective of enhancing the Python level metadata to instead distinguish the following categories (rather than attempting to tackle the unholy mess that is trying to describe external dependencies in a generic way):

  • provides pure Python packages with no external dependencies:
    • pip install will work fine
    • automated repackaging into other packaging systems should work fine
  • provides binary packages with no external dependencies:
    • pip install of a compatible wheel will work fine
    • automated repackaging into other packaging systems should work fine, but the relevant build time compiler dependency declaration may need to be added
    • when applied to sdists, indicates build will fail if relevant compiler(s) is(/are) not available
  • provides binary packages with embedded dependencies:
    • pip install of a compatible wheel will work, but a platform native build with shared external dependencies will be better in some way (even if it’s just reduced memory usage from shared libraries)
    • automated repackaging into other packaging systems will get you started, but additional effort will be needed to get an actual working package (even if it’s just declaring the missing external dependencies)
    • when applied to sdists, indicates build will fail if relevant compiler(s) and external dependencies are not available
  • no binary package available, attempting to automatically build from source is not recommended:
    • pip install simply isn’t expected to work without careful prior preparation of the build environment
    • automated repackaging into other packaging systems will barely get you started
    • only applies to sdists, indicates build will fail if relevant compiler(s) and external dependencies are not available

While there are other distinctions that could definitely be made, those four are the cases that really matter in terms of avoiding the current situation where an innocent pip install high-level-project ends up pulling in a dependency chain with some level of nightmarish build triggered by a source-only package.

The categorisation on its own wouldn’t immediately solve anything, but it could lay the foundation for UX improvements in many areas (e.g. conda install could confidently delegate to pip install in the first case, let folks opt in to also doing it for the second and third cases, and treat the last case as “you’re going to have to define a conda recipe for this”)
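
To make that idea concrete, here is a purely hypothetical sketch (no such metadata field exists today; all the names are invented) of the delegation logic those four categories could enable:

```python
from enum import Enum

class BuildCategory(Enum):
    # Invented values for the four categories described above.
    PURE_PYTHON = "pure-python"
    BINARY_NO_EXTERNAL = "binary-no-external-deps"
    BINARY_EMBEDDED = "binary-embedded-deps"
    SOURCE_ONLY_HARD = "source-only-not-recommended"

def can_delegate_to_pip(category: BuildCategory, opt_in_binary: bool = False) -> bool:
    """Sketch of how an environment manager might decide whether
    'conda install X' can safely hand off to 'pip install X'."""
    if category is BuildCategory.PURE_PYTHON:
        return True  # always safe to delegate
    if category in (BuildCategory.BINARY_NO_EXTERNAL, BuildCategory.BINARY_EMBEDDED):
        return opt_in_binary  # user opts in to wheel installs
    return False  # source-only: define a native recipe instead
```

The point isn’t this particular logic, just that a machine-readable category would let tools make these decisions without guessing.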

As far as “data analytics” goes, that term exists because there isn’t another term that encompasses the notion of “analysing data sets, regardless of the contents of those data sets or the purpose of analysing them”. It covers everything from quants analysing the stock market to physicists analysing a particle accelerator experiment to biologists analysing a genome to insurance assessors training an underwriting risk assessment engine to AI researchers training a language model to a home user making a chart of how much power their solar panels published back to the grid that week. So yes, it’s an absurdly broad and generic category description, but that’s because a common set of libraries and tools underpin an absurdly broad collection of use cases.

3 Likes

Thanks @ncoghlan: that does lay it out very clearly. It’s really tricky when these terms are so overloaded.

Which, to be clear is:

  • provides pure Python packages with no external dependencies:

And yes, pip install inside a conda environment will be pretty reliable – but still not perfect – because that package could be a dependency of some other conda package, but conda wouldn’t know that it was already there :frowning: – I think it would be better for conda to automatically build a conda package and then install that – but that gets to another point you brought up – what is it called? I do wish we’d been more forward-thinking and developed a naming scheme like pypi_package_name – which would mean this is the SAME package as the PyPI package of the same name. But I’m not sure if it would be possible to patch that in on top of the existing pile of packages :frowning:

that is the nightmarish scenario – but the bad dream scenario (within conda) is that a pip install brings in a pile of dependencies via pip that could have been resolved by conda – and those may override conda-installed packages – it’s a mess.

I actually think the opt-in should be for all cases, but that’s up to conda (or an individual conda-user) to decide.

Actually, my complaint is that the term is too narrow, not too wide. E.g. I develop an oil spill fate and transport model written mostly in Python – is that “data analytics”? Anyway, conda, as an environment manager that “provide[s] other non-Python things”, can be very useful for stuff that has nothing to do with any of this – it would be nice if that message was clear.