Pip/conda compatibility

(Note for board admins: it may make sense to split this out to a new thread, since it’s less about “a one-size-fits-all tool might be nice to have” and more “there are still some genuinely rough edges in conda/pip interoperability that we might be able to do something about without having to tackle the Gordian knot that is detailed platform dependency declarations”)

@pf_moore @rgommers @pzwang @jezdez @encukou @hroncok The various references in this thread to the mechanisms that Linux distros use to communicate preferences to pip and other Python installers prompted me to go find a good link to explain what those are to readers that aren’t already familiar with them. I couldn’t find one, which seems like it should be a fixable problem:

Both of those are relevant to the conda/pip interoperability question:

  • if conda is currently providing .dist-info/RECORD files for the Python packages it installs, then dropping or renaming them would be one way to get pip and other tools to always leave conda-installed packages alone (however, I don’t know enough about conda’s mechanics to know if that might cause other issues, in which case the “split installation & import path” approach might be preferable)
  • conda environments may want to implement a Linux-distro style split between the install location used for curated distribution packages and the install location used by Python-specific tooling (this means Python-specific tooling will at worst shadow the conda provided packages, and potentially not even that if the conda managed directories are given precedence over the Python-only ones)
  • at a specification level, it’s worth considering whether the EXTERNALLY-MANAGED spec should be enhanced to allow runtime providers to select a “no native package installation” middle ground between “allow any packages (default)” and “disallow all packages (EXTERNALLY-MANAGED exists)”

The official docs do cover conda here: Overview of Python Packaging - Python Packaging User Guide

That page as a whole is also the one that gives the broadest guidelines for choosing between different software distribution options: Overview of Python Packaging - Python Packaging User Guide

It’s also worth keeping that page in mind when folks are pointing out that there are plenty of packaging use cases that conda doesn’t address but Python-specific tools need to handle (or at least play nice with).

(One genuine criticism of that page is that we completely duck the question of “What do we do when wheel’s platform dependency declarations aren’t up to the task?” when introducing the wheel format. That would be a good place to add a forward reference to the “Separate software distribution ecosystem” section, since that’s the problem space where conda is a great tool to reach for)

conda is also listed as one of the recommended tools in Tool recommendations - Python Packaging User Guide

Edit to add: the description of conda in Installing scientific packages - Python Packaging User Guide could use some TLC

Just to note, conda/conda#12245 is currently open (as a result of initial discussion on PEP 704) to add an EXTERNALLY-MANAGED file to Conda’s base environment by default, to avoid pip breaking Conda/the conda installation itself.

That’s an interesting possibility to consider as a middle ground for non-base environments. However, I’d be inclined to think it would disrupt enough existing workflows that rely on pip updating conda-installed packages for various reasons (an outdated conda package, updated dependency requirements from a PyPI-only package, pip install ., etc.) that it might need to be at least opt-out, if not opt-in, at install time. There might also be other reasons why it’s a non-starter, not sure.

There has been a lot of relevant discussion about this approach and the other mechanisms in general on the PEP 704 thread, starting roughly here. The Conda folks there were generally against this option as harming rather than helping Conda-pip interoperability, whereas the PyPA folks believed it would help.

I’m a little unclear what you mean by “no native package installation” here. Could you explain further? I.e., by “no native” do you mean no non-pure-Python, no pip, no conda, or something else?


Thanks, that’s interesting. I was thinking more like here and here. That is, when there is a page saying “how do I distribute code I wrote/install code someone else wrote”, tell people at that time that they can choose to ignore any or all of the official tools and use something else (not just conda, but poetry, PDM, etc.). Basically my philosophy here is that if we’re not going to unify things, let’s not hide the mess by implying that “the way” to install/distribute packages is just using pip or setuptools.

Maybe this is just a matter of preference, but I think something that rubs me the wrong way about the page you linked to is that it is not framed in terms of user goals. That is fine as a sort of theoretical overview, but for users with things they want to accomplish, it is not very helpful. For instance, a heading like “Depending on a separate software distribution ecosystem” is completely opaque to someone who doesn’t already know whether they want to do that. I’m thinking in terms of things like “I want to distribute an application that people can run without needing to have Python installed” or “I wrote a library that I want to share privately with collaborators without making it available for the general public”.

To be honest, though, looking at that page is just making me want to retract my suggestion altogether, because it reinforces how fragmented the landscape is. :confused:


FWIW, the packaging tutorial includes an option for PDM as well as Hatch (the initial selection), Flit and Setuptools. Poetry will presumably be added as soon as they finish implementing support for the pyproject metadata standard (PEP 621) that all the other tools already support. So I’m not sure how this implies that “the way” to distribute packages is Setuptools; if anything, it implies the opposite, since Setuptools is not the default shown and PDM is displayed just as prominently. There are certainly other pages that need updating to reflect that we’re in a post-Setuptools-required world, but that one isn’t really it.

I wonder how common this actually is. Conda does not use standard Python packaging metadata (.dist-info) but tracks its own metadata elsewhere, so upgrading a Conda-installed package with pip would make Conda extremely confused and likely not work (well) anymore. Also, since this additional check only happens when pip touches a Conda package installed after the change, existing installations won’t be affected, since they do have RECORD.

The only scenario I can think of that would be affected, and practically makes sense, is scripts that repeatedly create short-lived environments (e.g. CI) that rely on this overwrite, and a conda remove --force command before the pip call in the script would be sufficient to keep things working. Overall, my impression is this may be much less disruptive than you’re anticipating.

Whoops, didn’t notice that “native” was ambiguous there. Your first inference was the right one: a way for the runtime provider to say “it’s fine to install and manage pure Python packages, but let the external package manager deal with anything that requires binary extension modules or other bundled platform dependent components”.

Unfortunately, I don’t think it’s possible with our current metadata to reliably tell if an sdist is pure Python or not without trying to turn it into a wheel, so the UX would still be pretty rough.
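To illustrate the contrast: for wheels the equivalent check is trivial, since the compatibility tags are embedded in the filename. A minimal sketch using the third-party packaging library (the example filenames are just illustrations):

```python
from packaging.utils import parse_wheel_filename

def is_pure_python_wheel(filename: str) -> bool:
    """Report whether a wheel's compatibility tags mark it as pure Python."""
    _name, _version, _build, tags = parse_wheel_filename(filename)
    # A pure Python wheel declares no ABI and no platform restriction.
    return all(tag.abi == "none" and tag.platform == "any" for tag in tags)

print(is_pure_python_wheel("requests-2.31.0-py3-none-any.whl"))        # True
print(is_pure_python_wheel("numpy-1.26.4-cp312-cp312-win_amd64.whl"))  # False
```

No such shortcut exists for an sdist: until a build backend has actually produced a wheel, there’s nothing equivalent to these tags to inspect.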

This limitation also ties into the point raised in The multiple purposes of PyPI - pypackaging-native about there being no way for a project to indicate that its sdists are being published for redistributor integration rather than for end user installation.

Even in the absence of solving the full platform dependency declaration problem, there may be value in offering a way to categorise the dependencies that the Python level metadata doesn’t cover:

  • pure Python (no undeclared dependencies)
  • undeclared runtime dependencies (e.g. calls external commands via subprocess)
  • undeclared “simple” build time dependency (specifically, C compiler + FFI headers, enough to let self-contained Cython pyx files and other cross-platform extension modules build)
  • other undeclared build time dependency (i.e. much of the scientific Python stack)

The metadata could also indicate the impact on the target system of the undeclared dependencies being missing, with at least “No functionality available”, “Some functionality missing” and “Performance degraded” as options.
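For concreteness, a sketch of what such categorisation metadata might look like. To be clear, these field names are invented for illustration and aren’t part of any current standard:

```python
import tomllib  # stdlib on Python 3.11+

# Hypothetical metadata sketch - these field names are made up for
# illustration and do not exist in any packaging specification.
metadata = tomllib.loads("""
[external-dependencies]
# one of: "pure-python", "undeclared-runtime",
# "undeclared-build-time-simple", "undeclared-build-time-other"
category = "undeclared-build-time-simple"

# one of: "no-functionality", "some-functionality-missing",
# "performance-degraded"
impact-if-missing = "some-functionality-missing"
""")

print(metadata["external-dependencies"]["impact-if-missing"])
```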

If this metadata were stored in the package repositories rather than necessarily in the sdists themselves, it could be retrofitted to existing releases rather than only benefiting new ones, but with the downside that caching proxies and other systems would need a way to pick up the supplementary metadata and keep it alongside the sdist.

Given that categorisation metadata, installation tools would be in a better position to tailor their UX to their target audience (e.g. defaulting to “binary only” for any package with undeclared build time dependencies).

Still far from a trivial problem to solve, but at least easier than the “arbitrary dependencies on external components” problem.


The only general answer that can be given to questions framed that way is “It depends on the specifics of your situation”, so a broad theoretical overview is the most that can be offered in an audience-neutral way.

Some folks seem to be labouring under the misapprehension that software distribution is intrinsically easy and it is the Python ecosystem that is making it unnecessarily complicated.

Software distribution isn’t easy - it’s spectacularly hard. There are good reasons why commercial software distributors place onerous restrictions on where and how you run things if you want to actually receive the support you’re paying them for (from “we run it and you access it remotely via an approved web browser” through “only available through particular platform restricted app stores” to “we have approved operating systems A, B, C, and hardware providers X, Y, Z, so use those or you’re on your own”).

The most we can aim for is helping to guide users to their best chance for success. There are definitely still opportunities to provide more off-ramps for new scientific Python users where we ask “Are you sure DIY integration is what you want over a pre-integrated platform?”, but that’s a far cry from claiming that the independent tools are presented as the only option. Sure, the tutorials specifically about those tools only cover those tools, but objecting to that is akin to complaining that a Django tutorial doesn’t suggest that folks might find Flask to be a better fit for their needs.


I’m not sure it’s well understood how conda works. It’s actually very simple at base: conda installs exactly what’s in the package – in theory, it knows nothing of what’s actually in there. In practice, there’s a fair bit of special casing of pip, but that’s the basic idea.

In fact, conda-build is extremely simple (I’m sure there are special cases, but in essence, as sketched in code after this list):

  • it sets up a bare environment
  • it builds and installs the package with whatever tools you specify (make, cmake, pip, …)
  • it then diffs the environment to see what was added – anything new is bundled into the package.
  • That’s it – it’s not looking for pip metadata, it’s not looking to see if any binaries are installed, it’s not looking to see if libs were statically linked, etc, etc, etc.
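In other words, the core of conda-build’s approach is just a before/after diff of the environment’s file tree. A minimal sketch of that idea (hypothetical paths; real conda-build has far more machinery around this):

```python
from pathlib import Path

def snapshot(prefix: Path) -> set[Path]:
    """Record every file currently present under an environment prefix."""
    return {p.relative_to(prefix) for p in prefix.rglob("*") if p.is_file()}

prefix = Path("/tmp/build-env")  # the bare build environment
before = snapshot(prefix)
# ... run whatever build/install tooling the recipe specifies here
#     (make, cmake, pip, ...) ...
after = snapshot(prefix)
new_files = after - before  # everything newly added gets bundled into the package
print(sorted(new_files))
```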

This is inherently a very flexible system, but also potentially fragile – it’s up to the package builders to make sure that things work well together (that’s what conda-forge’s “curation” is mostly trying to do). If you mix packages that aren’t compatible, it’s easy to make a big old mess out of it.

Exactly – metadata about a conda package is about the conda package, and only the conda package – its dependencies (which are other conda packages), what files it installs, etc.

However, in practice, most Python distributions packaged up with conda use pip to install them during the conda-build process. So they are installed exactly the same way as if they were pip-installed, and the pip metadata is there as well. This is done so that there can be some interoperability with pip: if you pip install something inside a conda environment, it will find the dependencies, if they are there.

But since the package was installed with pip (by conda-build) then it looks exactly the same.

I think that’s an impossible task – there is no clearly defined boundary. You can statically link (very common), you can bundle a dynamic library, you can rely on a system command; I’m sure there is a never-ending number of possible combinations.

I don’t think pip should do anything specific to conda – but it should try to be friendly to external package managers in general.

Maybe this could be as simple as an environment variable that says “PACKAGES_MANAGED_BY_EXTERNAL_SYSTEM”, and then pip would change its (default) behavior:

Off the top of my head (not thought out at all; a rough sketch follows this list):

  • not install dependencies by default
  • Give an “are you sure” warning for any install.
  • Not allow virtual environments to be created (though that’s handled by other tools)
  • turn off PEP 704 (if it gets approved)
  • TBD :slight_smile:
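As a strawman, the installer side of that could look something like the sketch below. To be clear, neither the variable name nor this behaviour exists in pip today; it only illustrates the suggestion above:

```python
import os
import sys

def confirm_install(package: str) -> bool:
    """Ask before installing when a hypothetical env var marks the env as managed."""
    if os.environ.get("PACKAGES_MANAGED_BY_EXTERNAL_SYSTEM"):
        reply = input(
            f"This environment is managed by an external system. "
            f"Really install {package}? [y/N] "
        )
        return reply.strip().lower() == "y"
    return True  # unmanaged environment: proceed as usual

if not confirm_install("requests"):
    sys.exit("Aborted: packages here are managed by an external tool")
```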

This is very similar to EXTERNALLY-MANAGED (NOTE: I couldn’t find that in the pip docs…), but IIUC, EXTERNALLY-MANAGED only applies to the base environment, which is a bit tricky here. Though maybe if the environment tools respected it, we’d be done.


Linux distro packaging mostly works the same way – any modifications to RECORD and INSTALLER files are done as part of the repackaging and build tooling. Since they want to prevent installation as well as uninstallation, PEP 668’s EXTERNALLY-MANAGED marker file is a better option for them.

However, removing RECORD only prevents operations that need access to the complete list of files owned by the package. Hence removing it only prevents uninstallation with Python-specific tools (since that needs the list of files to remove); everything else you might want (e.g. entry points, dependency satisfaction, metadata queries) still works.
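This is visible in the standard library’s metadata API; a quick sketch (any installed distribution works in place of pip here):

```python
from importlib.metadata import distribution

dist = distribution("pip")

# Version and metadata queries read METADATA, not RECORD:
print(dist.version)
print(dist.metadata["Name"])

# Only the file listing comes from RECORD; read_text() simply returns
# None when the file is absent, so renaming RECORD away would disable
# uninstallation without breaking the queries above.
print(dist.read_text("RECORD") is not None)
```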

On this front, I’m wondering if it might make sense to put the “DIY environment or pre-integrated platform?” question front and centre on the main packaging.python.org home page. There are ways we can frame that question such that it will make sense even to complete newcomers to programming, and it will provide an opportunity to point out that taking the DIY path is orders of magnitude more difficult for data analysis and machine learning use cases than it is for areas like web development and basic task automation (where wheels are readily available and reliable for the projects beginners are most likely to install).

The more specific tutorials could then include brief pointers back to that high level question.

There are certainly also developer workflow scenarios that will be affected. I think this deserves its own thread and a more complete write-up [1]. But as a basic one: say you’re working on SciPy, scikit-learn or another package that depends on NumPy, and want to test a patch to NumPy. Then the workflow is mamba env create -f environment.yml followed by cd /path/to/numpy/repo/with/patch && pip install . --no-build-isolation. This relies on pip re-installing the conda-installed numpy.

I’ll just state here that, beyond adding EXTERNALLY-MANAGED to the conda base env, treating conda like a Linux distro is not appropriate. The installer functionality of pip is widely used, and doesn’t have a good alternative within the conda ecosystem. And that’s maybe partly a gap in conda, but it’s also because pip (like PyPI) does have multiple separate purposes and they keep on being conflated.

That would be quite useful to add indeed.

Thanks @ncoghlan, I really like your whole post and this categorization in particular. There would be multiple benefits to doing this. A large fraction of the packages on PyPI are pure Python, and those packages are compatible both with Python itself and with extension-module packages from other package managers. And no other packaging ecosystem can keep up with repackaging pure Python packages. Making it easier to reuse pure Python wheels from PyPI directly will be quite helpful.


  1. I made a start on that and am soliciting input from the conda community at Defining and documenting how Pip should interact with Conda environments - Contributors - Conda Community Forum. ↩︎


It’s still not clicking for me why “lots of people pip install packages in conda managed environments because not everything is packaged in Conda Forge” is an argument against setting EXTERNALLY-MANAGED, while the exact same argument about apt managed environments was not enough to convince Debian’s Python maintainers not to use it (and now that it’s appeared in the last month or so, a lot of Debian users are starting to discover they can no longer pip install things into the system context like they used to, and have to significantly change workflows by starting to use venvs when they didn’t need nor want them before).

Maybe it’s that Conda is a more risk-averse ecosystem than Debian? Maybe it’s because the Conda maintainers cater to scientists and believe them less capable of adapting to new workflows or design compromises than systems administrators are? I just keep seeing people say the solutions catering to Linux distributions won’t work for Conda, and then give reasons which sound precisely like the reasons not to use them in Linux distributions either.

Perhaps it’s a matter of degree, and the user base of Conda is less tolerant of change than the user base of Linux distributions, or the maintainers of Linux distributions are more comfortable making breaking changes which impact user experience (in this case, short-term pain in service of longer-term sustainability/maintainability), but if so, that’s not coming across clearly in the explanations.

My understanding is that conda is considering using it in the scenario that’s most analogous to a Linux distro: the base environment where conda runs. That way you couldn’t use pip to break conda itself, but would still be free to mix & match in conda envs (just as Linux users can still do as they wish in a venv).

The idea @rgommers was objecting to was my suggestion to make pip “additive only” inside conda envs by removing or renaming the RECORD file in conda-installed Python packages (probably at install time, so all existing packages would be protected without needing to be rebuilt).

I still think the idea is worth considering, though, with a process like the following:

  • by default, “conda install” renames any RECORD files in dist-info directories to “RECORD.conda-managed”
  • a new “--allow-pip-upgrade” option would skip the renaming step

That way packages would be protected by default, but opting out would be straightforward in cases where it wasn’t desired (since no information is lost, it’s just moved to the side where other tools won’t look for it).
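For illustration, a minimal sketch of what that renaming step could look like (hypothetical helper; a real implementation would live inside conda’s own install logic):

```python
from pathlib import Path

def hide_record_files(site_packages: Path) -> None:
    """Rename RECORD so Python-specific tools can no longer uninstall the package.

    No information is lost: the file is moved aside rather than deleted,
    so an opt-out flag could simply skip (or reverse) this step.
    """
    for record in site_packages.glob("*.dist-info/RECORD"):
        record.rename(record.with_name("RECORD.conda-managed"))

# Example usage against a hypothetical environment path:
hide_record_files(Path("/opt/conda/envs/myenv/lib/python3.11/site-packages"))
```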


Are you referring to EXTERNALLY-MANAGED for the base environment, or for all Conda environments? That makes a fundamental difference as to the implications of what you’re suggesting (and the fact that such a question must be asked is itself arguably the biggest difference between Linux distros and Conda in this particular context).

As mentioned previously, there is an open issue for enabling EXTERNALLY-MANAGED by default for the Conda base environment (with an opt-out for sysadmins at install time): conda/conda#12245, which has general agreement from both the PyPA and Conda folks involved, with me having proposed it, @pradyunsg having opened the issue, and @rgommers and @jezdez both having commented in agreement. The only open question is how best to handle existing installations, balancing rolling it out to existing users against breaking environments with existing pip-installed packages. Given there’s already clear agreement in favor of it among those who’ve spoken up, it’s unclear what more the above is asking for with regard to base.

Conversely, applying it to non-base environments (at least by default) is a non-starter, for the same reasons it does not apply to venvs created from a Linux distro’s system Python, which are the closest (and really, effectively the only) Linux distro equivalents to Conda environments for this purpose. If users need to use non-Linux-distro packages, they create a venv; likewise, if users need to use non-Conda packages, they create a Conda env (and ideally install everything above Python itself with pip, or install just the tip-of-the-stack pure Python packages with pip after installing the rest with Conda).

If you deliberately block the latter with EXTERNALLY-MANAGED, then the large fraction of users who at some point need at least one version of one package that isn’t available on their Conda channel of choice, or need to install their own packages, do an editable install, install from GitHub, etc., have no option except to figure out how to pass a big scary flag every time they do. That means they either get accustomed to it, and are then much more likely to use it in base too (which is far more dangerous), or they give up out of frustration and can’t use Conda at all.

I see, so the objection to using EXTERNALLY-MANAGED in non-base environments is that Conda users see venvs as an unnecessary extra layer of isolation (akin to the arguments against requiring them in container images)? Otherwise, in theory, a venv made with --system-site-packages would work the same in a non-base Conda environment as in an unprivileged Linux distro shell environment, I would think.

So it seems like the objection is that it would mean additional overhead and complexity, not that it would actually break. (I don’t count breaking workflows in this; clearly that is something on the table, since Linux distros have agreed to breaking user workflows in the same ways.)

For Conda, this is much more than just arguably unnecessary or, at worst, a minor annoyance for some use cases (like venvs in container images); rather, it is actively harmful and would indeed cause a lot of eventual breakage (the aforementioned PEP 704 discussion goes somewhat into that). Having a venv on top of a Conda env means:

  • Users will always need to remember to activate both layers of environment isolation instead of one (and possibly in the correct order), and if they don’t, they’ll get unexpected, broken and likely confusing results (unlike with Docker, where once the image is spun up there’s only one layer, the venv, and no way to activate the latter without the former). Given that a large proportion of Conda users are non-programmers (scientists, engineers, data analysts, students, etc.), for whom it is already hard enough to learn to create, activate and use a conda environment (a single, simple cross-platform command), it’s much more difficult to ensure they remember not only to do conda activate env-name, but then also the much less friendly and platform-specific method of activating a venv; this introduces an order of magnitude more complexity and failure modes.
  • The existing Conda-pip interoperability mechanisms won’t function, since the Conda env has no real way of determining whether some arbitrary venv may or may not be activated on top of it
  • Duplicate or conflicting packages can much more easily be installed on the Conda side
  • Conda cannot detect, much less offer to fix, any incompatibility, missing dep, conflict or other problem that might arise (as it can now)
  • If the Conda env is removed, renamed, recreated, the Python version upgraded/downgraded, etc. (all a fairly routine and recommended occurrence with Conda) all associated venvs will silently and perhaps irreversibly break
  • The current situation is AFAIK better than it used to be, to the point where basic venv/virtualenv is usable for at least simple cases (it apparently took a bunch of work by the Conda folks to even get that working), but there have historically been lots of bugs reported against it and I have no idea how stable it is in the many edge/corner cases.

On a related note, looks like Kali Linux just enabled EXTERNALLY-MANAGED and now their users are flooding the CPython issue tracker…

I feel like I need to repeat my clarification from the other thread, since we’re now confused over here.

The Conda “base” environment is in no way equivalent or analogous to the venv “base” environment. In Conda, base just happens to be the name for “the Python runtime dependency required by the conda tool”.

Users should only ever touch this if they’re updating Conda (or adding an extension). It’s not intended for normal use. Hypothetically, if Conda were to be rewritten in a different language, the base environment would no longer include Python, because it would cease to be a dependency of Conda.

Any other Conda environment is entirely standalone from the one Conda is running in. Every Conda environment is analogous to a virtual environment, and none “inherits” anything from any other environment.

This is why it (a) makes very little sense to create a venv on top of a Conda environment - you already have all the isolation you are asking for - and (b) makes no sense to prevent users from installing into their Conda environment by default - they are not externally managed, they are entirely managed by the user.

So please, stop conflating the Conda base environment with regular Conda environments. It’s only ever brought up because Conda has historically not warned their users away from it well enough. That’s not in any way a PyPA issue to fix.


Indeed, I think this discussion does not need to involve anything about conda’s base environment. It’s a problem that we’ve either (a) got no control over or (b) already provided all the relevant mechanisms to deal with, and in any case (c) this isn’t the right forum to discuss it.
