Trying to learn how to manage dependencies, stuck at conda vs pip

Background

So far I’ve worked on relatively simple Python projects, so I’ve only used Anaconda for installing packages and have never needed dependency management.

Needs

Recently, I found some merit in mixing in some R via rpy2, and I’m also looking at Python packages that need pip for installation or to be built from source. Easy conda install/imports don’t seem good enough to keep a project stable anymore. My ignorant intuition is that I need:

  • explicit specification of dependencies and their versions
  • automatically check for and resolve conflicts among those dependencies and their own (transitive) dependencies, with manual control over version edits if necessary
  • make a copy of the versions specification that I can edit to update dependencies and see if the code breaks
  • automated dependencies update mechanism, with conflict resolution of course

What I’ve read so far

I’m pretty sure package managers and virtual environments are how I should do this, but I’m really confused about what to use. I feel like I’m awash with many sources on how to use several specific tools, but I lack general concepts on the why and how of installing and managing dependencies to make sense of it. Here’s some questions I have:

  • Many sources say that pip only handles Python code and conda can track many languages, and that this difference is important for packages using multiple languages e.g. Scipy. However, pip’s PyPI has Scipy, so what’s the accurate story here?

  • There seem to be a lot more packages on PyPI than in Anaconda’s registry, so that’s one point in favor of pip. But what do people do when they need to mix languages, or build packages from source?

  • Why is it hard for a package manager to manage a package not in its registry? For instance, I don’t know why I can’t just download a Python package and let conda track it like any other package, minus the simple commands for installs and updates from the registry.

  • When Anaconda lacks a package, a suggested fallback is to use pip within a conda environment, but it comes with tons of warnings of conda failing to manage version changes and conflicts, sometimes breaking apps like Spyder. Why can’t one package manager seamlessly accept packages installed by another package manager for the same language? Shouldn’t the package itself contain metadata both managers can use?

First, let me clear up some terms about (Ana)conda as I will use them later in my reply:

  • Anaconda is the company that mostly develops conda and also provides a vast set of curated packages in the defaults repository.
  • conda is the package manager that you are using when you are talking about “anaconda”.
  • anaconda and miniconda are both installers for conda-based environments. anaconda comes with a wide package selection and is probably only useful if you want a quick start. If you want to create more environments and work with your conda setup for a longer time, you should go with the miniconda distribution, which only contains a minimal set of packages in the base environment.
  • mamba is an alternative implementation of conda that speeds up some parts (e.g. the requirements solving process)
  • conda-forge is a community-maintained distribution, i.e. it provides a vast selection of packages like defaults. In contrast to defaults, it contains an order of magnitude more packages, but it is run by the community rather than curated by a single vendor.
  • miniforge and mambaforge are basic installers for a conda setup with conda-forge as the default source. mambaforge is miniforge with mamba pre-installed.
  • Many sources say that pip only handles Python code and conda can track many languages, and that this difference is important for packages using multiple languages e.g. Scipy. However, pip’s PyPI has Scipy, so what’s the accurate story here?

scipy contains C++ and Fortran code. The PyPI wheels also bundle third-party binary dependencies like openblas inside the wheel. The conda-based installation of scipy tracks scipy’s third-party dependencies separately. This means that in the conda case scipy and numpy don’t install their own BLAS copy, and also that you can use the package manager to select a different BLAS implementation (openblas, MKL, Apple Accelerate) without needing to re-compile scipy.
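To make that BLAS-switching point concrete, here is a sketch of an environment.yml using conda-forge’s documented libblas build-string mechanism (the environment name and pins are illustrative):

```yaml
# Hypothetical environment.yml: choose the BLAS implementation at the
# package-manager level, without recompiling numpy/scipy.
name: blas-demo
channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy
  - scipy
  - libblas=*=*mkl   # or "*openblas", or "*accelerate" on Apple Silicon
```

Changing only the libblas pin and recreating the environment swaps the BLAS backend underneath the same numpy/scipy builds.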

  • There seem to be a lot more packages on PyPI than in Anaconda’s registry, so that’s one point in favor of pip. But what do people do when they need to mix languages, or build packages from source?

To get a package into Anaconda’s registry, you need to reach out to them. If you want a conda package for a package on PyPI, though, you can contribute it yourself to conda-forge by making a pull request to the conda-forge/staged-recipes repository on GitHub (a place to submit conda recipes before they become fully fledged conda-forge feedstocks). For most packages on PyPI, conda-incubator/grayskull generates the conda recipe for you; some might need minor adjustments. Only complex projects like tensorflow need a bigger effort on the conda(-forge) side.

  • Why is it hard for a package manager to manage a package not in its registry? For instance, I don’t know why I can’t just download a Python package and let conda track it like any other package, minus the simple commands for installs and updates from the registry.

  • When Anaconda lacks a package, a suggested fallback is to use pip within a conda environment, but it comes with tons of warnings of conda failing to manage version changes and conflicts, sometimes breaking apps like Spyder. Why can’t one package manager seamlessly accept packages installed by another package manager for the same language? Shouldn’t the package itself contain metadata both managers can use?

conda can do this partially, but not perfectly. This is because the metadata of packages from PyPI is in some cases less detailed than it would be as a conda package, and also because some packages have a different name in the conda ecosystem than they have on PyPI, e.g. because their name clashes with a (more popular) C++ library.

Shouldn’t the package itself contain metadata both managers can use?

PyPI only has Python-related metadata; conda knows more things like third-party dependencies, minimal glibc version, CUDA availability, …
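For illustration, here is a hypothetical fragment of a conda recipe (meta.yaml); the package names are made up, but it shows the kind of metadata conda records that PyPI’s metadata has no field for:

```yaml
# Hypothetical meta.yaml fragment for a conda package
requirements:
  run:
    - python >=3.9
    - numpy              # Python-level dependency (PyPI metadata has this too)
    - libopenblas        # third-party shared library, tracked separately
    - __glibc >=2.17     # virtual package: minimum system glibc (Linux)
```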

Also, be aware that PyPI provides source and binary packages, whereas conda is purely a binary package manager.


To avoid Spyder breaking due to installing packages for your own use, you can use our standalone installers or create a separate conda environment for Spyder (or run Spyder from base), and then use a different working environment for your own packages and code, which you can create with conda and select from the list under Preferences --> Python interpreter in Spyder.


The problem with this solution is that Spyder does not activate the conda environment of the other Python interpreter when running the user code.

Because conda provides binary shared-library dependencies as part of the conda environment, Python libraries can break with this approach: those dependencies are not on the PATH until the conda environment is activated.

I currently support an Anaconda distribution in a large company with 1000s of installs, and the advice we’ve found most pragmatic is to tell users to install their dependencies into a conda environment specific to their project, then once happy run conda install spyder in that environment and use that version of Spyder, keeping the default Python interpreter setting.


At least in modern, supported versions of Spyder, it should activate the environment and user code should run normally, just as if the working environment happened to be the one Spyder itself is running in, and if it doesn’t, that’s a bug which we likely can and should fix.

We’ve improved this substantially over the past year or two, and despite working with numerous environments and some pretty hefty and finicky binary dependencies all running outside of Spyder’s own (NumPy/SciPy/Matplotlib/Pandas, GDAL/PROJ4/Shapely/Geopandas, Qt stuff, HDF5/netCDF4, etc.), I can’t recall recently running into a case where user code worked in Spyder’s runtime env but not in another env, at least not due to the type of issue you’re describing.

However, if you’re still experiencing issues running user code in the latest supported versions of Spyder + Spyder-Kernels (from Conda-Forge, pip or the standalone installers), we’d be happy to take a look. Please report them to our issue tracker along with the full details requested there. You can @ me over there if you like (same username).

Our base Spyder is stuck on whatever version comes with the Anaconda distribution, which is currently spyder 5.1.5 and spyder-kernels 2.1.3. And I definitely have users often reporting this as an issue when they try to use Preferences --> Python interpreter.

I will recommend that they try creating an isolated Spyder version off the conda-forge install by doing conda create -n spyder_env spyder -c conda-forge and launching that version and see if it fixes their issues.

Thanks for the pointer.

So what I’ve gathered so far here is:

  • PyPI packages can handle code from other languages just fine with wheels (I still need to learn more), but conda is able to manage those dependencies more independently, which would be useful in large multi-language projects using many libraries that might be updated often.
  • Incompatibilities between pip and conda come down to metadata differences; mix them only in isolated, expendable environments with backups.

After some more reading, people seem to be saying that conda is more usable (more robust dependency resolver, easier to update dependencies including Python itself), but I feel like I have to go with PyPI for having more packages, including a couple I want. I don’t plan on using a lot of rapidly updating non-Python dependencies, either. But before I dive into the pip/venv/virtualenv/pipenv rabbit hole, does anyone have more comments on how conda users can use packages not in the defaults or conda-forge repositories, such as PyPI packages? If it’s not actually a limitation in practice I might stick with conda instead.

python -m pip install blah

Sometimes conda goes a bit funny, and might not be able to properly track the package for updates etc, but I’ve never had a genuine problem, and I use conda (miniconda) as my main environment manager.

If something goes catastrophically wrong, it’s also fairly easy to remove and recreate the environment with environment.yml files (these files can also have a ‘pip’ key to install from PyPI). Just make sure not to mess up the base environment, that’s much harder to recover from!
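For reference, here is a minimal sketch of such an environment.yml with a ‘pip’ key (the package names are just examples):

```yaml
# environment.yml: conda packages plus a pip section for PyPI-only packages
name: my-project
channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy
  - rpy2
  - pip
  - pip:
      - some-pypi-only-package
```

Recreate the environment with conda env create -f environment.yml; conda installs its own packages first, then hands the pip: list to pip.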



There’s no need to use venv or virtualenv (and pipenv is a whole different ball game, as it’s more akin to a tool like Poetry); you can still use pip just fine with conda environments. For that, just install Python itself with conda in a fresh environment, e.g. conda create -n pip-env python=3.9, then install everything else with pip, which will work as normal.

In fact, if you know what you’re doing, you can actually get away with mixing pip and conda packages just fine, particularly if they are pure-Python and near the top of the dependency chain, so long as you’re careful and follow a few rules:

  • Never install anything into your base environment; just use it for conda itself. I suggest using Miniconda (or better yet, Miniforge) instead of full-fat conda.
  • Either use Miniforge, or run conda config --add channels conda-forge to use the conda-forge channel, which has a much wider selection of packages than the default AD channels.
  • Use pip installs only for packages that need it; even if you have a package that can only be installed by pip, typically most or all of its dependencies will be available through conda-forge, so you can manually install those first with conda (you might need to look at the package’s pyproject.toml, setup.cfg or setup.py to find them), then install the package itself with pip.

I’d suggest having a look at anaconda-project, which is a command line tool to manage a project with conda.


With multiple people saying that I can contain any buggy pip-conda mixes to an environment and can recreate a working one easily, I’m going to try to use conda first. I bet I’ll have more questions as I learn it.

For now I do have a question about this:

How would the pip-installed package know to use the conda-installed dependencies? I assume there must be a limit to how well conda can handle pip-installed packages, otherwise there wouldn’t be problems mixing them, but I don’t understand what the limits are.

Without going too deep into the technical details: at runtime (e.g. in your Python code, when importing and using the package), packages installed by pip or conda can usually work together just fine, since they have the same standard metadata, are installed in the same place, are imported using the same name (which may or may not match the exact name of the distribution package you installed with pip or conda) and have the same basic structure, so Python can read them.

If you, say, installed a package spam using pip, so long as compatible versions of its dependencies were installed via conda (with versions that package supports, and without major binary incompatibilities if spam is a heavy duty scientific package like numpy or pytorch), everything should run fine (barring corner-case issues).

At install time (e.g. when installing, updating or removing the package), conda packages provide all the metadata pip is expecting and, when installed, have essentially the same on-disk structure, so pip can see that a compatible version is installed, and can even (nominally) update and remove them. However, in order to do its job, conda uses additional metadata on top of what pip uses. So while conda will detect that packages you uninstall/update with pip are no longer installed by it, it will not see them as properly installed (which, to it, they aren’t) and will attempt to reinstall missing/incompatible dependencies unless you tell it otherwise. (Note: all this was tested on conda 4.8; it may not hold for more recent conda releases, which I haven’t yet tested in the same depth.)
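As a small illustration of that shared install-time metadata (a sketch, not from the thread): both pip and conda write a standard *.dist-info directory per package, which the standard library’s importlib.metadata reads no matter which tool did the install. The optional INSTALLER file records which tool wrote it:

```python
from importlib import metadata

def installed_packages() -> dict[str, str]:
    """Map distribution name -> version for every package Python can see,
    regardless of whether pip or conda installed it (both write *.dist-info)."""
    return {dist.metadata["Name"]: dist.version
            for dist in metadata.distributions()}

def installer_of(name: str):
    """Return the contents of a package's INSTALLER record (e.g. "pip"),
    or None if the file is absent -- it is optional metadata."""
    text = metadata.distribution(name).read_text("INSTALLER")
    return text.strip() if text else None
```

For example, installer_of("pip") typically returns "pip" in a pip-managed environment; conda’s extra channel/build metadata lives outside this shared format, which is exactly what pip cannot see.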


Already plenty of good advice in here, particularly this post.

All I’d add is a bit of additional clarity: anything that you can install with conda, you should, and do it first. Then do anything with pip, and watch the logs to make sure it doesn’t upgrade/uninstall any existing packages. That’s when things break (it shouldn’t do this these days, but keep an eye on it until you’re comfortable).

If you specify too many versions too precisely, your installs will break. Either keep version specs really flexible, or use conda env export to get “locked” versions. If you’re using environment.yml files, you can list dependencies to be installed with pip in there and conda will do it for you.


Thanks @steve.dower: you beat me to it.

A few more notes:

conda-forge has many more packages, and they are more up to date, than “defaults”. I highly recommend using it.

NOTE: “other languages” is a tad confusing – e.g. part of scipy is written in Fortran, but it is a Python package, so you only need Fortran to build it, not to use it (which is why the binary wheels on PyPI can work). But conda can manage stuff that has nothing to do with Python: R, Julia, etc. (as well as Python itself). I use it to manage Node.js and MongoDB, for instance.

If everything you need to use is available via pip you may be just fine sticking with pip and the virtual environments it supports.

In short (my personal recommendations)

  • Use conda if you need things that can not be installed via pip
  • Use miniconda, rather than Anaconda, if you don’t want the GUI Anaconda provides
  • Use Environments! Even if just one, so you don’t accidentally make a mess of the base environment
  • Install anything you can via conda before installing via pip
    • for example, for my complex projects, I have conda_requirements.txt and pip_requirements.txt files; I do conda first.
  • add the conda-forge channel – it will really help.
  • don’t be afraid to rebuild your conda environment whenever anything changes – it works much better if you pass the entire set of requirements in at once, rather than adding things one at a time.
conda create -n my_env_name --file conda_requirements.txt

(That last point was a game changer for me!)
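For reference, a conda_requirements.txt for conda create --file is just one requirement spec per line (these package names are illustrative):

```text
# conda_requirements.txt -- feed the whole set to conda at once:
#   conda create -n my_env_name --file conda_requirements.txt
python=3.10
numpy>=1.21
scipy
rpy2
```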


Or you can use Miniforge to kill two birds with one stone.


Well I was never a fan of conda update --all taking so long in the base environment of the Anaconda distribution, and I never really bothered learning the GUI, so yeah I think I’ll start with a trimmed version and learn to install just what I need in new environments.

How reliable is Miniforge? From what I could read, it’s basically Miniconda but it’s community-managed, uses conda-forge by default, and was created to support more architectures. But is that really all? A few questions:

  • How well-maintained is it compared to the company-run Miniconda?
  • Really no usability differences? Like do they have the same access to channels? It wasn’t hard to do conda install -c conda-forge ... when I needed to, and maybe I might need to install something from the defaults channel sometimes.
  • Miniforge’s GitHub page lists “x86_64, ppc64le, and aarch64 including Apple M1”, but this Miniconda installation page already lists all those. What’s up with that – did the existence of Miniforge prompt Miniconda to support these recently?

As far as I understand, the only difference between Miniconda and Miniforge is the default channel. Otherwise, they’re both going to contain conda based on the same sources (potentially different versions, depending on timing, but it’s updateable).

The main difference between conda-forge and Anaconda’s channels is the level of support you can pay for (none, and lots, respectively).

If you’re happy with volunteer-maintained build scripts run on public CI using freely available libraries, conda-forge is just fine. If you want/need to know they were built on isolated infrastructure and using paid-for libraries and compilers (i.e. they run faster), you’ll want Anaconda’s repository. Anaconda’s repository also tends to move a bit slower than conda-forge.

Due to how conda resolves binary dependencies (better than pip, to be clear), mixing Anaconda’s packages with conda-forge’s packages doesn’t always work well. If you need to use some things from conda-forge, it’s likely you’ll end up with almost everything coming from conda-forge (because e.g. the metadata says that a compiled module is only compatible with conda-forge’s Python build, and not Anaconda’s Python build), so you probably want to default that way.

So basically, it comes down to:

  • Are you broke? Use Miniforge/conda-forge
  • Not broke, but don’t need any help? Use Miniforge and send NumFOCUS a donation for conda-forge.
  • Making real money from using it? Talk to Anaconda

Indeed, and you shouldn’t ever run conda update --all there. Instead, if you are using base in full-fat Anaconda, the correct approach is to run conda update anaconda, which will update everything to the latest release of the anaconda metapackage. This is both much faster and more reliable, since everything is tested to work together and the versions are specified exactly, so the solver doesn’t need to churn through a huge range of possibilities.

I’d have to look closer, but they might be referring to the CF channel that Miniforge defaults to supporting more architectures than AD official, or to other work on improving stability on unofficial architectures.

There are a few other important differences to note (some of which are alluded to later in the post):

  • CF contains a much larger number of packages, a near-superset of those available on AD (except for ones that aren’t FOSS). As @steve.dower mentions, you’ll want to avoid mixing channels if at all possible, so you should go CF from the start if you need a CF-only package.
  • You can (relatively easily) add your own or others’ packages to CF, which allows you to avoid mixing conda and pip in the first place.
  • While it used to be less, AD can lag a significant amount behind CF these days, which can be a problem in the fast-moving PyData ecosystem. For example, AD is still stuck on Spyder 5.1.5, whereas CF is up to date with Spyder 5.3.1, which (among a number of other things) includes compatibility fixes for breaking changes in versions of IPython released after Spyder 5.1.5 was out.
  • Use of the AD channels falls under the AD ToS, so you’ll want to purchase AD Enterprise if you’re using them at any kind of scale (which is also required to net you most of the real benefits of AD, the support and services).

These issues, which are not all that different from (if not quite as serious as) those we’ve experienced with distro packages, have unfortunately led to lots of user complaints for us (Spyder). That is one reason why, at least for volunteer support, we now officially recommend our own standalone installers or CF instead of the full-fat AD defaults channels, though we (and they) still maintain compatibility with a variety of other installation methods (AD, pip, distros, WinPython, Fink/MacPorts, Binder, source, etc.).

