The 10+ year view on Python packaging (what's yours?)

You may be underestimating what a package manager has to do. In particular, if it’s going to do dependency resolution (and people will want it to), then it may have to solve genuinely difficult mathematical problems (dependency resolution is NP-complete in general) that will indeed be CPU-bound.

I don’t think that is true. The reason is simple: many, many end-users/developers will eventually be in a situation where they have an old project using Python 3.X but start working on a new project that needs (or uses a library new enough to need) a later Python 3.Y. And they don’t want to upgrade the Python version in the old project, because that would require updating a bunch of libraries as well, and they don’t want to disturb the working configuration they have.

That situation is unsolvable without Python version management. Moreover, even if you don’t absolutely need the later version for the second project, you might still want it, or want to try something out, without having to screw up your carefully prepared dependency stack in existing environments.

It’s the same reason people use venvs in the first place: they want to have independent environments, so they can, e.g., try upgrading a library in one environment to test it with a given project, or because different projects have incompatible requirements. All of those situations apply just as well to the Python version.

I see it as much better in the long run to shift to regarding Python as “just another dependency” — because, well, it is a dependency. When you have a piece of Python code, Python is part of what it needs to run, and there are constraints on which versions of Python will work, just as there are for other libraries. It’s only the limitations of current tools that have made us accustomed to managing Python separately from all the other things in the environment.
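
To make the “just another dependency” point concrete, here is a minimal sketch of checking the running interpreter against a version specifier, exactly as a resolver would check any library version. It uses the third-party packaging library (which pip vendors); the >=3.9,<3.13 constraint is a made-up example, and it is the same specifier syntax that requires-python accepts in pyproject.toml:

```python
import sys

from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Hypothetical constraint -- the same syntax used for any other dependency,
# and what `requires-python` expresses in pyproject.toml:
requires_python = SpecifierSet(">=3.9,<3.13")

# Treat the interpreter's version like any other package version.
current = Version(".".join(str(part) for part in sys.version_info[:3]))
print(current in requires_python)  # True only if this interpreter satisfies it
```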

It may not be needed in such cases, but having it available does no harm. Moreover, in the long run, I don’t see any reason why the Python provided by the OS, or by python.org, should not actually be a manager that manages Python. In the case of Linux distributions, for instance, I think this would go a long way towards alleviating the problems we currently have with “the system Python”. If “the system Python” is just one environment among many, then it becomes much easier to explain to people that they should create an environment for their own work that is separate from it, and it becomes easier for them to do that as well, because the tool to do so is already built in.

That is potentially true, and I agree that it might be the rare case where the manager goes unused. However, as you say, such platforms are niche. The vast majority of cases will not require such special handling. Plus, people who have to deal with such situations will usually already know they’re likely to have to do some extra work to get things working (e.g., compiling Python themselves).

2 Likes

Oh wow, I didn’t know there are people using it in production. Do you know how ABI compatibility is for the bundled compiler? I’ve been wanting to make more use of this as a better build-backend solution (my POC) than how setuptools detects and uses a local compiler, but haven’t, due to worries about compatibility, since CPython is obviously not built with the same compiler.

I haven’t delved deeply enough into that to give an accurate answer, since we only target a specific minor version of Python at any given moment.

I see, thanks. At least I know it works for one person on one C compiler version against one Python build, a 100% increase from what I knew yesterday. I assume you’re on Linux and a GCC-compiled Python?

Yes, that is correct, but there was actually a commit on the PR where I forgot to gate the patch to CentOS 6, and it also passed on the other platforms.

There’s another difference between pip and conda that I find notable but keep forgetting to mention in these threads. When you install or remove a package, pip only calculates the “forward” dependencies (i.e., what is needed to make the package you specify work), but conda also considers the “backward” dependencies (i.e., when you do what you asked for, will anything else stop working). Conda tracks the entire environment state, but pip only considers the packages you’re asking it to deal with.

For instance, if package A depends on package B, and you install A, both pip and conda will install A and B. But if you do pip uninstall B, pip will do it, leaving A in a broken state.[1] Conda won’t allow this; if you do conda remove B, it will show you the plan of what it’s going to remove, which will include removing A, because it knows that A depends on B and thus won’t let you break the environment by removing B without also removing A.
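
As a rough illustration of what “backward” awareness involves, here is a minimal sketch that computes reverse dependencies from installed metadata, using the stdlib importlib.metadata plus the third-party packaging library. The function name is mine, and a real solver like conda’s does far more than this:

```python
from importlib.metadata import distributions

from packaging.requirements import Requirement

def reverse_dependencies(target: str) -> list[str]:
    """Installed distributions whose metadata declares a dependency on `target`."""
    def norm(name: str) -> str:
        # Crude approximation of PEP 503 name normalization.
        return name.lower().replace("-", "_").replace(".", "_")

    dependents = []
    for dist in distributions():
        for req_string in dist.requires or []:  # declared requirements, if any
            if norm(Requirement(req_string).name) == norm(target):
                dependents.append(dist.metadata["Name"])
                break
    return dependents

# A non-empty result means removing `target` would leave something broken:
print(reverse_dependencies("numpy"))  # e.g. ['pandas'] if pandas is installed
```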

The same also applies when changing package versions. If for whatever reason you upgrade or downgrade a package, conda will force you to update the versions of other packages so that the overall dependency graph stays consistent. That is, if you upgrade package B (the dependency) to 6.5, but the installed version of package A declares that it depends on B<6.0, conda will make you either upgrade A (if there is a newer version that can handle the newer B) or remove it.
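
The consistency check at the heart of that behaviour is easy to sketch with the packaging library (the names and versions here are just the ones from the example above):

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

declared = SpecifierSet("<6.0")  # installed A declares: depends on B<6.0
proposed = Version("6.5")        # the version of B you asked to upgrade to

# False -> the graph would be inconsistent, so A must be upgraded or removed.
print(proposed in declared)
```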

Few will be surprised to hear that I think the conda way is better. :slight_smile: In my experience, a lot of the frustration people have with Python package management comes from getting into a situation where they try upgrading, downgrading, or removing packages for some reason (e.g., to match versions available on a server they want to use), and then find that in doing so they’ve put the environment into a broken state by getting the dependencies out of sync. Ensuring the correctness of the entire dependency graph on every package-change operation prevents this; the only way you can break the environment is if a package itself is broken or its metadata is incorrect.

There are a couple of downsides. One is that you have to resolve the full dependency graph on every change, which has become notorious as a problem with plain conda: if you’ve accumulated a lot of packages in an environment, trying to install or upgrade one can take a long time, because conda has to make sure it can do what you asked without breaking some other package that may be quite far away in the dependency chain. This has been most egregious in the case where what you’re asking for isn’t possible (e.g., because of conflicting dependencies), as conda can then go on wild goose chases through all manner of odd versions of things in an attempt to give you what you asked for. With mamba and the libmamba solver, this problem seems to be mostly alleviated.

Depending on how you look at it, another downside is that it becomes harder to do a “manual override” and force the installation of a package whose declared dependencies are incompatible with a given environment state. Conda essentially treats the package metadata as gospel and makes it very hard or impossible to say “I know this package says it won’t work with the versions I have installed, but trust me, it will; go ahead and install it anyway”. Overall, though, this seems to me a case where it makes more sense to go with the approach that helps more people. Most users who encounter that kind of situation should stop and fall back to another option (e.g., downgrading all the involved packages to older versions that are known to work together), rather than try to force an install.
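
For comparison, pip’s escape hatch for exactly this is the --no-deps flag, which skips dependency handling entirely; as far as I know, conda has no direct equivalent. A sketch (the package name here is hypothetical):

```python
import subprocess
import sys

# "Trust me, it will work": install a package without checking, touching,
# or installing any of its declared dependencies.
subprocess.run([sys.executable, "-m", "pip", "install", "--no-deps", "somepkg"])
```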

Most importantly, I think there is a baseline user expectation, one that’s worth catering to: “whenever I ask the manager to do something, and it finishes and tells me it did it, the environment should be in a working state”. In other words, it’s not just a package manager, it’s an environment manager. It manages the environment as a whole and makes sure that, at all times other than in the middle of an operation, the environment is in a consistent, working state. Pip doesn’t satisfy this criterion; when you tell it “uninstall numpy”, it just says “okay”, and doesn’t tell you that in doing so you broke pandas.
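
To be fair, pip does ship an after-the-fact audit of exactly this invariant, pip check; the gap is that it is not run automatically as part of install/uninstall. A sketch of invoking it (the output in the comment is illustrative):

```python
import subprocess
import sys

# Audit the current environment's installed set for broken dependencies.
# After `pip uninstall numpy` with pandas still installed, it reports
# something like:
#   pandas 2.2.2 requires numpy, which is not installed.
subprocess.run([sys.executable, "-m", "pip", "check"])
```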

I was curious about Poetry, so I took a look. What I found is basically that this situation highlights that Poetry is primarily not a package manager, nor an environment manager, but a project manager. Poetry won’t even let you install any packages at all without having a pyproject.toml. Whenever you install something, it not only installs it in the environment but also updates the pyproject.toml to list it as a dependency. If you poetry add pandas and then try to poetry remove numpy, it will say it can’t even find numpy, because numpy isn’t part of the declared dependencies of your project. So in this sense Poetry also manages the environment as a whole, and goes beyond that by keeping the environment in sync with a project file like pyproject.toml.

Personally I’m not as much of a fan of that, as I like to keep a few “general purpose” environments to play around in, rather than having to start with a named project. I also tend to develop by incrementally building on a small core that arises from that playing around, and I prefer to be able to do almost all of the development before thinking at all about packaging the results for distribution. If I had to create a dummy project called something like “sandbox”, that wouldn’t be the end of the world. As I understand it, though, Poetry really wants everything in the project to be in a single directory subtree, which would preclude having a single environment that’s used for multiple experimental, nascent “projects”.

What’s most interesting to me here is the difference between package management (pip), environment management (conda), and project management (Poetry). My preference is for environment management: ensure the environment is in a consistent state, and layer project management on top of that, rather than requiring a one-to-one mapping between projects and environments. Ideally, though, that project management would be done by a tool that is integrated with the environment management tool (e.g., something like conda createproject blah).


  1. My example for this is pip install pandas followed by pip uninstall numpy, which is not the most convenient one, since those libraries are fairly large, but it is the one that comes most easily to mind. ↩︎

6 Likes

Take installed packages into account when upgrading another package · Issue #9094 · pypa/pip · GitHub is the tracking issue for changing this.

2 Likes

No, definitely not! The wasted time for me is all the time I’ve had to spend over the years dealing with packaging matters: from missing wheels, to failing wheel builds, to broken environments caused by pip bugs or by pip not caring about conda, to dependency hell, plus having to learn all the little differences between the various ways of building a project…

As an app or library developer, I do not want to have to spend any time on complicated packaging matters.
I’m convinced this can be made simple, but in general it’s like a labyrinth.

Personally, I think that the packaging discussion is the most important discussion going on at the moment.
My own ideal is also pretty similar to @jeanas’s – I think we should

  • consolidate on one config file (pyproject.toml) + one lock file (no longer supporting the current mess of various .ini files, setup.cfg, pyproject.toml, and setup.py for extension modules all mixed together)
  • get rid of setup.py completely
  • make build and packaging entirely data-driven by the config file (+ lock file)
  • have installs always check the transitive closure of all dependencies, so as to preserve the internal version consistency of an environment

So, perhaps one big question is whether a package manager (builder/installer) should also be an environment manager. I think it needs to be. This does complicate things a little, of course :slight_smile:

1 Like

Indeed. I also share that sentiment :stuck_out_tongue_closed_eyes:

I think the main focus of this discussion should be:

  • preserving consistent, working environments.

And the second:

  • simplifying/standardizing the packaging process.

Please keep the tone cordial. I don’t think you can entirely blame pip for conda compatibility, nor for dependency issues, when the volunteer team of contributors is doing the best that they can.

7 Likes

The link Take installed packages into account when upgrading another package · Issue #9094 · pypa/pip · GitHub that @pradyunsg posted will hopefully go a long way toward alleviating what I personally consider to be the main issue, but it seems no one has volunteered to take that on yet… (I am also considering whether I could contribute…)

I don’t think my tone was really non-cordial, btw. I think it’s healthy when people sometimes express certain frustrations, which is what I did. I definitely did not want to diss or dismiss the work of the devs working on pip; in fact, I believe pip has been steadily improving over the last few years. But I also think there is still a long way to go, especially since pip is advertised as “the” official Python installer (PEP 453). That creates high expectations…