Idea: Allow installation of package local dependencies

I was going to post this to Python-Ideas, but I suspect it has probably already been discussed in the past; if so, could someone please point me to that discussion? Also, if this is the wrong place to post this, I’d appreciate guidance on where to post it.

The high-level idea is to allow a package to be installed locally for another package. E.g. say you are installing packages A and B and they both depend on package C, but on conflicting versions of it. You would have the option of installing two versions of C, one visible to A and one visible to B.

The motivation is that maintaining very large Python applications in a pip 20.3+ world has become very complex. For example, the full version of homeassistant installs over 1000 dependencies and the full version of apache-airflow installs over 450 dependencies. A lot of time and effort is put into avoiding conflicting transitive dependencies, something that is often largely beyond the application owner’s control. This limits the ability of applications written in Python to leverage Python’s large ecosystem of high-quality packages.

This could be implemented in a number of ways and probably would need language level changes to be implemented cleanly. I’m wondering though if there is support for the high level idea or if there is some fundamental opposition to it.
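
To make the “language level changes” point concrete, here is a rough sketch (my own illustration, not the proposal itself; the package name C and the directory names are made up, while pip’s --target option is real) of what happens if you try to approximate per-package dependencies today: pip will happily install two copies side by side, but Python’s import system only ever resolves one of them per process.

```python
# Hypothetical package "C" and directory names; the pip commands themselves are real:
#   python -m pip install --target deps_for_a "C==1.0"   # the version A needs
#   python -m pip install --target deps_for_b "C==2.0"   # the version B needs
import importlib
import sys

sys.path.insert(0, "deps_for_a")
c_for_a = importlib.import_module("C")   # resolves to C 1.0

sys.path.insert(0, "deps_for_b")
c_for_b = importlib.import_module("C")   # still C 1.0: sys.modules caches the
                                         # first import under the name "C"
print(c_for_a is c_for_b)                # True -- only one C per process
```

Package B can never see “its” copy of C without changes to the import machinery, which is why I suspect installer-side tricks alone would not be enough.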

FYI this came to my attention because of how npm handles such a large ecosystem: How npm install Works Internally? - DEV Community

Maybe related to this discussion:

1 Like

Summary of that thread: It’s been tried before, and didn’t work very well. No-one is particularly enthusiastic to have another go.

But do read the full thread - it’s entirely possible that you’ll have a different perspective and may be able to bring something new to the table.

2 Likes

I’m wondering […] if there is some fundamental opposition to it.

To be clear, I have no opposition to the (hypothetical) proposal
of the ability to import different package versions in the same project:
there are inevitable conflicts when a library introduces a large number
of backward-incompatible changes. However, a workaround/convention for
this already exists: giving the new version a different name, which is
kind of obvious because it hardly does the same thing as the previous one.

I’m more worried that if such an import system were implemented, vendoring
would be abused by developers and have a negative net effect on the ecosystem.

The motivation is that maintaining very large
Python applications in a pip 20.3+ world has become very complex.

pip 20.3 did not introduce anything that makes package maintenance
any more complex. All it did was enforce the requirements,
some of which could have been ignored by the previous (more naïve)
dependency resolver. The complexity is with the packages themselves.

For example, the full version of homeassistant installs over 1000
dependencies and the full version of apache-airflow installs over 450
dependencies. A lot of time and effort is put into avoiding
conflicting transitive dependencies, something that is often largely
beyond the application owner’s control.

450 or 1000 is a giant number, and as you said, often largely beyond
the application developers’ control. What happens if a vulnerability
is found in a library 2 or 3 layers down? It’d be fairly cheap to
update to the fixed version in a downstream repository (and similarly
if one relies on PyPI and does not pin dependencies, though that’s not
a good idea for security, as monitoring is passive). It’d be impossible,
or at least very expensive, for the library’s upstream to patch
for dozens of versions used in the wild. Think of log4j.

I believe that Python packaging alone should not be used for
end-delivery, but only during development. I strongly suggest using
your favorite OS’s repository for the former use case instead.
This way, packages are well integrated with each other and fixes
are available to everyone. It often requires upstream developers
to collaborate for the compatibility with new library versions,
but I don’t see anything wrong with that.

I don’t use proprietary operating systems, but for macOS there are nixpkgs
and homebrew (the former even ensures reproducibility) and for Windows
there is chocolatey, I think. I’m working on a Python-specific
downstream repository called floating cheeses,
but it will need a lot of help to cover even the most common projects.

This limits the ability of applications written in Python
to leverage Python’s large ecosystem of high-quality packages.

I’m nitpicking here, but a huge number of dependencies does not equal
high quality. That doesn’t mean the reverse is always true either;
please just don’t confuse size with quality.

I’m not sure what distinction you’re making?

Managing dependencies is now more complex and, depending on the size of the requirements, takes a significant amount of manual effort to get a working solution. For example, homeassistant has not yet been able to migrate to pip 20.3+ and pins older versions of pip.

While it is not by any means the main motivation, I would argue the current situation is worse for security than the proposal. When you have a large number of transitive dependencies you cannot easily update packages, potentially leaving you stuck on vulnerable versions of transitive dependencies.

Imagine this hypothetical: you depend on packages A and B, and they both depend on package C. A vulnerability is discovered in package C. Package A updates C very quickly because A uses C in a way that exposes the vulnerability, but package B uses an older version of C, cannot easily migrate, and doesn’t even use C in a way that would expose the vulnerability, so updating is not a priority for them. Now you are stuck on a vulnerable version of C.

If you could update C to a non-vulnerable version for each package as soon as that package supports it, rather than waiting for every single one of your dependencies to update, then you reduce the potential attack surface of the vulnerability in a shorter time frame.

This doesn’t seem practical to me. I don’t know of an OS repository that supports even a small percentage of the Python ecosystem. Further, I cannot imagine OS vendors being remotely interested in supporting application dependencies for large applications like apache-airflow, where much time and effort has to be put into making sure transitive dependencies don’t conflict, and which release multiple times a year.

Do you have any example where a Python application with a large number of dependencies is able to successfully use an OS repository for end-delivery? I would be very interested in their methodology.

I was trying to say that there are a large number of high-quality packages in the Python ecosystem, not that having a large number of dependencies means they are all high quality. Apologies, I wasn’t clear enough here.

This is because the old version of pip let you install incompatible packages without telling you it had done so. So pip 20.3 isn’t “making it harder” so much as reporting to you the bugs that were already there.

And yes, I guess you could argue that your testing has demonstrated that the older pip’s behaviour is fine - but do you really have enough tests to cover all of the immense number of interactions between the 1000 packages you depend on? Remember, if package A says it’s not compatible with B 1.0, and you use A with B 1.0, the responsibility is on you to test that combination, because A’s maintainers won’t be testing it for you…

But yes, you’re to some extent right - this is a hard problem, and the packaging ecosystem doesn’t support it that well. But I’d say that applications with 500+ dependencies are a pretty rare edge case, and expecting general tools to support it perfectly is a bit of a stretch. (Of course the problem is that no-one has written tools targeted specifically at that sort of use case, so you have to make do with general tools in their absence).

[…]

Do you have any example where a Python application with a large
number of dependencies is able to successfully use an OS
repository for end-delivery? I would be very interested in their
methodology.
[…]

The global requirements list for OpenStack is over 500 entries in
length, with a global constraints list (transitive dependency
lockfile) which has roughly 25% more entries than that. OpenStack
releases new versions of its software twice a year, with updated
dependency lists each time, and the majority of it is packaged in
the Debian GNU/Linux distribution.

The upstream community for OpenStack embraces and works closely with
distribution package maintainers in an attempt to keep its
dependency set current and manageable, with a particular eye toward
avoiding redundant or inconvenient dependencies while sticking with
popular, supported options when faced with a choice of several
possible libraries to serve a given purpose.

1 Like

Spyder is packaged in all the major distros, and it has about 200 total direct or indirect dependencies, mostly Python packages but also binary libraries (though a large fraction of them are for optional functionality and it will work just fine without them). That being said, we don’t maintain or officially support those packages; the downstream distro packagers do, and they are often out of date and users sometimes run into issues with them, so we don’t recommend them. We haven’t

For some more background, our preferred delivery mechanisms are our standalone installers on Windows and Mac, and Anaconda on all platforms (in which Spyder is included by default). We also unofficially support a number of alternative delivery mechanisms, including being packaged with WinPython on Windows, Fink/MacPorts on Mac, and (as mentioned) the distros on Linux, as well as install via pip or from source on all platforms (with some work).

We only officially support the former, to ensure users have a reasonably consistent set of packages; we’ve never had a real problem getting a working solve (and Conda, one of our primary delivery mechanisms, has always had a strict dep resolver like the one pip just got). Besides the lower number of packages, we help maintain a lot of our main dependencies, and we and the other packages in our ecosystem (PyData) collaborate pretty heavily to ensure interoperability and maintain a consistent pace of development.

In practice Spyder is built to work with a reasonable range of dep versions, though we try to push users toward Anaconda and especially the standalone installers as they help guarantee a consistent set of up-to-date deps that have been relatively thoroughly tested together. If anything, our big problem with pip (and a big advantage of conda for us) until recently was that it didn’t have a dependency resolver to either ensure users get a compatible set of deps, or else fail early before something breaks downstream.

1 Like

I think I phrased my original post wrong. I wasn’t trying to say pip specifically is making it harder, but rather that pip 20.3+ has exposed the problem of managing a large number of dependencies (which can have an even larger number of transitive dependencies). And this is an idea for a possible solution.

I’ve skimmed over it once and now I’m slowly reading it in detail. From skimming over the post it seems to somewhat confirm my suspicion that to implement it in a way that doesn’t cause more headaches it will require the import mechanisms of Python to be changed or extended somewhat. But I will think about this some more.

1 Like

Yes, I know Spyder, as I maintain the commercial Anaconda install at the company I work for, so a lot of internal support questions about how to set Spyder up to work in certain ways get sent to me. A lot of people greatly benefit from your work, thanks!

Your experience with Python applications that are distributed by OS channels matches mine, all the dependencies tend to be very out of date.

I’m curious how much effort you find it takes to manage the dependencies for your build system?

I get the impression that Anaconda spends a lot of resources maintaining its base environment and that this is not an easy or cheap thing for them to do? Maybe I’m wrong, but being able to push out a new update to the number of packages they maintain in their base environment 2-3 times a year seems to be a large part of their commercial value, and competing with that would take non-trivial resources.

To be clear I’m not at all criticizing pip’s decision to stop installing conflicting dependencies. It’s the only thing that makes sense, and the more time I’ve spent on this the more I’ve been surprised that applications with a large number of dependencies ever worked with pip < 20.3.

1 Like

Thanks! :blush: I don’t work that much on the core Spyder application anymore, at least right now, beyond high-level design and UX input, long-term strategy, and occasional PR review, bug fixes, etc. Right now I mostly run the docs, website, theme and some of the upstream dependencies like QtPy, Docrepr, etc. It’s a lot of people’s volunteer and funded contributions that have made Spyder what it is!

It depends on exactly what you include, and I’m not really the best person to ask since I’m not super involved in that aspect, at least not nowadays for the Spyder code; Daniel does the Windows installers and Ryan does the Mac ones, and Carlos (the lead maintainer) keeps an eye on the overall deps. We’re usually involved in the discussions beforehand, if we’re not the maintainers ourselves, when our main deps have an important/breaking change, and plan accordingly. If there’s something we don’t catch, our CIs do and we fix it. Then we just freeze the latest deps from our CIs on each platform in our installers (which are now also CIed/CDed in our CIs to continuously test and build them, as I understand) and we’re good to go. But again, I’m not the expert in this area, I mostly do library stuff.

Yeah, exactly. It’s somewhere in between a distro like Debian, where everything is tested and must work together for years and is hard-locked to upstream versions, and a “rolling release” like pip where things go live immediately. While it does mean Spyder updates to the defaults/anaconda channel, much less those in the actual packaged releases, are slower, it also improves stability of the environment in general, and is why we usually recommend Anaconda users who are on defaults to only upgrade with conda update anaconda rather than updating individual versions, though nowadays with our standalone installers and focusing more toward conda-forge that’s changed somewhat.

Well, from your side you don’t need to ensure that every dependent package does everything it is supposed to do for its users; all you need to ensure is that your application works properly with one set of dependencies on the platforms you support, which presumably your unit/integration/functional tests, tox config and CI/CD jobs already do. Since you seem familiar with Anaconda, perhaps Conda Constructor would be worth a look?

Certainly, with so many deps there is a substantial likelihood that one has a bug or incompatibility, but in general application developers, the distros and other downstream users tend to catch such bugs and report them early, which gets them fixed for everyone. That’s the blessing and the curse of open-source development, really.

1 Like

“Dozens” of versions is a bit of an exaggeration. The vast majority of the time when a library has multiple major versions in use[1], there are only two, and very rarely three or more. However, when a bug or security flaw is found, the effort is usually way more than double: you have to see whether the previous version is affected, and then port and test the fix for it.

Given this, I would argue the Python package ecosystem is safer and more reliable with no multi-version installs, as volunteers only have to think about their latest version. This means consumers of the library need to keep up to date with their dependencies to have the latest security and reliability fixes, or to solve second-order conflicts. I expect projects with many dependencies to have more resources to do so.


  1. I’m assuming semantic versioning, where a major version bump indicates backwards-incompatible API changes ↩︎

1 Like

The final post in Allowing Multiple Versions of Same Python Package in PYTHONPATH has the current state of affairs on this. Quoting myself:

Beyond that, I don’t think we’re contributing anything new to that discussion here.

And, yes, the new resolver is stricter which is why 20.3+ is going to be more painful. And, also, the older resolver still prints out that it installed with conflicts (see PR 5000 on pip).

Realistically, the only option you have if you really want to have conflicting dependencies is running pip with --no-deps, which opts out of all dependency resolution. That makes it clear that you’re opting into whatever solution for the dependency hell that you’ve determined works for you.
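
As a minimal sketch of what that looks like in practice (the requirements.txt name here is just an example): you resolve and pin every package yourself, and then tell pip not to do any dependency resolution at all.

```python
# Sketch of the --no-deps workflow: requirements.txt is assumed to contain a
# fully pinned set of packages that you have already resolved yourself.
import subprocess
import sys

subprocess.run(
    [sys.executable, "-m", "pip", "install", "--no-deps", "-r", "requirements.txt"],
    check=True,
)
```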

1 Like

Well, if pip had the option to handle dependencies like npm does, then there would not be dependency hell, and in that case there would be no need to spend any time backtracking to find a solution.

But it seems fairly clear from reading the past thread that this would require a Python-level change to implement sanely, not just a fancy trick with installer tools.

Yes it would, and I’m not honestly sure the result would be “sane” even then. The sort of issues involved are deep within the basic design of Python’s import system, and it would be a significant amount of work to change without massive backward incompatibilities.

At this point, this isn’t really a packaging question any more, but rather a core language design discussion.

2 Likes

And, for that sort of thing, poking python-ideas with a reasonable plan would be the first step. I’ll just say this: Don’t get your hopes up; expect that people will say “no” to it, in a lot of words. :slight_smile:

1 Like

There still would be dependency hell, actually. See the following post for the details:

3 Likes

It’s a good example.

I think any plan would need to include some language-level ways to import specific versions, get runtime guarantees on the versions of objects (isinstance(arr, version="numpy>=1.17")), provide type hinting options for the versions of objects, and be able to express packaging requirements that say something like “Package A depends on package C, and if package B depends on C it must be the same version”.

Which, as you and others have stated, would be a big and complex change to Python. Thanks for all the feedback.
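
For what it’s worth, the closest I can get today (my own sketch, using importlib.metadata from the standard library and the third-party packaging library; the numpy specifier is just the example from above) only lets me ask about the single installed copy of a distribution, not about which version a particular object came from, which is exactly the gap:

```python
from importlib.metadata import version          # stdlib since Python 3.8
from packaging.specifiers import SpecifierSet   # the "packaging" project on PyPI

def installed_version_matches(dist_name: str, spec: str) -> bool:
    """Return True if the single installed copy of dist_name satisfies spec."""
    return version(dist_name) in SpecifierSet(spec)

# e.g. installed_version_matches("numpy", ">=1.17") -- but there is no way to ask
# which version of numpy a particular array object came from, because only one
# numpy can be importable per process today.
print(installed_version_matches("pip", ">=20.3"))   # assumes pip is installed here
```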

It’d be impossible, or at least very expensive, for the library’s
upstream to patch for dozens of versions used in the wild.

“Dozens” of versions is a bit of an exaggeration. The vast majority
of the time when a library has multiple major versions in use[1],
there are only two, and very rarely three or more.

The context was a hypothetical scenario where multi-version import
was possible. I may have exaggerated a little, but it’s not
far-fetched given how many packages restrict dependencies
to the minor version or worse these days (take a look at the first
few hundred of the most popular ones on PyPI).

Not every semver-looking version is semantic, but the real problem
with semantic versioning is that many people do not trust it.
I strongly believe that multi-version import would quickly turn into
a footgun, as it did in other ecosystems like the JVM.


  1. I’m assuming semantic versioning,
    where a major version bump indicates backwards-incompatible API changes ↩︎

Imagine this hypothetical: you depend on packages A and B, and they
both depend on package C. A vulnerability is discovered in package C.
Package A updates C very quickly because A uses C in a way that
exposes the vulnerability, but package B uses an older version of C,
cannot easily migrate, and doesn’t even use C in a way that would
expose the vulnerability, so updating is not a priority for them.
Now you are stuck on a vulnerable version of C.

If you could update C to a non-vulnerable version for each package
as soon as that package supports it, rather than waiting for every
single one of your dependencies to update, then you reduce the
potential attack surface of the vulnerability in a shorter time frame.

In a downstream distribution, usually only one version of C will exist
(unless C is shared among many, many dependees and there are breaking
changes). Update just C and you’re set; this would not be possible
if, say, A wants C~=4.2.0 and B wants C~=4.3.0. This works without ever
needing to manually investigate whether B is affected.
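
To make the conflict concrete, here is a tiny sketch (package C is
hypothetical; the packaging library is the one pip itself vendors)
showing that no single release can satisfy both pins:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

a_wants = SpecifierSet("~=4.2.0")   # equivalent to >=4.2.0, ==4.2.*
b_wants = SpecifierSet("~=4.3.0")   # equivalent to >=4.3.0, ==4.3.*

# Check a handful of plausible releases of the hypothetical package C.
candidates = [Version(f"4.{minor}.{patch}") for minor in (2, 3, 4) for patch in (0, 1, 2)]
print([str(v) for v in candidates if v in a_wants and v in b_wants])   # [] -- no overlap
```

A downstream distribution that carries a single, newer C sidesteps this
entirely, as long as upstreams do not pin that tightly.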

This doesn’t seem practical to me. I don’t know of an OS repository
that supports even a small percentage of the Python ecosystem.

An end-user repo only needs to add packages on demand, unlike package
indices, which add packages as soon as they are written.

Further, I cannot imagine OS vendors being remotely interested in
supporting application dependencies for large applications like
apache-airflow, where much time and effort has to be put into making
sure transitive dependencies don’t conflict, and which release
multiple times a year.

It’d help if, instead of thinking of them as vendors, you think of
them as volunteers working to serve the end users. You and I could
and should be among them. The rule of thumb is that a package is
supported when there are enough users asking for it or someone is
willing to step in and do the work. Large corporations like Apache
could make sure their packages are available in every distro,
but instead decide to spend resources on their own delivery
infrastructure.

I strongly believe that effort should be spent in collaboration instead.
Even from the security standpoint, it would be a nightmare having to
trust every upstream developer of my 100+ applications to carefully
manage every single transitive dependency, assuming that I know
and trust them personally in the first place. That’d simply be
unrealistic, not to mention the waste of effort each of them needs
to spend on common dependencies.

Do you have any example where a Python application with a large number
of dependencies is able to successfully use an OS repository for
end-delivery? I would be very interested in their methodology.

I’d like to interest you in tensorflow, scancode-toolkit, sourcehut,
ihatemoney or mitmproxy. What they mostly have in common is not
over-restricting their dependencies’ versions, which makes them
more friendly to downstream.