Packaging forks

Let’s say I want to distribute my version of a package foo, namespaced with my own PyPI nick, e.g. mrmino.foo.

Ignoring the fact that pip does not support it yet, would it be considered correct to add a Provides-Dist: foo to the package metadata in that case?
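For concreteness, here is a minimal sketch of what the fork's metadata might look like and how a tool could read the field. The metadata text, name, and version are illustrative assumptions; the only real API used is the standard library's `email` parser, which can read the header-style core metadata format.

```python
from email import message_from_string

# Hypothetical core metadata for the fork; name and version are illustrative.
METADATA = """\
Metadata-Version: 1.2
Name: mrmino.foo
Version: 1.0
Provides-Dist: foo
"""


def provided_names(metadata_text):
    """Return every Provides-Dist value declared in a core-metadata document."""
    msg = message_from_string(metadata_text)
    # Provides-Dist may appear several times, so collect all occurrences.
    return msg.get_all("Provides-Dist") or []


print(provided_names(METADATA))
```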

I’m looking into how I could make my PRs available via PyPI before they are merged, and ensure they don’t collide with whatever is in the original package.

I believe so, yes. I think it fits the following part of the spec, though the spec is not entirely clear on this.

From PEP 345 -- Metadata for Python Software Packages 1.2 | Python.org

A distribution may provide additional names, e.g. to indicate that multiple projects have been bundled together. For instance, source distributions of the ZODB project have historically included the transaction project, which is now available as a separate distribution. Installing such a source distribution satisfies requirements for both ZODB and transaction.

Yes, I believe that’s the intent of the field.

Having said that, I wonder if the field itself is such a good idea. With recent focus on attacks involving packages pretending to be other packages, is an official way to say “I am requests” a good idea? I suspect that if you can exploit this, you probably already have enough access to manage a more effective exploit, but it’s probably something we should consider before it gets implemented in tools…

Ideally pip would always prompt the user for confirmation when an alternative package provides a distribution and that package happens to be the only one that resolves the dependency requirement. If I were a pip maintainer I would probably refuse to implement a --no-confirm option or something like that, and instead reinforce that you should be using lockfiles for production, but if users really wanted to do a pip install I would maybe add a --choose-distribution distribution-name=implementation-package option.

As someone with packaging experience in other ecosystems, I think having this possibility is very helpful, and that the burden is on the package manager implementations to think about it and make sure they implement it in a non-footgun way for users, if that’s in their scope.

An important difference is that anyone can publish on PyPI but that’s not true for many other ecosystems, like the ones managed by Linux distros.

Is there any prior art in an anyone-can-upload-to-the-default-index ecosystem that supports this mechanism?

Given the way pip works, it will never discover these alternative implementations. There is no database it can consult to find all packages which “provide foo”. The only way such a package would ever be installed is by the user explicitly asking for it.

Wrt. my intended usage the situation is already worse. If I want to provide my own version of an open source project, currently my mrmino.foo would just break the environment of the person who wants to install it over foo to test it out. I don’t think that the lack of Provides-Dist support stops me from making my package replace files of some other package.

Example: paramiko-ng has to resort to some nastiness with sdist installation managed by environment variables in order to make the installed package stand as paramiko.

Exactly. The only edge case I can think of in which this would be difficult to manage is when both my mrmino.foo and foo end up being installed in the same pip call, e.g. as transitive dependencies. I guess pip could just refuse to do that?

Yes, currently there isn’t, but I feel like PyPI should have an API for that.

Not just currently, it likely won’t ever. I thought about how Provides-Dist could be implemented in pip’s resolver, and the conclusion I reached was that all alternative packages must be specified by the user. This is not only best for security, but also the only way (I can think of) to make the behaviour predictable to the user.

Take pillow vs pil for example, if a package pkg1 requires pil, pip install pkg1 will install pil by default. There are only two ways to trigger the pillow replacement:

  1. pip install pkg1 pillow (order is insignificant).
  2. pillow is pre-installed into the environment before pip install pkg1.

In either case, the replacement package (pillow) is controlled by the user, so pip does not open the door to supply chain attacks. If another package pkg2 requires pillow, pip install pkg1 pkg2 will pull in both pil and pillow without either replacing the other, and will probably fail since the resulting installation won’t work. The user will need to do pip install pkg1 pkg2 pillow instead to trigger distribution replacement.
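The rule above can be sketched as a toy planner. This is not pip's resolver; `plan_install` and its dictionary inputs are hypothetical, and versions are ignored entirely. It only illustrates that a replacement takes effect when the user names it or has it pre-installed, never when it arrives transitively.

```python
def plan_install(requested, requires, provides, installed=()):
    """Pick distributions to install under the 'user must opt in' rule.

    requested: names given on the command line
    requires:  dict mapping a package to the names it depends on
    provides:  dict mapping a replacement package to the name it provides
    installed: names already present in the environment
    """
    # Only replacements the user named or pre-installed may stand in.
    opted_in = {provides[p]: p for p in (*requested, *installed) if p in provides}
    plan, queue = [], list(requested)
    while queue:
        name = queue.pop(0)
        chosen = opted_in.get(name, name)
        if chosen in plan or chosen in installed:
            continue
        plan.append(chosen)
        queue.extend(requires.get(chosen, ()))
    return plan


requires = {"pkg1": ["pil"], "pkg2": ["pillow"]}
provides = {"pillow": "pil"}

print(plan_install(["pkg1"], requires, provides))
print(plan_install(["pkg1", "pillow"], requires, provides))
print(plan_install(["pkg1", "pkg2"], requires, provides))
```

In this sketch the third call plans both pil and pillow, because pillow is only a transitive dependency there and the user never opted in to the replacement.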

Anyway, I just want to say there are ways to avoid the feature becoming a supply chain issue. There are still many other problems around Provides-Dist that need to be designed correctly (e.g. how the replacement package’s version relates to that of the replaced one). Not to mention that having the resources to implement it is a whole other problem.

Maybe this could be worked out in a satisfactory manner by having the index show alternatives on the package page instead?

Yes, I think this is a good idea. Though I still think an API would be reasonable, even if package managers will very likely not want to use it, per what @uranusjr explained.

Would it be feasible to try to introduce a limited support for it in the form of “whatever is required to make forks work”?

Could you elaborate a bit? I don’t quite understand what you mean.

Users can discover directed edges between packages related by the Provides-Dist: package attribute through PyPI in order to ________ (?)

https://en.wikipedia.org/wiki/Many-to-many_(data_model)

https://en.wikipedia.org/wiki/User_story

I mean a minimal, restricted implementation that would get Provides-Dist working for packages that are meant as direct substitutes (forks).

Let’s say I want to have forked_foo as a substitute for foo (in the same way paramiko-ng is a substitute for paramiko).

The support would mean that a package forked_foo would be effectively treated as foo in dependency resolution, so that installing one removes the other, dependencies on foo are resolved by having forked_foo, etc.

In order to make the scope of this as strict as possible, the implementation would only trigger when all of the following conditions are met:

  • Provides-Dist is given once
  • The field has a version specified
  • The field has no environment markers specified
  • The package is not being installed as a transitive dependency, i.e. only directly via command line arguments or the target of -r.

In all other cases Pip could ignore the field and (maybe) show a warning if a package has a Provides-Dist field but the conditions above are not met.
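As a rough illustration, the four conditions could be checked along these lines. `substitution_target` is a hypothetical helper, not anything pip provides, and the regular expression is a simplification that assumes the `name (version)` form shown in PEP 345 rather than a full requirement parser.

```python
import re

# Simplified "name (version)" pattern; a real tool would use a proper parser.
_PROVIDES = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*(?:\(([^)]+)\))?\s*$")


def substitution_target(provides_dist, is_direct):
    """Return (name, version) if all four conditions hold, else None.

    provides_dist: list of raw Provides-Dist values from the metadata
    is_direct:     True if the package was named on the command line or via -r
    """
    if len(provides_dist) != 1:        # the field must be given exactly once
        return None
    value = provides_dist[0]
    if ";" in value:                   # no environment markers allowed
        return None
    match = _PROVIDES.match(value)
    if match is None or match.group(2) is None:   # a version must be specified
        return None
    if not is_direct:                  # never for transitive dependencies
        return None
    return match.group(1), match.group(2).strip()


print(substitution_target(["foo (1.2.3)"], is_direct=True))
```

Any `None` result corresponds to the "ignore the field and maybe warn" branch.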

That way there’s something to start on without the danger of biting off more than we can chew, and each condition can be relaxed one by one afterwards, when the wrinkles have been ironed out.

How would I know that I’ve accidentally installed packages that replaced the blessed satisfying package instead of dumbly shadowing said package without uninstalling said package automatically at all?

From TODO (other thread re: software supply chain security i.e. DevOpSec); I don’t think we can solve this without having - minimally - (1) per-package-release keyrings; (2) that are “pinned” as trusted.

More completely, per-package, there could be at least [known] commit, merge, and release keyrings

To prevent this, you can store per-platform/architecture hashes in a pip requirements file; or use e.g. Pipfile.lock to cache/trust the hashes for all platform/architectures that that constraint/version would install. But those aren’t signed.
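For reference, a pinned requirement with a hash can be produced roughly like this. The helper name and the sample inputs are made up for illustration; the `--hash=sha256:<hex digest>` form is the one pip's hash-checking mode expects in a requirements file.

```python
import hashlib


def pinned_requirement(name, version, artifact_bytes):
    """Build a requirements.txt line that pins a version and the artifact's hash."""
    digest = hashlib.sha256(artifact_bytes).hexdigest()
    return f"{name}=={version} --hash=sha256:{digest}"


# Illustrative only: real input would be the bytes of a downloaded wheel or sdist.
print(pinned_requirement("foo", "1.2.3", b"example artifact contents"))
```

As noted above, this ties an install to specific files but does not say who produced them.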

LD-Signatures is one very portable way to sign JSON-LD (any RDF). There may be a way to use GPG keyrings with ld-signatures if appropriate in addition to: the TUF signatures, and the GPG support that was removed from warehouse where it may still be possible to upload an .asc?

Without adding per-package manifests with (signed) per-file checksums, is there yet any way to detect or prevent package shadowing with or without Provides-Dist specified?

What is the current behavior with these requirements.txt and/or e.g. install_requires?

pip
pipZ #(Provides-Dist: pip)
pipZ #(Provides-Dist: pip)
pip

Sorry, I don’t understand how this is connected with Provides-Dist for forks.

I think as long as the rules outlined in my earlier post come into play, this issue becomes irrelevant to this discussion. The user would have to willingly install a fork for this to trigger, and even in the case of typosquatting it is still better to have pip explicitly uninstall the package that is being substituted, rather than have it shadowed implicitly, which is hard to detect.

I can still release a django-fork that substitutes files in django, with or without Provides-Dist, that hasn’t changed, no?

UX-wise pip list could show which distribution substitutes what, so it’s obvious to the user and they have a way to check it. This is one way in which such implementation would improve transparency.
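Something like that listing can already be approximated outside pip, since the field is readable from installed metadata. A minimal sketch, assuming the substitutes declare Provides-Dist in their METADATA; `substitutions` is a hypothetical helper built on the standard library's `importlib.metadata`.

```python
from importlib import metadata


def substitutions(dists=None):
    """Map each distribution to the names it declares via Provides-Dist.

    dists: iterable of distribution objects; defaults to the current environment.
    """
    result = {}
    for dist in (metadata.distributions() if dists is None else dists):
        provided = dist.metadata.get_all("Provides-Dist")
        if provided:
            name = dist.metadata.get("Name") or "<unknown>"
            result[name] = list(provided)
    return result
```

Calling `substitutions()` with no argument scans the environment the interpreter is running in; since tools currently ignore the field, the result only reflects what packages declare, not what was actually replaced.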

In fact, I think that having this field implemented properly opens perspectives for pip to start refusing to install packages that would lead to path collisions. You would have a way of specifying when substitution was intended, you can then treat the rest of these cases as malicious, and those that do specify it are easily detectable. Wouldn’t it improve the situation around supply chain attacks?

What is the current behavior with these requirements.txt and/or e.g. install_requires?

AFAIK this field is currently completely ignored.

There’s a very obvious complication that Provides-Dist creates for installers: how do they track/determine the reverse of “Provides-Dist”?

While installing “numpy”, if you want the installer to look for a different package with a Provides-Dist, then the installer needs to go through the entire package set at least once (unless it’s using an additional database to keep an easier-and-faster-to-access form of this data). This means every installer becomes even more complex.

I’m not too sure how I feel about the complexity this adds to the whole system TBH (okay, instinctively, I don’t like it but I can’t pinpoint why). FWIW, the key exists right now, and folks are welcome to try experimenting with it. If someone has a good (read: consensus that it’s a good) plan here, we can adopt it! :slight_smile:

It may be easier to discuss a specific set of requirements in order to determine what the current behavior is and what Provides-Dist support would add when this requirements.txt is installed with pip install -r requirements.txt:

# requirements.txt
pip
pipZ      # Provides-Dist: pip
pip-fork  # shadows pip (site-packages/pip)

  • We’re certainly not suggesting that every time I pip install pip I need to do reverse lookups of packages that Provides-Dist: pip?
  • When I pip install pip, pip does not record which packages were installed by which command when (a log of package operations) and pip does not check for file collisions (for an intersection between the list of installed package files and the list of to-be-installed package files).

It would also be great to be able to determine from which index a package was installed through pip list; and then whether the hashes and signatures match a key in a keyring for those requirements maybe in pip check?

In order to determine that:

  • pipZ is installed instead of pip (which was uninstalled because Provides-Dist)
  • pip was installed from a different index server
  • pip is a development release
  • pip-fork writes files into site-packages/pip/plugins/
  • pip-fork overwrites existing files in site-packages/pip that were installed by pipZ
  • the checksums of the files in site-packages/pip match those in the pip .whl ZIP manifest
  • the pip .whl is the package that was uploaded to PyPI (TUF)
  • the pip .whl is the package that PyPA devs released (GPG? ld-signatures? {commit, merge, release} keyrings)

Are these necessary or sufficient:

  • Pip must create a JSON install log that includes:
    • the pip CLI command
    • the pip config
      • configured or specified install destination paths (*)
    • the RequirementSet
    • the InstallRequirements (including the requirements.txt source text)
    • the expected and calculated hashes, and the keyring that was trusted (if that becomes possible to do with Python packages)
  • Pip and/or python packages must check for package file collisions (given past and specified install destination paths)
    • Wheels are ZIPs; which do have a manifest and CRCs

unless it’s using an additional database to keep an easier-and-faster-to-access form of this data

My idea was to utilize synthetic .dist-info directories with that information, such that there’s no possibility of having both original package and its substitute installed at the same time.

So the forked_foo 1.0 with Provides-Dist: foo (1.2.3) would add both forked_foo-1.0.dist-info and foo-1.2.3.dist-info, but the latter with additional information that helps in reverse lookup.
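A minimal sketch of that idea, to make it more concrete. The `PROVIDED_BY` file name is a made-up placeholder for the "additional information", not part of any spec, and the directory layout is reduced to a bare METADATA file; `importlib.metadata.distributions(path=...)` is used to show that standard tooling would see both entries.

```python
import tempfile
from importlib import metadata
from pathlib import Path


def write_dist_info(site, name, version, provided_by=None):
    """Create a minimal <name>-<version>.dist-info directory under site."""
    info = Path(site) / f"{name}-{version}.dist-info"
    info.mkdir(parents=True)
    (info / "METADATA").write_text(
        f"Metadata-Version: 2.1\nName: {name}\nVersion: {version}\n"
    )
    if provided_by:
        # PROVIDED_BY is a hypothetical file name used only in this sketch.
        (info / "PROVIDED_BY").write_text(provided_by)
    return info


def reverse_lookup(site):
    """Map each synthetic (provided) name to the distribution standing in for it."""
    found = {}
    for dist in metadata.distributions(path=[str(site)]):
        owner = dist.read_text("PROVIDED_BY")
        if owner:
            found[dist.metadata["Name"]] = owner.strip()
    return found


with tempfile.TemporaryDirectory() as site:
    write_dist_info(site, "forked_foo", "1.0")
    write_dist_info(site, "foo", "1.2.3", provided_by="forked_foo")
    print(reverse_lookup(site))
```

Because the synthetic foo-1.2.3.dist-info occupies the name foo, an installer that refuses duplicate names would have to remove it (and the fork) before installing the real foo.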

If someone has a good (read: consensus that it’s a good) plan here, we can adopt it! :slight_smile:

This thread is basically me eagerly probing for whether there could be a consensus on this :stuck_out_tongue:. I’m also thinking about creating a separate tool just to test it out, for the sake of making it more concrete.

Edit:
Oooh, wait, you mean determining what forks are available from given index for a given package? I wasn’t talking about this kind of a lookup. What I want to achieve is proper local resolution.