PEP 710 - Recording the provenance of installed packages

Pinging this thread again since there hasn’t been any pushback on the suggested changes. If we get those incorporated and then push the PEP draft to be review-ready, does that seem like a good plan? @fridex if you’d like me to add the updates to the PEP I am happy to, let me know.

There is one pending PR I need to revisit (will do ASAP). If you have time, feel free to contribute - I’m more than happy if people want to get involved.

Also, if anyone else has other comments to be incorporated, please feel free to raise them.

My point here is not to re-use PEP 610, but rather the direct url data structure specification, which is a standalone page, that is independent of PEP 610.

For instance the parts about user:password and hashes could be dropped and the specification part of the PEP reduced to something like “a direct URL data structure, restricted to Archive URLs”.

I think I insist on this partly because your prototype implementation in pip reuses the direct_url implementation. If we intend a common implementation to be used, then there are benefits to having a single spec: it makes sure spec and code stay in sync if and when the spec evolves, and it avoids problems when we change PEP 610 related implementation details in ways that happen not to be compatible with PEP 710.

Alternatively, we may consider that reusing the direct URL data structure is overkill for PEP 710, because all we need here is an archive URL and hashes, and have a completely independent code base for PEP 610 and PEP 710. In that case, it may be interesting to depart more radically from the existing data structure, to simplify and avoid potential confusion. For instance, the archive_info field may not be necessary here: it was needed in PEP 610 to discriminate between various kinds of URLs (archives, vcs, local directories), a requirement that is not present in PEP 710, which needs to support only one kind of URL.

@sbidoul I noted this in the GitHub PR as well, but if we’re going to reference the Direct URL data structure and we want an additional field index_url it may require updating the Direct URL data structure to have that field? I believe having index_url is a good idea, it distinguishes between mirrors that are serving their own content versus forwarding along to upstream and captures user intent more clearly.

If Direct URL data structure handlers are resilient to additional new fields I think it may be worth updating the structure so we’re not reinventing more formats? The Direct URL archive type meets almost all of the needs of this PEP already.

+1 on including the index_url field, seems like a valuable addition.

The prototype reused the Direct URL implementation in pip, nevertheless, I’m not sure whether we should be influenced by this implementation detail. We could diverge from the Direct URL data structure if it makes sense.

Hi all, I am the author of PDM, called here by @sethmlarson

I did not read all the posts in this thread, I just have some questions about index_url. It seems to be added in the last revisions of the PEP, quoting:

The value of the index_url key MUST be a base URL of a Package Index used to download the given distribution package, such as https://pypi.org/simple/ . The recorded URL SHOULD point to a repository compliant with PEP 503.

However, for packages installed from --find-links, it is not specified what this field should be.

Say a package is installed via --no-index --find-links=https://myorg.com/packages and the url is https://myorg.com/packages/foo-0.1.0-py3-none-any.whl, should index_url be https://myorg.com/packages? It is not a PEP 503 simple repository. But since the modal verb used here is SHOULD, is this allowed?

This PR doesn’t implement index_url either, so I am lost here.

Thanks for joining the conversation @frostming! :pray:

Hmm! That’s a good case to handle. I don’t have as complete an understanding as others, but it makes sense if --find-links has different behavior than a PEP 503 index? Maybe the field should be omitted in these cases. If so, how would the wording of the PEP change?

Yeah the PR you’re linking to was created before the index_url field was added, the PR will need to be updated. @fridex and I chatted recently about this :slight_smile:

I can explain the differences between index_url and find_links

Say you have specified --index-url=https://myindex.com and --find-links=https://mylinks.com and run pip install foo, then pip will look for the package links for foo in the following locations:

  • https://mylinks.com/ (from find links)
  • https://myindex.com/foo/ (from index urls)

So find-links is different in that it doesn’t append the package name to the URL. Therefore, if for index URLs the base URL without the package name is what gets recorded, there is no equivalent for find links, and it is okay to omit the field.
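
A minimal sketch of the lookup difference described above. The function names are made up for illustration and this is not pip’s actual code; the only thing assumed beyond the post is the PEP 503 name normalization applied to the project name.

```python
import re


def normalize(name):
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def candidate_locations(project, index_urls, find_links):
    """Return the pages an installer would scan for links to `project`.

    Hypothetical helper: find-links locations are used as given, while
    index URLs get the normalized project name appended.
    """
    locations = list(find_links)
    for index_url in index_urls:
        locations.append(index_url.rstrip("/") + "/" + normalize(project) + "/")
    return locations


print(candidate_locations("foo", ["https://myindex.com"], ["https://mylinks.com/"]))
```

Only the index entry has a meaningful "base URL without the package name", which is why there is nothing comparable to record for find links.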

As a side note, both index URLs and find links allow local paths or URLs starting with file://. That may be worth considering and clarifying in the PEP.

Good point, thanks for bringing this up.

Considering this from a consumer’s perspective, I’m not sure whether we should mix those two. We should probably record index_url only if the given distribution comes from the specified index. Maybe we could also introduce a find_links key, which would be set if the given distribution comes from the location specified using --find-links (and default to null otherwise). Or, if this would complicate usage and acceptance of this PEP, drop both.


Assuming we want to keep index_url and introduce find_links:

  • the following can be the content of provenance_url.json file for scenarios when installing the given distribution using pip install micropipenv (or with an explicitly provided --index-url):
{
  "archive_info": {
    "hash": "sha256=257ded4ea1fafa475f099e544b2d7560f674d42917e096d462e8a46a64f51245",
    "hashes": {
      "sha256": "257ded4ea1fafa475f099e544b2d7560f674d42917e096d462e8a46a64f51245"
    }
  },
  "index_url": "https://pypi.org/simple",
  "find_links": null,
  "url": "https://files.pythonhosted.org/packages/07/44/46967147557e45a01d13e8c96836d733f799a82f568e8387048caea0f4ac/micropipenv-1.7.0-py3-none-any.whl"
}
  • Another case when installed using pip install --find-links=http://localhost/page.html --no-index:
{
  "archive_info": {
    "hash": "sha256=257ded4ea1fafa475f099e544b2d7560f674d42917e096d462e8a46a64f51245",
    "hashes": {
      "sha256": "257ded4ea1fafa475f099e544b2d7560f674d42917e096d462e8a46a64f51245"
    }
  },
  "index_url": null,
  "find_links": {
     "url": "http://localhost/page.html"
  },
  "url": "https://files.pythonhosted.org/packages/07/44/46967147557e45a01d13e8c96836d733f799a82f568e8387048caea0f4ac/micropipenv-1.7.0-py3-none-any.whl"
}
  • And also when using pip install --no-index --find-links=page.html micropipenv (with a file):
{
  "archive_info": {
    "hash": "sha256=257ded4ea1fafa475f099e544b2d7560f674d42917e096d462e8a46a64f51245",
    "hashes": {
      "sha256": "257ded4ea1fafa475f099e544b2d7560f674d42917e096d462e8a46a64f51245"
    }
  },
  "index_url": null,
  "find_links": {
     "file": "/path/to/page.html"
  },
  "url": "https://files.pythonhosted.org/packages/07/44/46967147557e45a01d13e8c96836d733f799a82f568e8387048caea0f4ac/micropipenv-1.7.0-py3-none-any.whl"
}

This way we would be able to track provenance and see what was the actual source configured when installing. However, my point here is not to complicate things unless there is added value. We could also add timestamps or hash of the file when --find-links points to a local file and so on.
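
To make the three scenarios concrete, here is a rough sketch of how an installer could assemble such a record. It follows the key layout of the examples above, which is only the proposal under discussion and not a settled format; build_provenance is a hypothetical helper, and treating anything containing "://" as a remote page is an assumption of this sketch.

```python
def build_provenance(url, sha256, index_url=None, find_links=None):
    """Build the record proposed in this post (hypothetical, not the final PEP format)."""
    if index_url is not None and find_links is not None:
        raise ValueError("record either index_url or find_links, not both")
    record = {
        "archive_info": {
            "hash": f"sha256={sha256}",
            "hashes": {"sha256": sha256},
        },
        "index_url": index_url,
        "find_links": None,
        "url": url,
    }
    if find_links is not None:
        # Assumption: values with a URL scheme go under "url", local paths under "file".
        key = "url" if "://" in find_links else "file"
        record["find_links"] = {key: find_links}
    return record
```

For example, build_provenance(url, digest, find_links="/path/to/page.html") yields the third shape, with "index_url" set to null when serialized as JSON.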

Considering the above, do we still find value in keeping index_url in provenance_url.json? For the provenance use case, would it make sense to keep just the url field in these scenarios, omitting index_url and avoiding confusion with --find-links?

Excuse the drive-by comment, but what would be recorded for an installation that was from a local file/directory - something like pip install . or pip install foo-1.0-py3-none-any.whl? In those cases there really isn’t a URL the package came from (unless you treat it as if the user had said --find-links . which isn’t precisely true, but is probably the closest you’d get…)

This is covered by PEP 610. In such a case, a direct_url.json file is created (this is also mentioned in PEP 710):

$ pip install ~/Downloads/micropipenv-1.7.0-py3-none-any.whl
$ cat /tmp/venv/lib/python3.11/site-packages/micropipenv-1.7.0.dist-info/direct_url.json         
{"archive_info": {}, "url": "file:///home/fridolin/Downloads/micropipenv-1.7.0-py3-none-any.whl"}
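
The same file can be read from Python with the standard library. A small sketch, where read_direct_url is a name chosen here for illustration:

```python
import json
from importlib.metadata import PackageNotFoundError, distribution


def read_direct_url(name):
    """Return the parsed direct_url.json of an installed distribution, or None."""
    try:
        dist = distribution(name)
    except PackageNotFoundError:
        return None
    # read_text returns None when the file is not present in the dist-info directory.
    text = dist.read_text("direct_url.json")
    return json.loads(text) if text else None
```

A distribution installed from an index has no direct_url.json, so the function returns None for it.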

As no concerns were raised, I’ve updated the PEP and removed the index_url key. I will review the discussion this week and see if anything else is missing to proceed with this PEP. I will also finish the implementation in pip. If you have any ideas or concerns, please feel free to raise them.

According to the installation report documentation, ArchiveInfo is allowed to have no hashes stated:

  • For source archives, download_info.archive_info.hashes may be absent when the requirement was installed from the wheel cache and the cache entry was populated by an older pip version that did not record the origin URL of the downloaded artifact.

I’ve adjusted the PEP to explicitly require at least one hash in the provenance_url.json file. If a wheel is installed from pip’s cache and was built using an older pip, users are encouraged to rebuild the wheel so that at least one hash is available. See the related PEP adjustment, and feel free to raise any concerns or comments.
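
A consumer-side check for this requirement could look roughly like the following. It assumes the hashes live under archive_info.hashes, as in the JSON examples earlier in the thread; has_required_hash is a hypothetical name.

```python
def has_required_hash(provenance):
    """Return True when archive_info.hashes holds at least one non-empty digest."""
    hashes = provenance.get("archive_info", {}).get("hashes") or {}
    return any(isinstance(digest, str) and digest for digest in hashes.values())
```

A record like {"archive_info": {}, "url": "..."}, which is valid for direct_url.json, would be rejected by this check.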

Hi all – I haven’t read the full thread here, but I did read through the PEP and don’t anticipate any issues implementing / supporting this in uv.

Well, there’s one hitch, which is that we don’t compute the hash of downloaded wheels unless the user runs with --require-hashes or similar (in which case we compare them to the hashes reported by the registry or lockfile).

So, I might selfishly prefer that hashes could be empty, and we could just populate them when the user runs with --require-hashes. But, we could consider changing our behavior – it has other benefits too, of course, to always record a hash. (We do store computed hashes in the cache, so we can always write them to a provenance file on install afterwards – that part is not a problem.)

Thanks for your valuable feedback.

This PEP does not enforce creating the provenance_url.json file but rather suggests creating it. If the provenance_url.json file is created, though, it must state at least one hash.

As the primary use case for creating provenance_url.json file is security, would it make sense to always record a hash? uv can omit provenance_url.json file, unless --require-hashes is provided. Would this logic make sense and work for you?

I noticed in a pull request there were concerns that this proposal could be problematic for uv’s performance. I’m having a hard time imagining how this could have a performance impact and understanding how it relates to our cache. Could you expand on the concerns that were being discussed there?

Thanks for commenting here uv team!

To maybe sway your thoughts: one of the goals in my mind for PEP 710 is moving towards having every Python environment be verifiably reproducible. Unconditional storing of hashes means that a Python environment itself can be a “source” to create a lock file or Software Bill-of-Materials document. Having hashes in addition to a URL also makes it possible to link artifacts that were installed from a mirror back to their original source for purposes of software identity.
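
As a rough illustration of using an environment as the "source" for a lock file, the sketch below walks the installed distributions and emits pinned requirement lines with --hash options. The function names are hypothetical, and it assumes each provenance_url.json stores its digests under archive_info.hashes.

```python
import json
from importlib.metadata import distributions


def requirement_line(name, version, provenance):
    """Format one pinned requirement with a --hash option per recorded hash."""
    hashes = provenance.get("archive_info", {}).get("hashes") or {}
    options = " ".join(
        f"--hash={algo}:{digest}" for algo, digest in sorted(hashes.items())
    )
    return f"{name}=={version} {options}".rstrip()


def lock_from_environment():
    """Collect requirement lines for every distribution that recorded provenance."""
    lines = []
    for dist in distributions():
        text = dist.read_text("provenance_url.json")
        if not text:
            continue  # nothing recorded for this distribution
        lines.append(
            requirement_line(dist.metadata["Name"], dist.version, json.loads(text))
        )
    return sorted(lines)
```

Distributions without a provenance_url.json are skipped, so the result is only complete if every installed package carries the file and its hashes.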

I wonder if @pf_moore was referencing what Charlie raised in calculating and storing hashes, but tagging them to make sure that’s the case and there’s not more?

Yeah, I think it’s very likely that we move towards hashing all artifacts that we install, so I don’t view it as a huge problem for us. It also lets us do content-addressed caching and other nice things.
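
For illustration only, hashing every artifact can be folded into the write that already happens during a download. This is a generic sketch with a made-up helper name, not a description of how uv or pip work internally.

```python
import hashlib


def copy_and_hash(src, dst, chunk_size=64 * 1024):
    """Copy a binary stream to `dst` and return the SHA-256 hex digest of its content."""
    digest = hashlib.sha256()
    while chunk := src.read(chunk_size):
        digest.update(chunk)  # hash the same bytes that are being written
        dst.write(chunk)
    return digest.hexdigest()
```

Here src could be the response body and dst the file in the cache; the returned digest is what would be recorded under "sha256".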

Yeah, that’s very reasonable. It’s a bit of a shame that we would be unable to record the originating URL due to a lack of a hash, but it’s not a strong objection from me. It’s more that I would expect it to be a SHOULD rather than a MUST.

The point is not to weaken the use cases this PEP enables, namely lockfiles and SBOMs. You can still keep a URL in a uv-specific file in the dist-info directory, if desired. There is no restriction on which files may be present there, beyond the ones defined in standards (e.g., uv can have a uv_metadata.json file with additional metadata that could help uv).

If there are no strong objections to disallowing empty hashes in the provenance_url.json file, I will keep the current specification, which requires at least one hash, and mention this conversation in the PEP.
