PEP 710 - Recording the provenance of installed packages

sbidoul · August 1, 2024, 7:47am

A few questions after re-reading the specification part.

This file MUST NOT be created when installing a distribution package from a requirement specifying a direct URL reference (including a VCS URL).

Only one of the files provenance_url.json and direct_url.json (from PEP 610), may be present in a given .dist-info directory; installers MUST NOT add both.

I realize the first sentence perhaps makes too many assumptions about the UI of the installer. It makes sense for pip and uv, but maybe not for installer? Would the second sentence be sufficient as far as the specification is concerned?
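For readers who want to check an existing environment against the second quoted sentence, here is a minimal sketch. The function names are hypothetical and not part of any installer; the sketch only assumes that both files live directly inside a .dist-info directory, as the quoted text says.

```python
from pathlib import Path

# The two origin-record files named in the quoted specification text.
ORIGIN_FILES = ("provenance_url.json", "direct_url.json")


def origin_files_present(dist_info: Path) -> list[str]:
    """Return which origin-record files exist in a .dist-info directory."""
    return [name for name in ORIGIN_FILES if (dist_info / name).is_file()]


def violates_exclusivity(dist_info: Path) -> bool:
    """True when both files are present, which the quoted rule forbids."""
    return len(origin_files_present(dist_info)) == 2
```

A check like this only needs the second sentence of the specification; it says nothing about how the requirement was spelled on the command line.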

In the specification, should we reference PEP 610, or the corresponding page on packaging.python.org?

Would it make sense to move the paragraph about caching closer to the part that specifies what url is?

pf_moore · August 1, 2024, 9:52am

As raised in this comment on the pip tracker I’m still confused about the semantics of the URL field.

It seems that the PEP is adding a requirement on installation tools to:

Maintain a source URL value for any cached copy of a downloaded wheel or sdist (note that wheels in particular are identifiable by project name and version, so it’s not necessarily the case that a URL is required to cache in the first place).
Maintain some sort of valid URL for a cached wheel build (where the installer builds a wheel from source once, and then reuses it from a cache when the same source is requested in future).

Is that accurate? If so, I’m not sure it’s achievable in all cases. There seems to be a “get-out” in the sense that it’s acceptable to omit provenance_url.json altogether, but that seems like a rather heavy-handed approach when there might be something useful, but not perfect, that we could provide (as @charliermarsh mentioned over the situation where a hash isn’t available).

The PEP doesn’t really explain the intended use of the recorded URL, which would make it a lot easier to discuss semantics - is it a requirement that the recorded URL can be used to re-fetch the original source, or is it just for information?

For example, some index servers apparently serve wheels using a temporary, time-limited URL, which is completely useless for any form of fetching the wheel at a later date (apparently this is related to how the index handles user access limitations). I can’t tell from the PEP whether an installer should record that URL or not (and if the answer is “don’t record it”, I have no idea how you’d detect that it’s a non-permanent URL to make that decision).

I would strongly recommend that you don’t tie this PEP to lockfiles (I have no view on SBOMs, as I don’t know much about them, but I would be careful not to assume familiarity with SBOMs and what they need in the PEP text). The lockfile discussion is ongoing, and is tackling similar issues and could easily come up with different, possibly even incompatible, answers. If locking a pre-existing environment is a critical motivating case for this PEP, then you need to wait for a standardised lockfile to be agreed before trying to support it with this proposal. Otherwise, avoid assuming that lockfiles are going to be able to use this data, and justify this specification without reference to locking.

On that note, I’ll also mention that the motivation section of the PEP is very weak. It makes statements like “there are use cases for keeping records of distributions” and “users might get into a situation… and immediately finding out which wheel file was actually used during the installation might be helpful”. But there are no concrete examples of where this happens, or how the PEP would address the problem. It’s all very vague, and seems to be based on a presumption that “knowing where stuff came from” is an obvious good thing to have, and can be assumed without justification.

As an installer maintainer, I feel that the PEP doesn’t do a sufficiently good job of explaining to me what I need to provide. It feels like it would be way too easy for a user to come to pip complaining “the recorded URL isn’t right”, because it doesn’t satisfy their use case, and we would have no way of pointing to the standard and saying that what we provide is compliant, and their use case simply isn’t supported by the standard^[1].

Also, as a final procedural note, please remove references to PEP 610 from this PEP. You should be referencing, and linking to, the PyPA Specifications, which are the canonical specs - in particular Recording the Direct URL Origin of installed distributions and Direct URL Data Structure

In case it’s not obvious, we’ve had similar situations with other standards that in hindsight have turned out to be more vague than we’d anticipated, and it’s caused us no end of trouble. I don’t want that to happen again, if we can avoid it. ↩︎

sethmlarson · August 1, 2024, 5:03pm

Agreed, I would also like to see the PEP have a stronger motivation. Here are my own motivations for why recording the URL and the hash for every installed file is important:

URL and hash record the identity of the installed file (or the closest thing to an identity). This was a primary motivation for including the index_url field as well: if we know where an installer downloaded a file from, we can use that URL to reverse a software ID (for example, https://files.pythonhosted.org/packages/.../urllib3-2.2.2.tar.gz maps to the Package URL pkg:pypi/urllib3@2.2.2 [1]). Without the URL and hash it’s not possible to reliably reverse an installed file back into its software ID, since we can’t tell if a file is actually from PyPI, if it’s a modified version of an upstream package, or if it’s complete coincidence that an installed file has a matching name with an upstream package.
Being able to investigate or audit the installed environment. The installed environment is often disjoint from the instructions that determined which packages to install (for example, a Docker image, hosted build environment). Vulnerability scanners optimally take the resulting artifact (the Docker image) instead of the build instructions (requirements.txt, lock file, etc) to run their scans because it’s possible that the resolver or context is different between the scan and image build. For this reason, having the decisions that the installer made when creating the environment encoded into the environment makes the job of downstream processors easier.
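The URL-to-software-ID mapping mentioned in the first point can be sketched as follows. This is only an illustration: the function name is hypothetical, it handles .tar.gz sdists only, and it treats files.pythonhosted.org as the sole host that implies "came from PyPI", which is an assumption rather than anything the PEP states.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumption: only this host is taken to mean "downloaded from PyPI".
PYPI_FILE_HOST = "files.pythonhosted.org"


def purl_from_sdist_url(url: str) -> Optional[str]:
    """Map a PyPI sdist URL to a Package URL, or None if we cannot tell."""
    parsed = urlparse(url)
    if parsed.hostname != PYPI_FILE_HOST:
        return None  # unknown origin: do not guess a software ID
    filename = parsed.path.rsplit("/", 1)[-1]
    # Split "<name>-<version>.tar.gz"; the version itself contains no "-".
    match = re.fullmatch(r"(?P<name>.+)-(?P<version>[^-]+)\.tar\.gz", filename)
    if match is None:
        return None
    # The pypi PURL type uses lowercase names with "_" replaced by "-".
    name = match["name"].lower().replace("_", "-")
    return f"pkg:pypi/{name}@{match['version']}"


# The example URL from the post, with its elided path left as written.
print(purl_from_sdist_url(
    "https://files.pythonhosted.org/packages/.../urllib3-2.2.2.tar.gz"
))
```

Returning None for any other host is deliberate: without a known origin, the name and version alone do not establish that the file is the PyPI artifact.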

The SBOM case is similar to the scanning case, where being able to identify reliably all the software artifacts contained within some artifact (lock file, container image, operating system, virtualenv, etc) is what matters. Having this information stored somewhere explicitly reduces the need to guess in these cases.

Agreed, I think the lock file case isn’t necessary for the motivation of this PEP.

The most important part IMO is to track what got downloaded from where.

Would it work to describe this as recording the URL and hash(es) of the original file that is downloaded to satisfy a requirement (after all redirects, etc.), regardless of whether an intermediate form is cached or otherwise used? For example, a wheel built from a downloaded sdist would record information about the sdist, not the wheel. I don’t have enough context on the interaction between cached files, source dists being built into wheels, and other interactions to know whether that covers all the cases for installers today; this is where the experience of installer maintainers would be most useful in specifying what gets encoded in the record.

Private repositories that have one-time URLs are an interesting case, but presumably recording those URLs is still useful from an auditing perspective? The concern is encoding authentication information into an unexpected location like a Docker image.

1: Package URLs (PURLs) are a popular software ID standard, another is CPE which is common on CVEs but requires slightly more work to reverse from a package name and version (but not impossible!)

sethmlarson · August 1, 2024, 5:38pm

Also wanted to call attention to this comment, where the specification for Direct URLs also doesn’t make mention of the risk of authentication credentials in recorded URLs: PEP 751: lock files (again) - #61 by ncoghlan

pf_moore · August 1, 2024, 5:47pm

Even with the URL, can we actually do that? What would a URL https://localhost/urllib3-2.2.2.tar.gz map to? It might be a local proxy to PyPI, and there’s no way of knowing. And there’s presumably no canonical mapping for files located on localhost.

I’m not trying to be difficult here - I genuinely don’t know what the use cases are for SBOMs. And while it’s quite probable that “this would never happen” is the correct answer for SBOM use cases, installers have to work with anything, no matter how weird and niche - we simply cannot assume that we’ll only see the happy cases for the particular use case the PEP wants to support.

I know I’m getting into edge cases now, but it seems to me that there’s a disconnect here between the ideal (reliably identifying the software artefacts) and the reality (most of the time we probably can, but not always). I’m struggling to understand where the practical “good enough” is supposed to be drawn, given that the ideal is demonstrably not possible to achieve.

For example, if I install requests in an environment and then manually patch it for some reason, the provenance data is wrong. That’s not a situation that any standard can deal with - even the hashes in the RECORD file aren’t enough because I can patch them, too. And to be clear, I’m not talking about malicious changes here - there could be completely legitimate reasons for doing this.

So I’m fine with providing a URL where possible, but what I (as an installer author) consider “possible”, and what you (as a consumer of this provenance data) consider necessary, may not match - and there’s nothing in the PEP to guide us to a compromise or acceptable middle ground. This is what I was thinking of when I referred to people raising issues saying “the URL is wrong”.

OK. But if I tell you “X file got downloaded from URL Y”^[1], that gives you no assurance that you can retrieve the same file now. It’s a record of what was done, not an instruction for how to reproduce the action. And the two are fundamentally different.

If you genuinely mean that recording what the installer did is all that’s important (and you are happy to accept that it may not help you reproduce the environment) then that’s fine. But the PEP needs to make that very clear because I’m 100% sure that someone will interpret the data as providing a way to get the same file as the installer used…

I have no idea. It’s not my job to know or assert that - it’s the job of the PEP to make that statement and provide evidence that the relevant users agree.

I’m even willing to offer “on such-and-such date” if it helps ↩︎

sethmlarson · August 1, 2024, 6:08pm

Never assumed this! I really appreciate you providing the edge cases and pushing the motivation in the right direction.

The perfect happy-path case won’t always work, but I think it’s still worth recording the information.

There are a few tricks that scanners can pull to “try harder” to make the software ID match happen (for your example, comparing the hash to the upstream PyPI hash for a similarly named but different URL record means it’s a name and content match of the upstream and likely a mirror).
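A rough sketch of that "try harder" step is below. Everything here is hypothetical: the function name, the labels, and the upstream_sha256 mapping, which stands in for hashes a real scanner would have to obtain from the upstream index by some other means.

```python
from urllib.parse import urlparse


def classify_origin(url: str, sha256: str, upstream_sha256: dict[str, str]) -> str:
    """Guess where an installed file came from, given its recorded URL and hash.

    upstream_sha256 maps upstream filenames to their sha256 digests; how a
    scanner would populate it is outside the scope of this sketch.
    """
    parsed = urlparse(url)
    if parsed.hostname == "files.pythonhosted.org":
        return "pypi"
    filename = parsed.path.rsplit("/", 1)[-1]
    # Same filename and same content as upstream, but a different URL:
    # most likely a mirror or proxy of the upstream file.
    if upstream_sha256.get(filename) == sha256:
        return "likely-mirror"
    return "unknown"
```

This covers the localhost proxy example: the URL alone says nothing, but a name and content match against upstream is reasonable evidence of a mirror.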

The provenance data isn’t wrong, it records a traceable link that could be used to show the artifact was modified locally (assuming that someone can’t modify hashes of artifacts on PyPI, in which case we have bigger problems). By having a known “upstream” it becomes possible to encode “pedigree/patching” into an SBOM document automatically, too, for indices that are known to have an immutability guarantee.

That is my primary motivation for this PEP, yes. I can appreciate how there are cases where someone might lean on this information incorrectly and get burnt. The Direct URL standard is in a similar situation: there’s no guarantee that the URL will work again, just that it was used.

Agreed. Since you mentioned it, do you happen to know of projects or people we might ask to provide this feedback?

pf_moore · August 1, 2024, 8:39pm

It came up on the lockfile discussion, where @orf mentioned knowing of the existence of such repositories. I have no idea whether he has any idea what’s useful for provenance or auditability questions, though.

OK, in the light of that, here’s an extreme, but plausible, case. By that I mean that there are pip users who do every one of the steps in this example - I’m not aware if anyone does all of them together, but it certainly could happen.

The user has a local devpi instance, set up at a 10.x.x.x non-public address, which proxies PyPI.
The user runs pip wheel <some set of requirements> to download and/or build wheels for a set of packages, using the devpi instance as their only index. The wheels are put in a directory on the user’s PC.
The directory full of wheels is physically copied to a different system, that has no internet access, and no access to the local devpi server.
On that airgapped system, the user now runs pip install <same set of requirements> --no-index --find-links <wheel dir>.
Just to make life difficult, the user then deletes the directory of wheels, on both the target system and the intermediate system.

Some questions:

What, logically, is the provenance of the various packages in the final environment? It seems to me that “they all came from PyPI” is the most reasonable answer.
Do provenance use cases have the concept of a “chain of responsibility”, such that “A came from system 1, which got it from system 2” is a valid provenance? If so, does the PEP intend to rely on that type of “partial” provenance? If it does, that should be explicitly stated in the PEP.
Note that steps (2) and (3) of the above process explicitly break the provenance chain. There’s no metadata in a directory full of wheels that records where it came from.
Is it acceptable for a statement of provenance to include some sort of manual assertion like “I got these wheels from PyPI and copied them to the server”? For every use case that we want to support? If so, why not just make a manual statement for the whole thing, and drop the PEP altogether?

While I’m happy to hear answers to these questions in the context of SBOMs, the PEP is not “Providing SBOM data for Python environments”, so broader answers will be needed. That brings us back to the “what is the scope of the PEP” question, though.

My impression is that SBOMs are likely to be a key consumer of the data provided by this PEP, so what is needed for a SBOM is important. It’s just that it can’t be the only factor, unless the PEP is re-scoped to be just for SBOMs (and anything that can work with data targeted at SBOMs).

One other point that bears mentioning. The packaging ecosystem is fundamentally based around the idea that “foo 1.0” uniquely and accurately describes something that can be installed. Essentially, if “foo 1.0” is installed in an environment, I can get the source by downloading the sdist of foo 1.0 from PyPI. A big part of my difficulty in getting an intuition of what SBOMs^[1] are about is that I can’t get my head around why it even matters where the copy of “foo 1.0” on my system came from - except in the context of things like legal liability and security concerns, which are precisely the sorts of case where the sorts of “best guess” answers we’re considering are most unacceptable…

and maybe provenance as a whole ↩︎

ncoghlan · August 2, 2024, 2:09am

Concrete example: I’m working on a project that makes virtual environments portable (with caveats), and that means deleting any installed executable scripts outright. To avoid archive reproducibility problems that arise from shebang line rewriting in shipped (rather than generated) scripts, I delete all the RECORD files too (“no support for adding/removing packages from redeployed environment bundles” is one of the caveats).

The fact an unmodified Python auditing tool would fail on such environments is a feature rather than a bug, though.

fridex · August 2, 2024, 10:18am

Adjusted here.

The paragraph that specifies url is a continuation of the provenance_url.json specification. The url specification is followed by specifying what hashes is as a logical part for specifying file content. The paragraph related to caching follows content specification, as its addition. What is the case for moving the caching paragraph closer to the url specification?

The sentence can also be read from a library API point of view. Here is a PR to remove the sentence if its removal is required.

Thanks to @frostming we have an initial implementation in PDM. I’ve opened this PR to pip that has already gone through an initial review. Once the community consolidates on the last bits and understandings, it sounds like we could progress with the PEP - please let us know if anything else is needed.

I will be offline next week, also giving this PEP and discussions around it time to finalize.

pf_moore · August 2, 2024, 1:19pm

What do you consider the “last bits and understandings”? As the PEP author, you need to direct and manage the discussion, and currently I’m not at all clear what you feel is outstanding. For instance, we’ve already mentioned that the rationale/motivation section needs tightening up, and the scope and use cases need to be pinned down more specifically. Is that something you feel you have enough feedback to work on, or do you need more from the community? Remember, it’s not up to the community to justify this proposal - it’s your PEP and you need to make a persuasive argument that it’s worth implementing.

I see feedback (from @woodruffw and @sethmlarson) that auditing tools and SBOM creation tools might be able to use this data, but there are still a lot of open questions around situations where an installer might not be able to write a provenance_url.json file or (worse?) might record a URL that is unhelpful, or inaccessible. What’s your position on this? Do you want to try to put the responsibility for dealing with those edge cases on the installer or on the consumer?

Right now, it feels like the PEP leaves a lot of space for “implementation defined” behaviour, which is not ideal in an interoperability spec. For example, in the scenario I described above, if a user does pip install foo --no-index --find-links some_dir, it’s not clear to me without reading the pip source code whether pip would omit provenance_url.json, or write one with a file: URL pointing to a file in a local directory that no longer exists. And as a pip implementer, I’ve no idea which of those options is more useful to consumers.
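To make the two options concrete, here is a sketch of what each would produce for a wheel found via --find-links. It does not describe what pip actually does. The function name and the record_file_urls switch are hypothetical, and the JSON layout (a url key plus archive_info.hashes) is assumed from the PEP draft.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def provenance_for_local_file(archive: Path, record_file_urls: bool) -> Optional[str]:
    """Return provenance_url.json content for a local archive, or None.

    record_file_urls=False models "omit the file altogether";
    record_file_urls=True models "record a file: URL", which may later
    point at a path that no longer exists.
    """
    if not record_file_urls:
        return None
    digest = hashlib.sha256(archive.read_bytes()).hexdigest()
    record = {
        "url": archive.resolve().as_uri(),  # a file:// URL for the local path
        "archive_info": {"hashes": {"sha256": digest}},
    }
    return json.dumps(record, indent=2, sort_keys=True)
```

In the second option the hash survives even after the wheel directory is deleted, which is the part a consumer could still match against an upstream index.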

sethmlarson · August 2, 2024, 11:37pm

“Where it came from” matters for software ID, for vuln detection the most important field in an SBOM is the software ID. Vulnerability scanners will take the set of software IDs in an SBOM and try to match that against all the vulnerability databases out there (CVE, NVD, GHSA, OSV). Without software IDs in your SBOM, that SBOM is not useful for this use-case and tools rightfully complain about IDs being missing. The requirement for software IDs in SBOMs is repeated in regulations and guidance around SBOMs.

Without knowing whether a Python package came from PyPI, a private index, Linux redistribution, or somewhere else it’s not possible to make a determination on a software ID. This matters because vulns affect different software distributions differently depending on their applying of security fixes (a previous release of “foo 1.0” might be vulnerable to a CVE if it’s installed from PyPI, but Red Hat backported the fix to “foo 1.0-1” on their distribution, for example).

SBOM documents can be generated from the information on-hand and then further augmented later. For the example you provide, it would be difficult for an automated tool to generate a correct SBOM on the air-gapped system itself, but an SBOM with the set of hashes encoded in it could be further augmented to provide more context to someone/tool down the line.

If there’s not enough context to determine with certainty the software IDs, that is an acceptable gap for a tool to highlight in an SBOM document. I assume that today more tools are “guessing” that Python packages are from PyPI without checking, which is something this standard might help disambiguate.

If the motivation is going to be pushed towards software IDs (which IMO is still a worthwhile motivation and covers @woodruffw’s and my use-cases, William to confirm), we can add that this needs to be “installed from an index”. Would that remove all the ambiguity around when to use a Direct URL record versus Provenance URL record for an installer?

fridex · August 3, 2024, 8:28am

I expect alignment in the community and coming to conclusions so what WE come up with is valuable and useful. It is not “my PEP” - the PEP is here to serve its purpose and help.

I was part of an incident response team where we were unable to properly audit environments, and what is stated in PEP 710 would significantly reduce engineering time (see Security Implications in the PEP as an example). These things can also have direct customer impact. Every company, application, or installer could craft its own solution and implement it its own way, but having this standardized across the Python ecosystem is IMHO worth it. See also @sethmlarson’s comments on SBOM tools below.

Okay, we can elaborate on it.

+1

From PEP:

The value of the url key MUST be the URL from which the distribution package was downloaded. If a wheel is built from a source distribution, the url value MUST be the URL from which the source distribution was downloaded. If a wheel is downloaded and installed directly, the url field MUST be the URL from which the wheel was downloaded based on content path prefix. As in the direct URL origin specification, the url value MUST be stripped of any sensitive authentication information for security reasons.

That is, “the URL from which the distribution package was downloaded”: at the point in time when the Python distribution was downloaded (cached or installed), the URL was valid, since the installation process created provenance_url.json upon a successful installation. From the PEP’s perspective, it does not matter if the URL was no longer accessible right after the installation (or one year later), for whatever reason.
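The last requirement in the quoted text, stripping sensitive authentication information from the url value, could look roughly like this. It is a sketch, not any installer's implementation: the function name is hypothetical, and dropping the whole userinfo component is a simplification that ignores any exceptions the direct URL specification may allow.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(url: str) -> str:
    """Remove any user:password@ part from a URL before recording it."""
    parts = urlsplit(url)
    if parts.username is None and parts.password is None:
        return url  # nothing to strip
    host = parts.hostname or ""
    if ":" in host:
        host = f"[{host}]"  # restore brackets around an IPv6 literal
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(strip_credentials("https://user:secret@example.org:8443/simple/foo.whl"))
```

Note that this does nothing about secrets carried in the path or query string, such as the signed, time-limited URLs mentioned earlier in the thread.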

The URL can point to PyPI, and over time the distribution can be removed. Also, the URL can point to a server running on localhost or on a local network. It can also be a URL that was used during a hermetic build and is not accessible from a developer’s computer. The server can change its port tomorrow or be available under a different name, and we could go on. It simply does not matter. What matters, and the PEP states it, is that the installer used the given URL to download a distribution. There can then be use cases in which the given URL is still valuable even when the distribution cannot be re-downloaded (I assume this covers “unhelpful, or inaccessible”).

For example, the URL can encode a content hash in case of some hermetic build, which can be helpful when debugging the build process. It can also be helpful to see which server was used to obtain artifacts in region-specific installations. For specific builds of TensorFlow, the URL can show which optimized wheel, built specifically for some runtime environment, was downloaded. These cases can go on, but I would rather not continue. We cannot document all the possible cases, and there will always be a “+1 special case”. It is up to the maintainer of the environment used in builds/installations to understand how the environment is set up and how the url field can be helpful in their specific cases (if they find it useful). The PEP states that the url is the URL from which the distribution package was downloaded.

If you, or anyone else, feel this needs to be explicitly stated in the PEP, we can do so.

Okay, we can include pip install foo --no-index --find-links some_dir and explain it in the examples section. Is there anything else we should cover?

To proceed, it looks like we have these remaining things:

the rationale/motivation section needs tightening up
state pip install foo --no-index --find-links some_dir in the PEP
if required, state that the installer should not care about the url validity in time
if required, @sethmlarson’s clarification:

Have I missed any?

pf_moore · August 3, 2024, 10:08am

That would suggest that a hash without a URL would be useful, but a file URL without a hash would not, for example. Is that right?

I’m not sure. Is an index proxy still an index? Is a URL of https://10.1.17.2/devpi/+pmoore/pip/pip-24.0-py3-none-any.whl still useful? Even though it completely hides the fact that the wheel came from PyPI?

OK. Can that be added as a note in the PEP, please? It wasn’t at all clear to me from the text you quoted.

pf_moore · August 3, 2024, 10:45am

Sorry if I came across as too negative, or as dismissive of the role of the community’s views. I didn’t mean to. What I was trying to say was that the discussion here is almost entirely between people who are familiar with the problems this PEP is trying to solve, and with why having this data would be a good idea. But the PEP itself needs to be written so as to be understandable and accessible to people who don’t understand the motivation, and who may not see a need for the standard at all. The obvious example is installer maintainers who have to implement the standard, but other people may need to read the spec as well, and it needs to make sense to them.

My point is that as the PEP author, you need to take what’s been agreed on here, and present it in a way that is understandable to that audience. And in my view (both as PEP delegate, and as one of the people not familiar with the use cases), the PEP doesn’t currently succeed in that.

I’m trying to explain my confusion here, hence my “what does all this mean” posts, but it’s very hard to bridge the gap between my lack of familiarity with the use cases, and the discussions here. Your example of being on an incident response team helped me a little, because it helped me relate a situation this PEP might apply in to my own experience. It would be worth adding this explicitly to the motivation section. Although I will admit that when I’ve done incident diagnosis, “where did this 3rd party package come from” was rarely if ever a key question (I’d be looking at higher level aspects of the system if the components that made up a production environment weren’t already locked down by environment build processes, etc).

In a similar way, the explanations @sethmlarson has given about SBOMs have been useful, but I’ve never used or produced a SBOM in my career, and my instincts are that I’d expect it to be something that was produced at environment build time, by the tools/workflow that builds the environment, rather than after the fact by analysis of what ended up being present.

So I guess the common theme here, which I’d like the PEP to clarify, is why do we need an after-the-fact mechanism like this rather than focusing on environment build processes that produce information like SBOMs and audit reports as part of the build? Specifically, an installer isn’t (IMO) an environment build tool, it’s a low-level utility that would be part of an environment build process.

Conversely, these days most of the environments I build (for hobby and casual use) are very ad hoc, and don’t use a formal “environment build” workflow. But that’s fine, as these environments are also not ones where I expect to need auditability, provenance data, or install tracking. So having the installer record this data in that type of environment is useless^[1] to me.

So what I’d like to see in the PEP is a discussion of:

What sorts of environments would this data be useful in?
Why is it not reasonable to integrate the collection of this data into the environment build process?
If we do want to push the responsibility for recording this data to lower-level tools like installers that lack the wider context, what are the implications of that limited viewpoint on the data that can be collected, and are those implications acceptable for the use cases we’re talking about?
In the longer term, what would be the fate of this low-level data as users start to take a more holistic view and incorporate provenance tracking into environment build processes? (I.e., is this a short-term solution which will become obsolete, will it be re-purposed into a component of a bigger standard, or what?)

Is that something you can add to the PEP?

(By the way, I’m trying very hard here not to frame my comments as “this is something that corporate users need, why are volunteer developers being expected to put time into developing and maintaining this?” But it is a question that’s at the back of my mind - why does pip need to maintain this data so that companies can fulfil their responsibilities to provide SBOM data for their commercial projects, rather than the companies doing the work themselves as part of environment build?)

not wrong, just of no benefit ↩︎

ncoghlan · August 4, 2024, 4:25am

In particular, I don’t think the potential relationship described between this as-installed metadata and an as-built SBOM is sensible as described. The current PEP text suggests an SBOM might be generated from the installed provenance metadata, but it seems to me that a more fruitful relationship would be to use it as an auditing mechanism to confirm that an as-installed environment is consistent with a given SBOM emitted by the build process.

For example, consider this deployment flow:

build process emits a wagon archive and an expected environment SBOM
device updater creates a Python environment by installing artifacts from the included wagon archive
auditing tool checks the as-installed metadata in the device against the as-built metadata in the SBOM

Key questions for the PEP to address in that scenario:

assume wagon is NOT updated to propagate information from the archive build process for inclusion in the provenance_url.json files. What assertions would the auditing tool be able to make about the as-installed environment when comparing it to the as-built metadata?
now assume wagon IS updated to propagate provenance information from the archive build process for inclusion in the provenance_url.json files. What information would it need to propagate? How would it pass that information to the installer? What stronger assertions would the auditing tool be able to make given this additional information?

(This isn’t about checking for malicious modification, it’s about checking for bugs like devices downloading packages directly from PyPI, rather than relying solely on prebuilt packages provided by device specific means, as in the example above)
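The third step of that flow, checking as-installed metadata against as-built metadata, might be sketched as below. The names are hypothetical, the expected hashes are passed in directly rather than parsed from a real SBOM format, and the provenance_url.json layout (url plus archive_info.hashes) is assumed from the PEP draft.

```python
import json
from pathlib import Path


def audit_environment(site_packages: Path, expected_sha256: set[str]) -> dict[str, str]:
    """Compare installed provenance records with hashes from the build.

    Returns a mapping of .dist-info directory name to a finding; packages
    whose recorded sha256 is in expected_sha256 produce no entry.
    """
    findings: dict[str, str] = {}
    for dist_info in sorted(site_packages.glob("*.dist-info")):
        record = dist_info / "provenance_url.json"
        if not record.is_file():
            findings[dist_info.name] = "no provenance record"
            continue
        data = json.loads(record.read_text(encoding="utf-8"))
        sha256 = data.get("archive_info", {}).get("hashes", {}).get("sha256")
        if sha256 is None:
            findings[dist_info.name] = "no sha256 recorded"
        elif sha256 not in expected_sha256:
            # E.g. a device that downloaded directly from PyPI instead of
            # using the prebuilt artifacts shipped with the update.
            findings[dist_info.name] = f"unexpected artifact from {data.get('url')}"
    return findings
```

With only hashes to go on, the tool can assert "this artifact was or was not one the build expected"; the recorded url is what lets it say where an unexpected one came from.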

woodruffw · August 5, 2024, 3:32pm

Seth Michael Larson:

“Where it came from” matters for software ID, for vuln detection the most important field in an SBOM is the software ID. Vulnerability scanners will take the set of software IDs in an SBOM and try to match that against all the vulnerability databases out there (CVE, NVD, GHSA, OSV). Without software IDs in your SBOM, that SBOM is not useful for this use-case and tools rightfully complain about IDs being missing. The requirement for software IDs in SBOMs is repeated in regulations and guidance around SBOMs.

Without knowing whether a Python package came from PyPI, a private index, Linux redistribution, or somewhere else it’s not possible to make a determination on a software ID. This matters because vulns affect different software distributions differently depending on their applying of security fixes (a previous release of “foo 1.0” might be vulnerable to a CVE if it’s installed from PyPI, but Red Hat backported the fix to “foo 1.0-1” on their distribution, for example).

+1 to this rationale – we’ve seen multiple variants of this crop up with users of pip-audit, and having this kind of URL provenance would allow us to improve both our positive and negative accuracy on package/version matches.

Yes, confirming – for auditing purposes “installed from an index” is perfectly sufficient!

IMO this is still useful: having the URL at all allows pip-audit to be configured for different behaviors: with the metadata in this proposed PEP, we’d likely warn (and ignore) non-PyPI URLs by default, and then expose settings in pip-audit allowing people to explicitly check them anyways.
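To make the "warn and ignore by default, opt in to check" idea concrete, here is a minimal sketch. It is not pip-audit's actual code or option set: the function name `should_audit`, the `check_non_pypi` flag, and the assumption that PyPI files are served from `files.pythonhosted.org` are all choices made for this example.

```python
import warnings
from urllib.parse import urlsplit

# Assumption for this sketch: files downloaded from PyPI come from this host.
PYPI_FILE_HOSTS = {"files.pythonhosted.org"}


def should_audit(url, check_non_pypi=False):
    """Decide whether a distribution recorded with `url` should be audited.

    `check_non_pypi` is a hypothetical setting letting users opt in to
    checking distributions that did not come from PyPI.
    """
    host = urlsplit(url).hostname
    if host in PYPI_FILE_HOSTS or check_non_pypi:
        return True
    # Default behaviour in this sketch: warn, then skip the distribution.
    warnings.warn(f"skipping distribution installed from non-PyPI URL: {url}")
    return False
```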

In the case of mirroring, this might be a place where this PEP and PEP 708 can interoperate slightly better – PEP 708 gives indices the ability to mark “tracked” or “alternate location” indices, meaning that a mirror URL in provenance_url.json could in theory be tracked back to the upstream PyPI URL that it came from. To do this reliably, however, this PEP may need to additionally include an index_url or similar base key that points to the base index that the URL was resolved from (e.g. https://pypi.org/simple). This would have to be empty for direct URL installations of course. @fridex I’m curious for your thoughts on the above

sethmlarson · August 5, 2024, 6:31pm

Without hashes the URL is less useful since it’s not verifiable, but assuming you’re not in an adversarial environment the URL is still useful.

There are multiple types of SBOM. It sounds like you’re talking about “build” SBOMs, which are handled in the way you describe: they are produced when the described artifact is built (i.e., when the wheel is built). As you mention, though, this is not the job of installers.

For “runtime” SBOMs, because of the diversity of tools involved in creating a runtime environment (docker, pip, Linux package manager, etc.), after-the-fact analysis is where tools are heading, at least right now. Generally this means that tools generate runtime SBOMs from a composition of all the packaging metadata included in the environment. PEP 710 would be “doing our part” to make that composition possible for Python environments.
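A minimal sketch of what that after-the-fact collection could look like for a Python environment, using the standard library's `importlib.metadata`. The function names and the shape of the returned records are assumptions for illustration; a real SBOM generator would map this data into a format such as CycloneDX or SPDX.

```python
import json
from importlib.metadata import distributions


def provenance_record(dist):
    """Return the parsed provenance file of a distribution, or None.

    Looks for provenance_url.json (PEP 710) first, then direct_url.json
    (PEP 610); at most one of the two should be present.
    """
    for filename in ("provenance_url.json", "direct_url.json"):
        text = dist.read_text(filename)  # None when the file does not exist
        if text:
            return {"source": filename, **json.loads(text)}
    return None


def runtime_inventory(dists=None):
    """List name, version and provenance for each installed distribution."""
    inventory = []
    for dist in (distributions() if dists is None else dists):
        inventory.append(
            {
                "name": dist.metadata["Name"],
                "version": dist.version,
                "provenance": provenance_record(dist),
            }
        )
    return inventory
```

Calling `runtime_inventory()` with no arguments walks the current environment; distributions installed without either file simply get `"provenance": None`.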

This is definitely a confusing topic because, like you mention, it’s brand new and very few people have used the tools or worked with an SBOM before. I am drafting up an “overall SBOM strategy for Python packages” that I will be requesting community input for, so stay tuned for that.

Correct me if I’m misunderstanding, but to build such an auditing mechanism you’d still need a tool that’s capable of generating an “actual” SBOM from the environment to verify against your “expected” SBOM.

This is a good comment to address because I try to be cognizant of this point in my work related to SBOMs. Not many folks will need an SBOM directly when they’re wearing their volunteer hats.

SBOMs are the direction that the software industry is heading and there will be demand for SBOMs of upstream, something that I don’t want to manifest as a mass opening of GitHub issues and emails to maintainers asking for SBOMs with the ecosystem caught flat-footed. My current goal is to pinpoint the few areas where change would need to happen to empower SBOM generation tooling and then let those tools be responsible for the rest. IMO, an index installation record is one of those areas.

Since I’m paid to work in this area, I’m happy to contribute towards this effort to further reduce the work for volunteers. Obviously there are limits to that: tool and standards maintainers will still need to review submissions. But if there’s an area that you think is neglected, either in the implementations that @fridex has linked or elsewhere, I am happy to fill the gap.

ncoghlan · August 6, 2024, 7:36am

You’re not missing anything, I was (specifically, the runtime SBOM vs build SBOM distinction that you mentioned in your post).

That said, I think the gist of my request still holds:

no matter the terminology, the current PEP text doesn’t really explain the auditing task of checking an as-installed runtime environment against an as-built SBOM (the closest it gets is a reference to “Another use case could be tools reporting software installed, such as tools reporting a SBOM (Software Bill of Materials), that might give more accurate reports.” without explaining that in an auditing use case a runtime SBOM would be generated in addition to a build SBOM, rather than instead of it)
the PEP doesn’t consider options for potentially propagating provenance information through wheel caches (whether downloaded or locally built from source), not even to the extent of explicitly deeming that to be out of scope (while it would be nice if there was a standardised way for tools like wagon to propagate provenance info from the environment that constructs the wagon archive to the deployment environments that install the included wheel files, it’s also something that can readily be deemed a “later” problem).

ncoghlan · August 8, 2024, 8:30am

Just noting that I misread the spec page when I posted this comment. The risk is mentioned and handled by these three paragraphs:

When persisted, url MUST be stripped of any sensitive authentication information, for security reasons.
The user:password section of the URL MAY however be composed of environment variables, matching the following regular expression:
\$\{[A-Za-z0-9-_]+\}(:\$\{[A-Za-z0-9-_]+\})?
Additionally, the user:password section of the URL MAY be a well-known, non security sensitive string. A typical example is git in the case of a URL such as ssh://git@gitlab.com/user/repo.
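A minimal sketch of how an installer might apply these three rules before persisting url. The function name `redact_url` and the `safe_users` parameter are assumptions for this example; the regular expression is the one quoted above, anchored so it must match the whole user:password section.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# The regular expression from the spec text above, anchored at both ends.
ENV_VAR_RE = re.compile(r"^\$\{[A-Za-z0-9-_]+\}(:\$\{[A-Za-z0-9-_]+\})?$")


def redact_url(url, safe_users=("git",)):
    """Strip the user:password section of `url` unless the spec allows it.

    `safe_users` is a hypothetical allow-list of well-known, non-sensitive
    strings such as "git".
    """
    parts = urlsplit(url)
    if "@" not in parts.netloc:
        return url  # no user:password section at all
    user_pass, host = parts.netloc.rsplit("@", 1)
    if ENV_VAR_RE.match(user_pass) or user_pass in safe_users:
        return url  # environment variable references or a well-known string
    # Anything else is treated as sensitive and dropped.
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

For instance, `redact_url("https://user:secret@example.com/pkg.whl")` drops the credentials, while `ssh://git@gitlab.com/user/repo` and a URL using `${USER}:${PASS}` pass through unchanged.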

I’ve posted a PR to give these three paragraphs their own subheading: “Add Direct URL security heading” (pull request #1585 on pypa/packaging.python.org).

fridex · August 15, 2024, 1:13pm

Sure, will do. Thanks!

No offense taken, no worries.

Yes, definitely. Thanks for your explanation. I understood PEPs to be more for technical discussion aimed at reaching consensus on a spec, but if they should also give guidance to readers without specific expertise, we can definitely elaborate and make the PEP more understandable. I will try to do that, thank you.

While I do agree storing index_url could be useful in some cases, we decided to drop it from the spec; see the rejected ideas section. I can imagine auditing tools, such as pip-audit, maintaining their own mapping of download locations (e.g. any url value matching https://files.pythonhosted.org/packages/* means PyPI). This can also help with your own indexes that aggregate packages and point to PyPI. As PyPI exposes file hashes through its API, it might also be worth doing a hash comparison, rather than relying solely on the url field.

EDIT: … so that pip-audit can detect mirrors when auditing against PyPI, which might give much better accuracy, instead of performing a check on the url field.

EDIT2: … or provide information about an installed distribution that is no longer available on PyPI (or any other index that provides information about hashes), warn about possible dependency confusion, and so on. I think using hashes opens up a whole range of features.
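A minimal sketch of the two ideas above: a tool-maintained mapping from download locations to origins, and a hash comparison that still works when the url points at a mirror. The names `KNOWN_ORIGINS`, `classify_origin` and `matches_index_hashes` are assumptions for this example, and `index_hashes` stands for whatever hashes the tool has already fetched from an index for the same file; the record layout follows the url and archive_info.hashes keys of provenance_url.json.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping an auditing tool might maintain.
KNOWN_ORIGINS = {
    "https://files.pythonhosted.org/packages/*": "PyPI",
}


def classify_origin(url, known_origins=KNOWN_ORIGINS):
    """Return the origin name for `url`, or None if no pattern matches."""
    for pattern, origin in known_origins.items():
        if fnmatchcase(url, pattern):
            return origin
    return None


def matches_index_hashes(provenance, index_hashes):
    """Check whether any recorded hash equals the hash published by an index.

    `provenance` is a parsed provenance_url.json; `index_hashes` maps
    algorithm names to hex digests obtained from the index.
    """
    recorded = provenance.get("archive_info", {}).get("hashes", {})
    return any(index_hashes.get(algo) == digest for algo, digest in recorded.items())
```

A record whose url is unknown to `classify_origin` but whose hash matches what PyPI publishes is likely a mirrored copy; a record with no match anywhere is a candidate for a "no longer available" or dependency confusion warning.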

+1, the build system is the source of truth on what packages were installed.

If we take a look at public container registries, we do not have any provenance information in container images. If I download a container image from a public registry as an open-source enthusiast (note, no corporate use case here), there is basically no information about provenance. Since I do care about security, I want to have at least a notion of where packages were installed from and of their integrity (and other tools, such as pip-audit, can assist with that). In an ideal world, these container images would have associated SBOMs from the build system (and additional build information), but we are nowhere close to that in open source, at least as of today. Chainguard is putting micro-SBOMs of installed packages into their container images, so anyone who downloads them can audit them. I’m not saying Python installers should solve this issue, not at all. Nevertheless, as we see multiple use cases for PEP 710 where it can be helpful either for the community or in companies (see also @ncoghlan’s wagon archives, auditability and reproducibility of environments, SBOMs, eventually lock files), why not provide it?

Thanks for the comments, I will start incorporating them into the PEP. It looks like we indeed need to clarify approaches.