Also, there’s a separate pull request moving “Selecting a Single SBOM Standard” from Open Issues to Rejected Ideas, noting that the current PEP doesn’t preclude a future PEP from selecting a single SBOM standard if one is able to “win” over all the others. Thanks @woodruffw and @steve.dower for contributing to this discussion!
Okay, the updates to the “How-to-Teach” section have been merged, please take a look. Now that this PR has landed, I’ve addressed all the feedback I’ve received so far. My plan is to ask for pronouncement if nothing comes up in the next week. Thanks all!
The current PEP document has a focus on documenting runtime dependencies (including shadow dependencies).
However, it could be argued that tracking information about the build system itself (version numbers and digests of compilers, wheel repair tools, manylinux docker image, CI modules …) can be useful to assess software supply chain integrity in case one such tool has been discovered to be compromised by an attacker to inject malware into Python wheels (e.g. via a compiler backdoor).
I wonder whether the PEP should be updated to describe this use case and set of practices, or whether build-time provenance tracking is considered outside the scope of PEP 770.
Note that MBOMs might also serve as an enabler to automate https://reproducible-builds.org/ for PyPI wheels and subsequently serve as a proactive tool to detect supply chain attacks rather than just a reactive tool to assess the impact of a known CVE.
Agreed with everything you’re saying in terms of motivations. The PEP itself calls out Pillow’s use of auditwheel to “repair” libraries into the wheel as part of the motivation, so I think that’s covered appropriately? The PEP doesn’t currently call out the “build tools and environment” as explicitly being recordable as an SBOM in packages, but there’s nothing about the specification that would prohibit users from doing so. I can add more to the motivation on that in particular.
Sorry for being late to the party, but I only had a chance to review now. Overall, +1 for this PEP - having a way to include SBOMs will be useful.
The “does it have to be dynamic” question was pretty much the first question that came up for me. It is a little odd that the PEP doesn’t contain the word “dynamic” at all, and that all the examples are for static metadata, while from the discussion it looks like the design more or less forces dynamic usage. The auditwheel fork example connects to Paul’s remark here and to static vs dynamic: is it feasible for auditwheel to start amending an existing SBOM file/template? That could keep it in the realm of static. If auditwheel can only emit a separate file, then it’s dynamic in nature. The former would be nicer. Is there a way to implement that @sethmlarson, at least for CycloneDX and SPDX?
It would also be nice if the PEP included a more realistic example. You mention the Pillow case multiple times, so why not include that as a worked example? It’d give readers a much better idea of what starting to provide SBOMs will look like in practice. What will the pyproject.toml content, the SBOM in VCS and sdist, and the SBOM(s) in the wheel look like?
Final comment: the motivation explains that Python packaging metadata cannot describe non-Python software components. It’d be good to point out that PEP 725 aims to solve a significant part of that metadata gap.[1]
[1] PEP 725 didn’t move much last year because the authors both had very little time; however, we’re revisiting it now and aim to get it over the finish line over the next months.
I can add more to the motivation on that in particular.
I think it’s worth making this more explicit as part of the motivations, but only if there is a consensus among experts that it’s a good idea to use SBOMs files for that purpose (improving supply chain transparency to automatically extract a list of packages that are likely impacted by a compromised build system).
One thing I am not sure about in particular is whether SBOM formatted files are a good way to specify all the buildinfo necessary for reproducible builds. I am not 100% sure myself whether attempting to make wheel builds bitwise reproducible is a worthwhile endeavor (e.g. compared to leveraging binary scanners to detect malware in wheel packages). Maintaining reproducible build systems is probably not cheap and is really valuable only if upstream build dependencies are themselves reproducible, which can be quite challenging given the heterogeneity of the wheel building workflows.
1. Specifying SBOM files statically that are included in source (e.g. pip bundling software in pip/_vendor).
2. Build backends can generate an SBOM document that gets included in the wheel (e.g. maturin generating an SBOM for Cargo.lock).
3. Other tools which modify a wheel can include additional SBOM documents (e.g. auditwheel repair).
The bottom use-case has nothing to do with dynamic because auditwheel isn’t a build backend. For the other two, SBOMs that were generated by the build backend would require dynamic to be used.
However, one case that I didn’t think of is a project wanting both statically bundled and dynamically generated SBOM documents. I don’t think this would be a common situation, but I can imagine a project wanting both? Especially if a build backend wants to add automatic SBOM document generation without requiring its users to opt in, this could cause issues if those users are already manually specifying SBOMs.
If we want to eliminate dynamic we would need to move to specifying a single directory (Sbom-Dir/project.sbom-dir?) which I believe was suggested earlier in the thread. This would be an entirely new “mechanic” for specifying multiple files in a Python archive (compared to PEP 639 and license-files) but I don’t view that as a blocker?
So in the case that projects want to support both static and dynamic SBOMs they’d specify project.sbom-dir for their static SBOMs and then build backends (and auditwheel) would need to generate and place their SBOM documents into the pre-defined directory with a mechanism for ensuring there are no name collisions (subdirectories could work here).
Hmm, this isn’t my understanding of “dynamic” for pyproject.toml, which is that it only applies to build backends and not to tools that modify the archives post-build. Is that assumption incorrect?
As for whether it’s possible to implement: it is, but even dedicated tools for this task are quite involved, which is why I wanted to push the problem downstream, where tools have to implement multiple SBOM standards and “merging” anyway.
I do link out to the example project and a blog post describing the project in the PEP, but I can make other more 1-to-1 example project configurations for this.
I’ll add a reference to PEP 725 about how the two standards complement each other, thanks!
This was my motivation for the design of this specification: we don’t need to know what the future holds to support this use-case. Opaque SBOMs with their meaning defined by specifications and user tooling are enough at this stage IMO.
The purpose of dynamic is to say, “the absence of data for this key doesn’t mean it won’t exist in the final wheel”. Basically we didn’t have a way to say that the absence of data in pyproject.toml was on purpose, so dynamic signals to tools consuming a pyproject.toml file that they don’t have a complete picture for anything listed in dynamic.
That is incorrect indeed. It doesn’t matter what type of tool is involved, it matters what ends up in metadata files (pyproject.toml, PKG-INFO in sdist, METADATA in wheel) that are uploaded to PyPI. A few relevant snippets:
From PEP 621 – Storing project metadata in pyproject.toml: “Data specified using this PEP is considered canonical. Tools CANNOT remove, add or change data that has been statically specified. Only when a field is marked as dynamic may a tool provide a ‘new’ value.”
From Core metadata specifications - Python Packaging User Guide: “If a field is not marked as Dynamic, then the value of the field in any wheel built from the sdist MUST match the value in the sdist. If the field is not in the sdist, and not marked as Dynamic, then it MUST NOT be present in the wheel.”
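To make the second quote concrete, here is a rough sketch of the check a tool could run when comparing an sdist’s PKG-INFO against a wheel’s METADATA. The function name is hypothetical, Sbom-File is the field proposed by the PEP, and skipping Metadata-Version and Dynamic themselves is a simplification of mine, not something the spec words this way.

```python
from email.parser import Parser


def check_static_fields(pkg_info: str, wheel_metadata: str) -> list[str]:
    """Return lowercase field names that break the sdist-to-wheel rule.

    Hypothetical helper: a field not listed under Dynamic in the sdist
    must have the same values in the wheel, or be absent from both.
    """
    sdist = Parser().parsestr(pkg_info)
    wheel = Parser().parsestr(wheel_metadata)

    # Fields the sdist explicitly marks as allowed to change.
    dynamic = {v.strip().lower() for v in sdist.get_all("Dynamic", [])}
    # Simplification: do not compare these bookkeeping fields.
    skipped = {"metadata-version", "dynamic"}

    names = {k.lower() for k in sdist.keys()} | {k.lower() for k in wheel.keys()}
    problems = []
    for name in sorted(names):
        if name in dynamic or name in skipped:
            continue
        # Multi-use fields (like the proposed Sbom-File) are compared as sets
        # of values; order is ignored.
        if sorted(sdist.get_all(name, [])) != sorted(wheel.get_all(name, [])):
            problems.append(name)
    return problems
```

With this reading, a wheel that gains an extra Sbom-File entry fails the check unless the sdist lists Sbom-File under Dynamic, which is exactly the tension for the auditwheel case.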
That approach probably makes sense, given it’s pretty complex to implement. But then indeed it’s purely dynamic for the auditwheel case.
I read this blog post and the pillow-auditwheel-sbom.cdx.json file, which are indeed quite useful. Before your answer I had some trouble connecting that to what pyproject.toml and other metadata files would look like, which is what I was asking about. I’m still not 100% sure if it’s only pyproject.toml containing dynamic = ['sbom-files'] or if there’s an SBOM template as well. I think the former, assuming auditwheel et al. will be updated the way you suggest. It’d be good to spell that out. The whole SBOM is a bit long indeed though, so linking out to that sounds reasonable. Including one or a couple of shorter examples would be great, thanks.
It may be relatively common for projects where auditwheel vendors dependencies; those are typically larger projects, and those often vendor some packages in their source tree as well. Multiple examples quickly come to mind. E.g., for SciPy:
There’s a set of vendored components in VCS, see core-dev-guide/vendored-code (the list has grown since that was written)
Depending on the platform, shared libraries like libopenblas and compiler runtime libraries like libgfortran/libquadmath will be vendored in.
Packages like NumPy, PyTorch, Matplotlib, scikit-learn, CuPy and others of that size are all in the same boat (even smaller ones perhaps, since the vendoring of utilities like six and versioneer is quite common). In case it helps you to work out a case like this in more detail, please feel free to ping me - I’m happy to help.
One other case comes to mind: it’s also possible for an sdist to vendor components that do not end up in the wheel (e.g., build, test or benchmarking tools). SciPy et al. have that case too. That also requires a decision about whether or not to include those components in the sdist’s SBOM, and then dynamically drop them from the SBOM file in the wheel.
I think that may actually work better.
Agreed. It’s worth at least trying to work it out a bit more. It may even turn out simpler, since there’s a single directory that either exists or not.
The flip side may be that the simplest static case is no longer as simple; if all a tool can see is that sbom-dir/* exists, it cannot assume that the contents of that directory are identical across the sdist and all wheels. I’m not sure how much benefit there is to a tool making that assumption.
For context on everything below: of the three SBOM document sources I outlined, I suspect that statically defined SBOMs (#1) and SBOMs appended to wheels by auditwheel and similar tools (#3) will be the quickest to adopt the standard and begin seeing value, and I see dynamically generated SBOMs from build backends (#2) as coming later but being important long-term. This is why, IMO, requiring users to specify dynamic before build backends can adopt SBOM generation should ideally be avoided, as it’s a bunch of churn for a feature that project maintainers should optimally have to care about as little as possible. I should encode these assumptions into the PEP, too.
I’ve done some thinking and here are two approaches that could avoid needing dynamic and enable build backends to adopt this standard without requiring more user input than is necessary.
Approach #1:
Users specify dynamic in pyproject.toml because they know at that point in time that the build backend will supply a value. For example, since version is a required value, it makes sense that you’d need to specify dynamic = ["version"] if you’re delegating that responsibility to a build backend.
The key issue with dynamic for sbom-files is that a key being dynamic is all-or-nothing: it’s either dynamic and the build backend handles everything or it’s not and the key is statically defined up-front and the build backend can’t change the value. There’s no room for a build backend being helpful and adding additional information.
SBOMs are different from most metadata: the way they’re framed in the specification, they are write-once (“opaque”) from a build perspective and don’t conflict with each other if more are appended to an archive. There is little “consequence” if a build backend were to add additional SBOM documents to a Python package and specify the new document(s) with Sbom-File in metadata even if sbom-files isn’t explicitly dynamic. If we require build backends to only add SBOMs automatically when sbom-files is inside of dynamic, this would discourage build backends from automatically annotating builds with SBOM documents, as doing so would require users to modify their pyproject.toml for something that should be an automatic net-benefit without user input.
If we can agree that it’s acceptable for a build backend to add Sbom-File during the build even without dynamic = ["sbom-files"] being explicitly set, then the current specification already covers all SBOM sources.
Approach #2:
The other potential approach would be to redefine where tools are expected to look for SBOM documents in a Python wheel. Instead of using the Sbom-File field for both source distributions and wheels, tools would need to use a different method for each archive type:
Source distributions would have all SBOM files defined statically by the Sbom-File field in metadata.
Wheels would have all SBOM files be within the .dist-info/sboms directory (at any directory depth, as is the current specification).
The need for using dynamic is avoided in the above approach because build backends would copy SBOM documents from sdists to their relative position in .dist-info/sboms to maintain the same values of Sbom-File metadata (as is done in the current spec and PEP 639). Any SBOM documents that are generated by the build backend would be placed into .dist-info/sboms (with some collision-avoiding logic) but critically would not have an additional Sbom-File field appended to the metadata. Auditwheel could similarly append SBOM documents to .dist-info/sboms and not add any Sbom-File field to metadata.
I don’t think we can agree to that. As Brett said, if dynamic is not set, the metadata as specified in pyproject.toml is definitive, and must not be changed at any point further “down the line” in the build process.
Technically, the metadata is the file name, and while it’s not allowed to change the name(s), it could be permitted to change the file content. But I would consider that a violation of the spirit of the definition of the dynamic field, even if it’s within the letter of the spec.
I’m not PEP delegate for this proposal, but I was for PEP 621, and I would object to dynamic being misused in this way.
That again would be in violation of the specs. PEP 643 defines Dynamic for source distributions, and by that spec, if Sbom-File is not declared as dynamic in the sdist, it must be preserved in the wheel. And if it is defined as dynamic in the sdist, it must be dynamic in pyproject.toml.
IMO, you need to drop the idea of having SBOM data referenced in the metadata, and instead just have it as a set of files in a well-defined subdirectory of .dist-info. This means that SBOM data isn’t discoverable statically in a sdist or source tree, but I think that’s an unfortunate necessity if you want backends to be able to change SBOM data without the user needing to be aware of it.
I’m not sure I follow this comment, am I misunderstanding something? In the example I give Sbom-File wouldn’t be declared as dynamic in the sdist or the wheel, these would be values from pyproject.toml in project.sbom-files. From the user and build backend POV this is exactly how the PEP is proposed today.
The only difference is defining the logic for how tools inspecting wheels discover all SBOMs in a wheel. With this redefinition tools can’t assume that all SBOMs are annotated with Sbom-File metadata, tools would need to look for all files under .dist-info/sboms in case the build backend or auditwheel added more SBOM files during the build.
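As a rough illustration of that discovery logic, a tool inspecting a wheel could ignore Sbom-File entirely and just walk the archive. This is only a sketch: the function name is hypothetical and it assumes the .dist-info/sboms location described above, at any directory depth.

```python
import io
import zipfile


def find_sbom_files(wheel: zipfile.ZipFile) -> list[str]:
    """List every file under <name>.dist-info/sboms/ in a wheel archive.

    Hypothetical helper: it does not consult Sbom-File in METADATA, so it
    also finds documents added later by a backend or a repair tool.
    """
    found = []
    for name in wheel.namelist():
        if name.endswith("/"):
            continue  # skip explicit directory entries
        parts = name.split("/")
        # Expect: <dist>.dist-info / sboms / <one or more path segments>
        if len(parts) >= 3 and parts[0].endswith(".dist-info") and parts[1] == "sboms":
            found.append(name)
    return sorted(found)


# Build a tiny in-memory archive with made-up file names to try it out.
_buf = io.BytesIO()
with zipfile.ZipFile(_buf, "w") as _zf:
    _zf.writestr("demo-1.0.dist-info/METADATA", "Name: demo\n")
    _zf.writestr("demo-1.0.dist-info/sboms/project.cdx.json", "{}")
    _zf.writestr("demo-1.0.dist-info/sboms/auditwheel/libs.cdx.json", "{}")
    _zf.writestr("demo/__init__.py", "")

with zipfile.ZipFile(_buf) as _zf:
    demo_sboms = find_sbom_files(_zf)
```

Here demo_sboms contains both documents, including the one in the auditwheel subdirectory that no Sbom-File entry would mention.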
Maybe I am. If you could give a worked example, that might help.
My understanding is that you’re saying:
In pyproject.toml, you’d have (say) project.sbom-files = "SBOMs/myproject.sbom". And there would be a SBOMs directory in the source tree containing the myproject.sbom file.
In the sdist, all of the above would remain true, but now PKG-INFO would contain Sbom-File: SBOMs/myproject.sbom as well.
Now, when we build the wheel, the wheel would also contain Sbom-File: SBOMs/myproject.sbom in the METADATA file.
All of that is fine. So are you simply saying that the SBOMs directory in the wheel will be interpreted as relative to .dist-info/sboms, whereas it’s interpreted as relative to the project root in the source tree?
If so, then yes, that satisfies the rules for static and dynamic metadata. But I don’t see how it helps auditwheel or build backends to add SBOM files to the distribution - the metadata is static, and so such additions are disallowed.
The only way with current metadata standards to allow a build backend, or anything else in the wheel building chain, to add metadata to the wheel that’s computed at build time is by declaring that data as dynamic - at which point the user has to supply everything via tool-specific fields. The two relevant bullet points from PEP 621 are:
Build back-ends MUST raise an error if the metadata specifies a field statically as well as being listed in dynamic.
If the metadata does not list a field in dynamic, then a build back-end CANNOT fill in the requisite metadata on behalf of the user (i.e. dynamic is the only way to allow a tool to fill in metadata and the user must opt into the filling in).
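For anyone who finds code easier to read than spec language, those two bullets boil down to something like the sketch below. The function name and the returned labels are made up for illustration; the project argument is assumed to be the [project] table as a dict, for example what tomllib.loads(text)["project"] gives you on Python 3.11+.

```python
def classify_field(project: dict, field: str) -> str:
    """Classify a [project] key according to the two PEP 621 bullets above.

    Hypothetical helper. Returns one of:
      "static"  - value given in pyproject.toml, tools cannot change it
      "dynamic" - listed in dynamic, the backend may fill it in
      "absent"  - neither; the backend cannot add it on the user's behalf
    """
    is_static = field in project
    is_dynamic = field in project.get("dynamic", [])

    if is_static and is_dynamic:
        # First bullet: specifying a field both ways is an error.
        raise ValueError(f"{field!r} is specified statically and listed in dynamic")
    if is_dynamic:
        return "dynamic"
    if is_static:
        return "static"
    # Second bullet: no opt-in, so nothing may be filled in.
    return "absent"
```

Under these rules sbom-files = ["SBOMs/myproject.sbom"] classifies as "static", so neither a backend nor auditwheel gets to append another Sbom-File entry.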
So I’m not sure how your “Approach #2” solves the problem I thought you were trying to solve…?
I think what @sethmlarson is suggesting still keeps all of this true.
What he seems to be saying is that he wants the PEP to have two key features. One, the PEP says all SBOMs are stored in an sboms/ directory in a wheel file. Two, sbom-files in pyproject.toml and core metadata lists static SBOMs. Viewing those two things as distinct and stackable suggests that build back-ends are allowed to add things to the sboms/ directory without having to list them in sbom-files.
Or another way to put it is sbom-files is to help build back-ends know when specific files need to be copied over to sboms/, but it doesn’t preclude other files being put into sboms/ by the back-end either.
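Here is how I picture that reading, as a sketch only. The function name is hypothetical, and placing generated documents under a per-tool subdirectory is just one possible collision-avoidance convention (subdirectories were floated earlier in the thread), not something the PEP specifies.

```python
from pathlib import PurePosixPath


def plan_wheel_sboms(
    dist_info: str,
    static_files: list[str],
    generated: dict[str, list[str]],
) -> tuple[dict[str, bool], list[str]]:
    """Plan where SBOM documents land in a wheel under this interpretation.

    Hypothetical helper. static_files are the paths listed in sbom-files;
    generated maps a tool name to the file names it produced at build time.
    Returns (archive path -> listed in metadata?, Sbom-File lines).
    """
    root = PurePosixPath(dist_info) / "sboms"
    layout: dict[str, bool] = {}

    # Static files keep their relative path so Sbom-File values stay
    # identical between the sdist and the wheel.
    for rel in static_files:
        layout[str(root / rel)] = True

    # Generated files go into a per-tool subdirectory (assumed convention)
    # and are deliberately not recorded as Sbom-File.
    for tool, names in generated.items():
        for name in names:
            dest = str(root / tool / name)
            if dest in layout:
                raise ValueError(f"SBOM path collision: {dest}")
            layout[dest] = False

    metadata = [f"Sbom-File: {rel}" for rel in static_files]
    return layout, metadata
```

The point of the split is that the metadata lines depend only on static_files, so the static/dynamic rules are never touched no matter what ends up in generated.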