PEP 770: Improving measurability of Python packages with Software Bill-of-Materials

Working on a large ecosystem of “pure Python” projects with
enterprise users who may not understand the difference, one thing I
wonder is how such projects can programmatically signal that they
don’t bundle/vendor any additional software so that scanners and
inevitable “plz to be adding SBOM” bug reports will short-circuit.

Except tools that consume documents already have to be able to read multiple formats, and can be updated more frequently than PyPI, so all that PyPI validation will do is ensure that only older formats are allowed.

I’m not going to force you to choose a single SBOM standard[1], but I think if the PyPI verification is that important then we need to fully specify what it’s verifying, and we need the tooling that can verify before uploading (independent from whatever tooling creates the SBOM). We can’t just leave it to a reference to another external spec, especially in an area that’s still innovating.


  1. Our $work tools generate SPDX, so if you choose something else then we’ll have to introduce conversion on the publishing side, even though consumers already have to support conversion on the ingestion side.

Thank you for this PEP. I only want to add a voice in support of it. As a European, I see two rather imposing pieces of legislation being introduced that both directly and indirectly will require use of SBOMs.

First we have NIS2 which “only” indirectly requires it by putting an increased focus on supply chain security, where software suppliers most certainly can and will be regulated with lots of ifs and buts.

And secondly CRA will require SBOMs for any product with “digital elements”. I don’t know how that will play out, but suffice to say… having Python packaging support the use of SBOMs will be a good thing.

I cannot add much more than this, and having read through many of the previous replies… I’m getting imposter syndrome just thinking about it :wink:

I’ve published some changes from the above review. Summarizing the changes:

  • I’ve moved all the details about SBOM content into its own section “SBOM data interoperability”. This entire section is optional for SBOM data producers, PyPI, and indices but attempts to capture some of the things producers can do if they want to have PyPI “check” their data or for their data to work automatically with consuming tools.
  • Added a section about how the PEP itself treats SBOM data opaquely, not requiring a specific SBOM standard, version, or set of data. It forbids indexes from rejecting data that they don’t understand (while allowing them to reject invalid data they do understand). I think this should be enough to discourage “locking-in old standards”?
  • Maintained the “SHOULD” Sbom-File file presence check for PyPI in the core metadata section.
  • Fixed the noted typos, thanks all for finding these.
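
To make the "opaque" rule in the list above concrete, here is a rough sketch of how an index might apply it. The function name and the specific checks are my own assumptions and are not taken from the PEP; the only format-specific facts used are the top-level `bomFormat`/`specVersion` keys of CycloneDX JSON and the `spdxVersion` key of SPDX 2.x JSON.

```python
import json


def index_should_reject(raw: bytes) -> bool:
    """Return True only for data in a recognized format that is invalid.

    Hypothetical policy helper: anything the index does not understand is
    treated as opaque and accepted.
    """
    try:
        doc = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Not UTF-8 JSON: possibly a format we do not know, so do not reject.
        return False
    if not isinstance(doc, dict):
        return False
    if doc.get("bomFormat") == "CycloneDX":
        # Recognized as CycloneDX: apply a deliberately minimal check.
        return not isinstance(doc.get("specVersion"), str)
    if "spdxVersion" in doc:
        # Recognized as SPDX 2.x JSON: the version string starts with "SPDX-".
        return not str(doc["spdxVersion"]).startswith("SPDX-")
    # Unknown JSON document: opaque, so accept it.
    return False
```

A real index would presumably validate recognized formats against their schemas; the point here is only the shape of the decision: unknown means accept, known but invalid means reject.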

Not saying that this won’t happen, but given the status quo is that SBOM generators will generate an SBOM with only the installed packages (not recording bundled software, which is the “phantom dependency” problem) I am doubtful that generators will begin complaining about a lack of “confirmation” that there’s no bundled software.

So if you’re a project which is pure-Python and doesn’t bundle any other software then you shouldn’t have to take any action, existing SBOM generating tools already handle that case correctly. I would like to avoid the need to have projects do something only to “confirm” they have nothing to do if possible.

Thanks @sethmlarson!

Some comments/feedback below, divided by sections.


Another small typo:

Reason: .. must not be used. \\ is an invalid path delimited, / must be used.

Should probably be delimiter, I think.
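
For what it's worth, the two rules in that quoted reason are easy to check mechanically. A minimal sketch, assuming a helper name of my own choosing and covering only those two rules (no `..` components, no backslash delimiters):

```python
def sbom_path_problems(path: str) -> list[str]:
    """Collect violations of the two path rules quoted above.

    Hypothetical helper; the PEP may impose further constraints that are
    not checked here.
    """
    problems = []
    if "\\" in path:
        problems.append("backslash is not a valid path delimiter; use '/'")
    if ".." in path.split("/"):
        problems.append("'..' must not be used")
    return problems
```

A path such as `sboms/foo.cdx.json` yields an empty list, while `../foo.json` and `sboms\foo.json` each yield one problem.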


A semantic question about project.sbom-files: how do these two inclusions interact?

[project]
# both directories contain `foo.json`
sbom-files = ["sboms/*", "more-sboms/*"]

This seems pretty unlikely, but it might be good to have a SHOULD or MUST telling build backends that they should error if they can’t give each SBOM a unique name within .dist-info/sboms.
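
To illustrate the kind of check I mean, here is a sketch under the assumption that a backend copies every matched file straight into .dist-info/sboms by its base name. That flattening behaviour and the function name are hypothetical, used only to show where a clash would come from.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_name_collisions(matched: list[str]) -> dict[str, list[str]]:
    """Group already-expanded glob matches by base name and keep the clashes."""
    by_name: dict[str, list[str]] = defaultdict(list)
    for src in matched:
        by_name[PurePosixPath(src).name].append(src)
    # Only names claimed by more than one source file are a problem.
    return {name: srcs for name, srcs in by_name.items() if len(srcs) > 1}
```

With the two globs above, `find_name_collisions(["sboms/foo.json", "more-sboms/foo.json"])` reports `foo.json` as claimed by both sources, and a backend could raise an error at that point.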


How strong should the JSON prescription be? The "Add sbom-files key" section has a MUST:

Tools MUST assume that SBOM file content is valid UTF-8 encoded JSON, and SHOULD validate this an raise an error for invalid formats and encodings.

…while the interoperability section has a SHOULD:

  • SBOM documents SHOULD use UTF-8-encoded JSON (RFC 8259) when available for the SBOM standard in use.

I think these should be unified, in either direction (but with a slight preference for a MUST).
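
Whichever keyword wins, the check itself is small. A sketch of what "valid UTF-8 encoded JSON" could mean for a tool, using only the standard library (the function name is mine):

```python
import json


def is_utf8_json(raw: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 and parse as JSON."""
    try:
        json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False
    return True
```

One caveat: Python's `json` module is slightly more permissive than RFC 8259 (it accepts `NaN`, for instance), so a strict validator would need extra care.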

This makes a lot of sense to me! I would love it if we could standardize on a single blessed SBOM format, but I agree with @steve.dower’s point about this effectively being a draw/artificial restriction in favor of older format versions.

(Plus, maybe this could be ratcheted down in a future PEP, once SBOMs as an ecosystem stop moving so quickly? I have no strong opinions there.)

PEP 770 specifies that the relative directory structure must be kept, so there won’t be any file name conflict.

If we are not going to mandate a format then it can’t go past a SHOULD when an SBOM format offers multiple file formats.

Maybe, but it will be harder once wheels and such are out there with SBOMs that don’t match some future choice.

Good catch, I’ve removed this line from the PEP in favor of leaning on the interoperability section. Also fixed the typo. See python/peps pull request #4201, “PEP 770: Remove MUST for JSON+UTF-8, fix typo”.

@pf_moore I’ve extended the “How-to-Teach” section to include much more detail about what Project Maintainers, Users, and SCA tool authors will need to know (and how they will come to know this information) as a result of this PEP being accepted.

Both Paul and @steve.dower, you had concerns about how the new metadata field/pyproject field was specified. Do you still have these concerns? I haven’t changed much; in this thread I pointed to PEP 639 as the source of the language I used to specify these things. Happy to get pointed in the right direction.

Also, I’m putting out a last call on keeping the SBOM file opaque. I haven’t heard more specific feedback against this approach, and IMO this is the only open question that needs to be answered before we can proceed. The other question, on “conditional” SBOM files, can be figured out later by implementations, as Paul mentioned elsewhere.

I think there’s still a bit of a conflict on the Sbom-File metadata field between:

The path is located within the project source tree, relative to the project root directory.

And

the .dist-info directory MUST contain an sboms subdirectory which MUST contain the files listed in the Sbom-File fields in the METADATA file at their respective paths relative to the sboms directory

My guess is that the first has been miscopied from a project metadata section into core metadata, because it’s got to be consistent with the second (built and installed distributions don’t ever modify their metadata, so whatever is defined as “core” has to work unmodified here).
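
To show what the second quote implies for a consuming tool, here is a sketch that reads the Sbom-File fields from METADATA and checks that each one exists under the sboms subdirectory. It assumes the field name and layout exactly as quoted from the draft; the function names are my own.

```python
from email.parser import HeaderParser
from pathlib import Path


def declared_sbom_paths(metadata_text: str) -> list[str]:
    """Return every Sbom-File value from core metadata text."""
    return HeaderParser().parsestr(metadata_text).get_all("Sbom-File") or []


def missing_sbom_files(dist_info: Path) -> list[str]:
    """List declared SBOM paths that are absent from <dist-info>/sboms."""
    text = (dist_info / "METADATA").read_text(encoding="utf-8")
    return [
        path
        for path in declared_sbom_paths(text)
        if not (dist_info / "sboms" / path).is_file()
    ]
```

Because the lookup is anchored at the sboms directory, the same field value cannot also be a path relative to the project root unless the two layouts happen to coincide.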

My understanding (and I could be wrong) is that sdist PKG-INFO metadata also does not change, but simply gets rewritten into METADATA. Tools that want to resolve paths in PKG-INFO would need to build the wheel (to ensure the files are in the right place) or use the pyproject.toml fields (which refer to source locations rather than final locations).

But Core Metadata field values ought to be able to be unchanged from start to finish, allowing for them to not be useful until the package is actually installed. The current definition kind of allows that, but only by adding very specific file copies into the source->sdist and sdist->bdist stages. I think it would be better to only define the Core Metadata field in terms of its final installation location and let build tools work backwards from there, while the pyproject.toml field is for source/sdist layout (and tools that operate on those formats).

The paths should carry over. If in PKG-INFO the path is somewhere/stuff/my_sbom.json and that’s where the file is in the sdist then that should end up in .dist-info/sboms/somewhere/stuff/my_sbom.json.
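
In code, the carry-over rule is just a path join. A minimal sketch (the function name and the example dist-info directory name are assumptions for illustration):

```python
from pathlib import PurePosixPath


def wheel_sbom_path(dist_info_dir: str, sbom_path: str) -> str:
    """Map a path recorded in PKG-INFO to its location inside the wheel."""
    return str(PurePosixPath(dist_info_dir) / "sboms" / sbom_path)
```

For example, `wheel_sbom_path("example-1.0.dist-info", "somewhere/stuff/my_sbom.json")` gives `example-1.0.dist-info/sboms/somewhere/stuff/my_sbom.json`. Keeping the directories also means two files with the same base name in different source directories stay distinct.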

Like Steve, I’m still not convinced it’s clear enough. The file names are described as “relative to the root SBOM directory”, but the link in "The root SBOM directory is specified in a later section" is broken. If I assume it is meant to point to the section “SBOM files in project formats”, that section doesn’t actually say that it defines the term “root SBOM directory”.

It’s sort of OK, in the sense that you can work out what it’s trying to say. But you shouldn’t make your reader work that hard to understand your meaning. Furthermore, you should really define terms before you use them. So start with a definition of the term “the root SBOM directory”, and define the metadata item after that.

You should also be explicit that the sbom-files key in pyproject.toml is optional.

Actually, I’d also suggest making it clear up front, in the “Abstract” section, that most projects won’t need SBOMs. If you have no prior experience of SBOMs (like me!), the following paragraph suggests that SBOMs will be needed for most, if not all, projects.

This PEP proposes using SBOM documents included in Python packages as a means to improve software measurability for Python packages.

I’d suggest rewording this something like the following:

While SBOM data can be derived automatically for many projects, when a project includes software components which cannot be identified automatically (for example, vendored non-Python code) this PEP provides a way for the project to include SBOM documents declaring those included components.

That doesn’t read very well - my lack of understanding of the subject matter is showing. Hopefully what I’m trying to say is clear enough, though.

In this case, I just don’t like the design :slight_smile: I want my files to be able to move from where they are in the source tree (which should match where they are in the sdist) to be somewhere more convenient in the built distro.

For example, my source SBOMs might be in ./dev/tooling/security/pregenerated/sbom/2.2/*.json, but I want them to end up in <package>.dist-info/sboms.

Why artificially restrict this, when it’s so easy to just say “paths in PKG-INFO in an sdist may not be accurate, you should read the pyproject.toml in an sdist” like we do for all the other metadata?

Huh? That’s not what we say, is it? It certainly doesn’t sound like something I think we should be saying :slightly_frowning_face:

Having paths in core metadata is something that was newly introduced with license metadata, and I don’t think we considered this aspect of the problem. But the License-File spec seems relatively clear to me - the field defines the path relative to the appropriate root directory. If the layout of the file(s) in the sdist and the wheel isn’t the same, then the metadata field will differ between the sdist and the wheel, and it therefore must be marked as dynamic.
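
As a sketch of what that means for a tool reading an sdist: a path field from PKG-INFO can only be relied on if it is not listed under Dynamic. The function name is mine, and comparing field names case-insensitively is an assumption on my part.

```python
from email.parser import HeaderParser
from typing import Optional


def static_file_paths(pkg_info_text: str, field: str) -> Optional[list[str]]:
    """Return the values of `field`, or None if it is marked Dynamic."""
    msg = HeaderParser().parsestr(pkg_info_text)
    dynamic = {name.strip().lower() for name in msg.get_all("Dynamic") or []}
    if field.lower() in dynamic:
        # The wheel may carry different values, so the sdist cannot answer.
        return None
    return msg.get_all(field) or []
```

A result of `None` tells the caller it has to build the wheel (or consult other sources) to learn the final paths, while a list, even an empty one, can be taken as final.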

As usual, the value of a (non-dynamic) metadata field in the sdist must be used in preference to the value in pyproject.toml if there’s any conflict. Normally, there won’t be, of course, because PEP 621 says “tools CANNOT remove, add or change data that has been statically specified”. But the *-files keys might be considered different, because there’s nothing in the specs (that I can see, at least) which prohibits a build backend from reorganising the specified files. It’s a bit of a grey area (how would the user specify how the files get reorganised?) but I guess it’s possible.

@brettcannon as PEP-delegate for the license information PEP (and for this one!), did any of this come up in the discussions there?

I admit I haven’t looked into implementing the license ones into any of my projects yet, and wasn’t really called on to look at it in detail while it was being developed, so this is really the first time I’m going “does this fit into my build backend’s model” (which, for the most part, puts the user firmly in control, so they can declare whatever “Dynamic” they want and are fine from my POV).

The basic premise pymsbuild works on here is that the _msbuild.py build metadata references files as they appear in the source/sdist, and indicates where they should end up in the wheel/install. So the conversion from source tree to sdist is to export core metadata to PKG-INFO, and package everything up more or less where it sits. Then the conversion from sdist to wheel involves reading PKG-INFO rather than recalculating it, and performing the actual build steps to move files from their source locations into their package locations, including moving/generating files in dist-info.

So in order for the file to end up in dist-info at all, the user has to specify that’s where it should go. But it won’t go there at sdist time, it’ll wait until we’re building a wheel. But at wheel build time we try to avoid modifying core metadata at all because that’s Bad™.

Which basically means that, with this design, the field has to be dynamic metadata, or users have no freedom for their source layout to be different from the install layout - both of which are likely to be constrained by other factors. I’m very anti-constraint in cases like this, so I’ll just advise that it ought to be dynamic and show how to update at wheel time in any examples I write (if/when I get to it).

But it would be much less constraining if the PKG-INFO paths could match the final METADATA values without having to be directly resolvable within the sdist. That way they can be generated statically at sdist time, we can easily ensure that all the wheels for a project have the same metadata, and users aren’t forced to choose between satisfying their team’s code structure requirements, their customers’ SBOM tool requirements, their SBOM generation settings, and complexity in their build scripts.

Nope, this being a concern for anyone is a new one.

OK. To be clear, I don’t have a personal need for any of this. I’m interested because my intent for PEP 643 was that people should always be able to prefer PKG-INFO over pyproject.toml in sdists. And if we don’t clear this up, we’re bound to get someone complaining that some tool “is wrong” because they interpreted the spec differently than the tool did.

For a real use case, a distro packaging tool consuming a sdist should be able to find the license and SBOM by statically reading PKG-INFO. They shouldn’t have to read pyproject.toml, or worse (for example with Steve’s build backend) a backend-specific configuration file, to find those files.

Having looked over PEP 621 and PEP 643 again, I see that this is indeed where they lead.

I guess I’ll just give a recommendation that anyone using pymsbuild should always set those fields to be dynamic. (On the bright side, I can take out my ugly logic for “exporting” metadata from the backend-specific file back into a pyproject.toml, since 643 clearly says nobody should be reading from it again.)

I was thinking about this last night (don’t tell Andrea I was reading DPO just before bed :sweat_smile:), and I realized why this may not have been brought up before in both the SBOM and licensing cases: the main audience isn’t people, it’s laws and tooling. In the license case the key thing is that the license files are there at all, not what their file names or directory structure under licenses are. For SBOMs, we care that tooling can find the files and process their contents, which @sethmlarson has tested and says works on the tools he could get his hands on.

There’s also the complication of file name clashes. Now I’m not aware of any software license that requires a specific file name for the license file, so I don’t think that’s important. But what we can’t say is whether some SBOM standard is going to mandate some specific file name under some specific directory name or something. And in that instance, if a build back-end moved files around and renamed them in the wheel it could break something. Now this is all conjecture that this could happen, but we have all seen worse ideas in standards as well (and I’m sure I’m the author of some of them :wink:).

I think my key point is this feels like bikeshedding (but I could quite possibly be misunderstanding the concern! I have a 9-month-old so I will fully admit my mental faculties are not always available :sweat_smile:). Unless we expect human beings to be reading any of these files regularly and that users will somehow end up with a deep directory structure under sbom/ that is problematic, I’m personally not seeing this as a big concern beyond directory cleanliness.

I wouldn’t expect a build backend to move anything other than on the user’s instructions, so it’d be the user’s fault if it gets broken. I don’t think we should write specs that enforce stuff like this “just in case” the user gets it wrong, though others choose to differ on this philosophical point.

(Deep directory structures are regularly problematic on Windows, so that isn’t a non-issue, but it’s not a major one. Consider this a side point.)

Metadata cleanliness becomes an issue with any generation process other than “generate it in advance and commit it”. Perhaps I check in a metadata file for my SBOM tool, which generates a “full” SBOM based on my internal data (i.e. not available to the sdist installer), and then later trim it for specific platforms when we know the wheel tag. Now the field in pyproject.toml doesn’t match the field in PKG-INFO which doesn’t match the files in METADATA, and on top of that, I have to induce my build backend to include the files in a particular location in the sdist in order to match the metadata I’m eventually going to have in the wheel.

I’d be happier if there wasn’t any particular location mandated and just a requirement that the metadata contain the full path from the root of the archive (and probably a reminder that builds which move or change the SBOM files between sdist->wheel need to mark it as “Dynamic”).

That’s a whole other question of static vs dynamic data. If all that’s guaranteed “static” by the specifications is the file name, then why bother? To be honest, I agree with Brett in one sense - this is all just bikeshedding over details that no-one is really going to have a problem with. But unlike bikeshedding, it’s actually more about understanding the implications of the rules we already have, and deciding if we’re OK with them. There’s no actual decision to be made here - as you said, this is just where PEPs 621 and 643 lead, rather than being anything new in PEP 770.

And in reality, I don’t think anyone really cares that much whether metadata is static, except in certain very performance-sensitive cases like resolvers, where name, version and dependency data need to be available at low cost.

If people are OK with how things are specified in PEP 770, then I don’t see much point debating the details of static vs dynamic. On the other hand, if anyone does want to clarify things in the PEP, then they should be aware that they might end up having to go back and revisit decisions made in PEPs 621 and 643.

Personally, I’m fine with leaving things as they are in PEP 770 on this topic.
