Draft PEP: Adding Vulnerability Data to the Simple API for Package Indexes

Building on PEP 691 and PEP 700, I’ve drafted a PEP that adds the vulnerability data currently present in PyPI’s legacy JSON API to the PEP 691 Simple API.
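Very roughly, the idea is to add a project-level vulnerabilities mapping and a per-file vulnerability-ids list to the PEP 691 JSON project response, something like this (heavily abridged – the PR below has the actual spec, and details may still change):

{
  "name": "django",
  "vulnerabilities": {
    "GHSA-jrh2-hc4r-7jwx": {
      "aliases": ["CVE-2021-45452"],
      "summary": null,
      "details": "...",
      "fixed_in": ["2.2.26", "3.2.11", "4.0.1"]
    }
  },
  "files": [
    {
      "filename": "Django-2.2.25-py3-none-any.whl",
      "url": "...",
      "hashes": {"sha256": "..."},
      "vulnerability-ids": ["GHSA-jrh2-hc4r-7jwx"]
    }
  ]
}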

There is an open PR on my fork for inline comments/suggestions/discussions here: DRAFT: Simple Vulnerabilities API by di · Pull Request #4 · di/peps · GitHub

I’ll leave that open for a while before merging and submitting the result as a PR to the PEPs repo.

3 Likes

How much extra data does this add? Would it be better hosted as a separate file (like the metadata) to avoid bloating the simple API too much? Maybe put the top-level data in a separate file, pointed to by the API, with the release-level IDs still included inline?

1 Like

If projects don’t want their distributions page to become bloated, they could just release code without security flaws :smiley:


Edit: I didn’t realise that vulnerability-ids was mandatory for every file. Not much of a fan of that.

1 Like

Currently there are about 500 projects with at least one vulnerability (out of ~425K projects), so for most responses (99.88%) this would only add the empty vulnerabilities/vulnerability-ids fields. Of those 500 projects, the average number of vulnerabilities is about 6. A single vulnerability usually looks something like this, where details is fairly brief and summary is often empty:

"GHSA-jrh2-hc4r-7jwx": {
  "aliases": [
  "CVE-2021-45452"
  ],
  "summary": null
  "details": "Storage.save in Django 2.2 before 2.2.26, 3.2 before 3.2.11, and 4.0 before 4.0.1 allows directory traversal if crafted filenames are directly passed to it.",
  "fixed_in": [
    "2.2.26",
    "3.2.11",
    "4.0.1"
  ],
}

Given that these responses are already fairly large (because they list every distribution for every release), I’m not too concerned about bloat – the responses for large projects that may have multiple vulnerabilities will be more or less dominated by the large number of releases/distributions they make.

1 Like

Thinking about this some more, I have more serious concerns than the data volumes. (I’m still not happy with the “arbitrary text” nature of the details field, but I guess that ship sailed with the “yanked” field, so I’m willing to let it drop).

First of all, is per-file the right granularity for this data? In the JSON API, the data is at the release level. The more I think about this, the more it feels to me like a bad fit for the simple API.

The only justification for the addition of this data in the draft PEP is “This PEP adds data which were previously only available through the JSON API, in order to allow more clients which were previously Warehouse specific to support arbitrary standards-compliant indexes”. But that’s an entirely generic statement, and it’s in direct conflict with the FAQ from PEP 700 (which you defer to in this PEP) saying that

Proposed additions to the simple API will still be considered on their individual merits, and the requirement that the API should be simple and fast for the primary use case of locating files for a project will remain the overriding consideration.

What are the merits of adding this field specifically to the simple API? Are there consumers which currently use the simple API for most of their data, and fall back to the Warehouse-specific JSON API for vulnerability data? That was the justification for the fields added in PEP 700, and it doesn’t seem to apply here. Also, what indexes other than PyPI are maintaining vulnerability data, and have they expressed a need to expose that data in a standards-compliant way? If the data is only served by PyPI anyway, what’s the urgency for standardising it?

I was deliberately careful when writing PEP 700 to make it clear that there was not a license to add data to the simple API just because it existed in the JSON API. I feel that this proposal undermines that intent, if only by not arguing for the change on its own merits.

In general, I don’t see any advantage to simply moving data out of the JSON API unless there’s a real prospect of retiring that API. And that seems to be a long way off, as it’s going to need PEP 658 to be implemented in Warehouse. So I’d argue that consumers should just continue to use the JSON API until there’s sign of movement on that item of work.

2 Likes

Thanks for the feedback!

I think it’s probably uncommon, but not unreasonable that a vulnerability might exist in, say, the Windows distributions for a given package, but not for other platforms. Making this per-release means that this API could never support that level of granularity (and making sure data is per-file, and not per-release, is something PyPI is generally working towards).
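As a hypothetical illustration of what per-file data can express that per-release data can’t (made-up project and ID), the Windows wheel could carry an ID that the Linux wheel doesn’t:

{
  "files": [
    {
      "filename": "somepackage-1.0-cp311-cp311-win_amd64.whl",
      "vulnerability-ids": ["GHSA-xxxx-xxxx-xxxx"]
    },
    {
      "filename": "somepackage-1.0-cp311-cp311-manylinux_2_17_x86_64.whl",
      "vulnerability-ids": []
    }
  ]
}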

Overcoming the lack of standardization is the primary merit. Other indexes can’t really offer vulnerability data via the JSON API (or really the legacy JSON API at all) and guarantee compatibility, because it’s entirely unstandardized; for the same reason, and because it’s ‘legacy’, clients can’t confidently integrate against it.

Yes, this is what existing auditing tools like pip-audit do: they use the simple API for dependency resolution, and fall back on the unstandardized JSON API for vulnerability data.

No other indexes that I’m aware of (again, because it’s unstandardized). The primary need would be for mirrors/proxies to be able to mirror vulnerability data in a compliant way.

This just seems like a chicken/egg problem to me: if we’re resistant to any effort to standardize parts of that API because it doesn’t look like we’ll retire it, we’re never going to retire it.

I’m not against standardising it, I’m just not convinced it’s a good match for the simple API. There are bound to be some compromises here; creating a whole new (standard) API for nothing but vulnerability data is also not ideal. But PEP 503 and PEP 691 made no suggestion that the simple API was in any way an alternative to the JSON API, and PEP 700 was careful to avoid implying that either.

PEP 700, in particular, only added fields that would be of potential use to installers (or tools that wrap the installation process), the key users of the simple API. And that was deliberate.

If we want to change that position and say that the simple API now is going to be the single API for indexes, can we make that discussion explicit? In particular, I have a chunk of the XMLRPC API (the mirroring protocol) that I use frequently[1], that I would like to see standardised, but I have deliberately avoided suggesting for the simple API, because I expected a different successor for the JSON/XMLRPC APIs. If we put vulnerability data into the simple API simply because it’s “how we standardise index data”, then we’re both overloading the simple API, and making it harder to find any other way to standardise the mirroring API.

This will ultimately be @dstufft’s call, as PEP delegate for Warehouse related issues, but my recommendation would be to define other standard endpoints for what’s left of the JSON and XMLRPC APIs and keep the simple API “simple”.


  1. But not for simple mirroring, so “why not just use existing mirror software” isn’t a helpful answer. ↩︎

1 Like

I agree with you that the Simple API shouldn’t become a catch-all API. However (and I think we’ve had this discussion before), I’m of the opinion that vulnerability data is absolutely useful for installers – in fact, I think they should be the primary use case. Additionally, audit tools (like pip-audit) that aren’t technically installers do currently wrap the installation process to resolve dependencies and report vulnerabilities.

2 Likes

FWIW, as a concrete example of vulnerabilities that impact only some artifacts: Vulnerable OpenSSL included in cryptography wheels · Advisory · pyca/cryptography · GitHub

1 Like

See here. Was there a particular reason you didn’t add a link to that discussion, and in particular that comment, into the PEP and the announcement? Or was it simply that you’d forgotten? Maybe you’re close enough to the problem that it all seems obvious to you, but as an outsider, I would appreciate the background and implications being spelled out explicitly in the PEP.

My reading, specifically, is:

  1. The pip-audit project uses this data to do its job. At the moment it uses the JSON API.
  2. The pip-audit project has aspirations to become a pip subcommand.
  3. One of the assumed preconditions for becoming a pip subcommand is that standardised interfaces should be used, so vulnerability data needs to be made available via such an interface.
  4. Pip uses the simple API as its fundamental access to the index, so with this background, having the vulnerability data available via the simple API would make integration into pip simpler.

All of which is good, if you accept the various assumptions and goals, but shouldn’t be left to the reader to infer.

I’d also like to note that “vulnerability data is available via a standardised interface” is a pretty minor part of the hurdles involved in integrating pip-audit into pip, so point (3) is a rather weak argument here.

I’m still -1 on this, but that’s hardly surprising as I was never a particular fan of the idea of a pip audit command. As you say, we’re going over old ground though, so I encourage readers to read the linked discussion, rather than me repeating the same points here.

Trawling through the pip-audit discussion, I found this:

I had forgotten that previous discussion, and as a result raised this again here. It really isn’t a useful way to spend people’s time, expecting them to hunt out all of this context. It’s the job of the PEP to collect and summarise prior discussions and conclusions in support of the proposal (as well as prior objections, so that they can be addressed in the PEP). While I appreciate the PEP is still just a draft, I think it needs a significant amount of fleshing out before submitting it for public review.

FWIW, while I do share the concerns around the specific nature of how much bloat this might trigger on responses (I had that concern even with PEP 700’s keys), I do think this data is useful for installation tooling eventually.


Putting the big/unbounded part of it (the vulnerabilities key) into a separate file resolves most of the response-size concerns. It costs an extra network request for tools that want the details, while having very little effect on installers that don’t care (similar to how the simple API handles METADATA files). The IDs would still be kept on each file, which lets installers that do care avoid paying that cost when the specific file they’ve picked has no vulnerabilities. IMO, this might be a reasonable avenue for both (a) resolving the response-bloat concern and (b) making it possible to expose the same info on both the HTML and JSON sides.
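To sketch what I mean (the key name and URL here are purely illustrative, not a concrete proposal), the project page would keep only the per-file IDs plus a pointer to a separate document that holds the full entries:

{
  "name": "django",
  "vulnerabilities-url": "https://pypi.org/simple/django/vulnerabilities.json",
  "files": [
    {
      "filename": "Django-2.2.25-py3-none-any.whl",
      "vulnerability-ids": ["GHSA-jrh2-hc4r-7jwx"]
    }
  ]
}

The separate document would then carry the full entries (aliases, details, fixed_in, …) keyed by ID, in the same shape as the example earlier in the thread, and would only be fetched by tools that actually want the details.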

I’ve mentioned this on the PR but I’ll mention it here too: I’d like to see this approach be more generic for “advisories” rather than just “vulnerabilities” – but at least, let’s avoid locking ourselves out of that improvement in the future.

All that said, I’d also be fine with this PEP as-is – just that I feel like we might be able to do better, but I am also aware that perfect is the enemy of good, especially when we’re a distributed group that’s primarily operating on motivated folks driving things. :slight_smile:

And, yea, documenting the missed references and concerns raised in past discussions is something that should be done in the PEP. :slight_smile:

1 Like

I forgot that we had discussed this there as well. Sorry to waste your time – I will add it to the PEP.

2 Likes

These things can happen.

Would providing affected artefact hashes make sense in a vulnerability entry? The OSV schema probably does not have such a field (yet?).

Providing artefact hashes, and then linking them with information on the client side (see the PEP 705 proposal, for example, or lock files), could help with false positives. It might require more maintenance on the PyPI side, though. For example, packages with different wheel tags uploaded after a vulnerability is disclosed would need to be checked against the given vulnerability. Self-hosted patched versions, which will have a different hash, are also worth considering: are they the same packages for which the vulnerabilities were reported?
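To illustrate (the field name here is made up and not part of the OSV schema), a vulnerability entry could carry the hashes of the affected artefacts:

"GHSA-jrh2-hc4r-7jwx": {
  "aliases": ["CVE-2021-45452"],
  "fixed_in": ["2.2.26", "3.2.11", "4.0.1"],
  "affected_artifacts": {
    "sha256": ["<hash of an affected wheel/sdist>", "..."]
  }
}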

Also, mirroring could be much easier if vulnerability information were provided in files living next to the artefacts on PyPI. Consumers would just sync files instead of providing a compliant JSON API implementation; the syncing support could go directly into bandersnatch, for example. If pip considered vulnerabilities during resolution/audit, it could automatically pick up vulnerability information for mirrored indexes as well – users would just point it at a mirrored package index that includes the vulnerability files.

Nevertheless, strictly as a PyPI consumer, the JSON API is a much more straightforward way to get vulnerability information. What about combining these approaches? That is, keep just a listing of vulnerability IDs in the JSON API (which is, in most cases, what client tools would want to consume from the JSON API anyway) and have files with a more detailed description available per vulnerability (and maybe one top-level vulnerability file).
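For illustration only (the key name and URL layout are just a guess at what this could look like), the JSON API entry for a release would then reduce to something like:

{
  "vulnerability-ids": ["GHSA-jrh2-hc4r-7jwx"]
}

while the full details for each ID would live in a per-vulnerability file, e.g. somewhere like https://pypi.org/vulnerabilities/GHSA-jrh2-hc4r-7jwx.json, with the same content as the entries shown earlier in this thread.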