Pre-PEP discussion: Project status markers in the Index APIs

Hello all!

I’m opening this as a follow-up to two earlier threads: “PSA: PyPI now supports project archival” and “Adding a mechanism to deprecate a published project”.

Context

Python packaging has three conceptual sets of “lifecycle” states:

  1. There are classifiers for development status, e.g. Development Status :: 7 - Inactive for an inactive project. These are defined in project metadata, meaning they’re defined per-file, not per-release or per-project, and have no effect on resolution/upload semantics/etc.

  2. There is “yanking” as defined in PEP 592. Yanking is a soft-deletion mechanism that tells installers not to install a version/file unless that version/file has been precisely locked to.

  3. PyPI itself has project statuses, which have both user- and admin-states. For example, PyPI administrators can “quarantine” a project, which has the effect of marking all project releases/files as yanked until further action. Separately, project maintainers can “archive” a project, which does not affect resolution in any way but disables new uploads to the project and signals (currently only in the Web UI) that the project is no longer active.

    • In addition to these states, there are “facts” about a hosted project that are defined by how that project was uploaded or otherwise processed that can’t be encoded in the project metadata at upload time. For example, PyPI knows whether a given file was uploaded via Trusted Publishing, but the index APIs have no way to signal that state.

(1) and (2) are both currently represented in the index APIs (both HTML and JSON), but (3) isn’t. I think we should expose (3) too!
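
For reference, here is a rough sketch of how (2) surfaces to a client today, assuming file entries shaped like the JSON Simple API's `files` items (where `yanked` is either a boolean or a reason string). The `candidate_files` helper and its `pinned` flag are hypothetical simplifications for illustration, not pip's actual logic:

```python
def is_yanked(file_entry: dict) -> bool:
    """A file counts as yanked if its `yanked` key is present and truthy
    (either True or a non-empty reason string)."""
    return bool(file_entry.get("yanked", False))


def candidate_files(files: list, pinned: bool) -> list:
    """Drop yanked files unless the requirement is pinned exactly.

    `pinned` is a stand-in for "the user asked for exactly this version"
    (e.g. `holygrail==1.0`); a real installer derives this from the specifier
    and would also filter by version, which this sketch skips.
    """
    if pinned:
        return list(files)
    return [f for f in files if not is_yanked(f)]


# Made-up file entries for the example project used in this post.
files = [
    {"filename": "holygrail-1.0.tar.gz", "yanked": "broken release"},
    {"filename": "holygrail-0.9.tar.gz", "yanked": False},
]
unpinned = [f["filename"] for f in candidate_files(files, pinned=False)]
pinned = [f["filename"] for f in candidate_files(files, pinned=True)]
print(unpinned)
print(pinned)
```

Nothing comparable exists for (3): there is no field a client could inspect to learn that a project is archived.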

Proposal

The index APIs should have additional fields (or a single, composite field) that allow the index itself to express a project’s status markers.

From a review of PEP 503 and PEP 691 and their living PyPA counterparts, I believe that the “meta” component of the index responses is a good candidate for storing this information. From the living spec:

  • All JSON responses will have a meta key, which contains information related to the response itself, rather than the content of the response.

Ref: Simple Repository API - JSON Serialization

My interpretation of that is that any metadata the index knows about a project is “information related to the response itself,” rather than “content of the response.”

Furthermore, because project statuses and similar are at the project level rather than the release/file levels, it would be confusing and duplicative to express them at the latter levels (which is pretty much all the Index APIs specify).

Concretely, this is roughly what I envision, in both JSON and HTML index forms:

JSON:

{
    "meta": {
        "api-version": "1.3",
        "project-markers": {
            "project-status": "archived",
            "x-trusted-publisher": true
        }
    },
    "name": "holygrail",
    "files": []
}

HTML:

<head>
    <meta name="pypi:repository-version" content="1.3">
    <meta name="pypi:project-markers:project-status" content="archived">
    <meta name="pypi:project-markers:x-trusted-publisher" content="true">
    <title>Links for holygrail</title>
</head>

Or in prose:

  • Both index API formats gain a new meta namespace, project-markers (please suggest a better name!)
  • project-markers is a key-value mapping of project status identifiers to values. An empty mapping or missing mapping has no semantics.
  • There are two kinds of project status markers:
    • Markers that begin with x- are particular to the index that serves them. Their semantics are defined by that index, and mirrors SHOULD NOT copy those markers unless sensible in that mirror’s context.

      I’ve given x-trusted-publisher as an example: an index (like PyPI) may wish to set that to indicate that a project has one or more Trusted Publishers registered to it, and mirrors may or may not wish to re-mirror that state (maybe they do for policy reasons, or maybe they don’t, to avoid implying that the mirror itself supports Trusted Publishing).

    • All other markers have well-defined meanings and values that will be both specified in the PEP and kept updated in the subsequent living PyPA spec. To start, I propose only a single marker: project-status, which will start with only two possible values (archived | quarantined), corresponding to the current lifecycle states known to PyPI.
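
As a minimal sketch of how a client might consume this, assuming the draft `project-markers` layout above (none of these key names are final), the two kinds of markers can be separated by their `x-` prefix. The `split_markers` helper is hypothetical:

```python
import json

# Hypothetical response body following the draft layout in this post.
RESPONSE = """
{
    "meta": {
        "api-version": "1.3",
        "project-markers": {
            "project-status": "archived",
            "x-trusted-publisher": true
        }
    },
    "name": "holygrail",
    "files": []
}
"""


def split_markers(response: dict) -> tuple:
    """Return (standard, index_specific) markers.

    A missing or empty mapping has no semantics, so both results are
    simply empty in that case.
    """
    markers = response.get("meta", {}).get("project-markers") or {}
    standard = {k: v for k, v in markers.items() if not k.startswith("x-")}
    index_specific = {k: v for k, v in markers.items() if k.startswith("x-")}
    return standard, index_specific


standard, index_specific = split_markers(json.loads(RESPONSE))
print(standard)
print(index_specific)
```

A mirror that does not want to re-serve index-specific state could drop the second mapping and pass the first one through unchanged.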

Implications

Project status markers have no direct downstream implications for installing clients: unlike yanking, the presence of a marker itself does not affect resolution.

The goal with placing these markers in the Index APIs is to allow installing clients to eventually (if they so choose) support user control/visibility over project states. For example, a user may want their installation step halted entirely if a project becomes quarantined, or may want a warning report containing the list of archived/inactive projects that they depend on.
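
To illustrate the kind of opt-in behavior I have in mind, here is a small sketch. It assumes the client has already collected a project name to `project-status` mapping from the index; the function name, the exception, and the `halt_on_quarantine` switch are all hypothetical:

```python
from typing import Optional


class QuarantinedProjectError(RuntimeError):
    """Raised when a dependency is quarantined and the user asked to halt."""


def review_statuses(statuses: dict, halt_on_quarantine: bool = True) -> list:
    """Return warnings for archived projects; optionally halt on quarantined ones.

    `statuses` maps project names to the (hypothetical) project-status value,
    or None when the index reported no status at all.
    """
    warnings = []
    for name, status in sorted(statuses.items()):
        if status == "quarantined" and halt_on_quarantine:
            raise QuarantinedProjectError(f"{name} is quarantined; refusing to install")
        if status == "archived":
            warnings.append(f"{name} is archived (no longer active)")
    return warnings


example: dict = {"holygrail": "archived", "spam": None}
for line in review_statuses(example):
    print("warning:", line)
```

Resolution itself is untouched here: the check runs alongside it and only reports or aborts.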

Over time, another implication of this feature is that the Development Status classifier namespace becomes less useful. This is already somewhat true because of its inclusion in the metadata (meaning that projects have to do a new release to change it, and that it’s tied to the release/file cycle rather than top-level project state). However, one potential outcome is that these classifiers could be deprecated and removed over time.

Alternatives considered

An alternative to exposing these states in the Index APIs is to expose them in non-standard APIs instead, e.g. PyPI’s pre-existing JSON APIs. This would allow adoption by installers and other index clients that don’t use the standardized APIs, but would hamper adoption by those that choose to stick with the standard ones (and IMO we should be encouraging standard use as much as possible!).

Open questions

  • Does the layout proposed above make sense? Is it too verbose?
  • Is this proposal too open-ended? In particular, are x- markers a bad idea?
    CC @miketheman @dustin @sethmlarson as parties who I know/suspect will be interested in particular :slightly_smiling_face:

Do other index implementations support archived or quarantined status? I’m wary about standardising features that are PyPI specific here. Conversely, what project level statuses do other index implementations support?

Also, do we have any evidence that there’s an actual use case for this data? The examples of why this is valuable are all theoretical - I’ve seen no feature requests for pip to ignore archived projects yet. OK, it’s early days, but maybe this proposal should be put on hold until there’s evidence of a demand for it?


Not that I’m aware of, but I think these states are conceptually generic across indices (in comparison to Trusted Publishing, which is arguably more tied to PyPI and therefore harder to justify in a standard IMO).

You probably understand the balance here better than I do, but I think it makes sense to standardize things that PyPI adopts first that are not conceptually limited to PyPI. PEP 740 (index attestations) and PEP 691 (the JSON Index API) would be examples – AFAICT no other indices have done PEP 740 (yet) and most third-party indices support PEP 503 instead of PEP 691, but each is conceptually generic across indices and therefore justifies a PEP.

I could do a survey for this, but anecdotally a lot of people have responded positively to the ability to archive their projects (@ncoghlan and @hugovk off the top of my head). I think the logical next step from there is being able to expose that state programmatically. There’s also some discussion/wish-casting on the PyPA Discord about pip warning on archived projects (screencap below).

Separately, I believe there’s been interest in this feature from other packaging tool maintainers/index consumers: @konstin has said that they’d like to see it in the APIs, and I’m aware of a handful of tools that currently do heuristics for “project abandonment” that would prefer to have a concrete signal instead. I’m also personally interested in this being in the API for e.g. pip-audit, since it’d allow us to add an extra dimension to the dependency report :slightly_smiling_face:


Having this flag in the API would give us the opportunity to inform users that they have archived projects in their dependency tree (without changing the resolution itself).

Implementation-wise, I’d bias towards flatter JSON, but either works.


I have the same concerns about PEP 740, to be honest, and I would have been looking for some comment from other index maintainers on that one too if I’d been PEP delegate. But I wasn’t (and I don’t expect to be on this one either) and the status of “index PEPs” is in my view a bit unclear - are they specifications of PyPI’s behaviour for others to follow, or interoperability standards defining how all indexes should behave (with support and buy-in from the community of index implementers)?

The JSON index API is slightly different, in the sense that it’s optional and adds no new capabilities. You can technically “support” PEP 691 by not implementing the JSON API, and a conformant client has to be prepared to fall back to the HTML API. JSON is just easier :slightly_smiling_face:

I didn’t mean the ability to archive. That’s clearly of interest. And I’m not even arguing against exposing the state - that can be done as an index-specific extension without standardisation (although I agree that’s not something we should do lightly, as it makes the transition to a standard harder in the long run). I do think that if other index implementations aren’t expecting to support project lifecycle states, and particularly an “archived” status, we should be asking them why.

I am asking about consumers using this information. So the examples you gave there are useful, and I think they should be expanded in the PEP, and form the core motivation for the proposal.

As far as pip goes, IMO it’s a bit early to be speculating about warning on archived projects. The “quarantined” status is irrelevant to us, as it results in all files being marked as yanked, so we get all we need from the yanked status. But archival is explicitly stated as not affecting resolution, and I’d sort of like to see how projects use the flag before committing to a (potentially noisy) warning for users of pip. We’ve never been asked for a warning on projects with a classifier of “Development Status: Inactive”, so I’m not entirely clear why this is any different. Is it just that project classifiers as metadata don’t actually have much credibility?

For now, let’s simply say that I wouldn’t want pip presented as a compelling use case for this new data…

Overall I’m +0 on this. There’s no huge reason not to do it, but I don’t like the idea that we’re developing the standard index API purely based on PyPI’s capabilities rather than on consensus between index implementers. Why didn’t we simply retain the old PyPI JSON API if that was the intention?


Just to make sure I understand: are you thinking something roughly like this?

"meta": {
  "api-version": "1.3",
  "project-status": "archived"
}

i.e. no intermediating project-markers object? If so I have no particular objections to that; curious what others think (maybe it makes disambiguating future members of meta a little harder, but also maybe that doesn’t matter?).

Yeah, perhaps it’d be good to have a clarifying scope around them – I don’t know! In my PEPs I tend to treat PyPI implicitly as the “main” index, and the “index” PEPs reflect aspects of PyPI that might be particular to PyPI for a period but are not inherently/architecturally so. I think this strikes a decent balance between things that are/aren’t relevant to other index implementors (e.g. I imagine they’d be annoyed if PyPI’s 2FA/other auth changes were put in a PEP). But I honestly don’t know; I’m speculating :sweat_smile:

Thanks, I’ll make sure to include those in the motivating section.

This is more speculation, but I think indices that mirror PyPI will be interested in this information as well: to my understanding it’s somewhat common for large companies to have internal mirrors of PyPI that get updated somewhat regularly, and being able to track project status would be a boon to internal “supply chain” tracking efforts.

I think it’s that, plus the direction of effect: it’s somewhat common for people to want to archive a project after years of inactivity, meaning they don’t necessarily want to publish a final version with an updated project classifier (especially if the project’s build/publish steps have bitrotted significantly). So being able to set the marker at the index level can be a time/complexity-saver.


This is somewhat off-topic, as procedurally I have no involvement in the approval process for package index PEPs, but IMO it would be good if you could reach out to some other index implementations - not just mirrors, but also standalone implementations. Two obvious examples that come to mind are devpi (open source) and Artifactory/JFrog (commercial). I know pip gets a lot of feedback from Artifactory users, so knowing Artifactory’s view on new index PEPs would in general be a good thing. I think Microsoft Azure also have an index implementation, and maybe GitHub do as well?

One other thing - we should be careful not to block future proposals. The versioning of the JSON API is “linear”, so an index can’t support version 1.5 (say) without also supporting 1.4. So if 1.4 is the version that adds project-markers, then it should be done in a way that indexes which don’t support any form of marker can still implement version 1.5. That’s probably as simple as just making everything optional (although that sort of defeats the purpose of having a version in the first place…)
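
To make the “everything optional” point concrete, here is a minimal client-side sketch. It assumes the flat `project-status` key under `meta` floated earlier in the thread (a hypothetical name, not a finalized one) and treats a missing key as “no information”, whatever `api-version` the index advertises:

```python
def parse_api_version(meta: dict) -> tuple:
    """Split an `api-version` string such as "1.5" into (major, minor)."""
    major, minor = meta.get("api-version", "1.0").split(".")
    return int(major), int(minor)


def read_project_status(response: dict):
    """Return the status if the index chose to send one, else None.

    The version is deliberately not consulted: an index on a newer minor
    version may legitimately omit every optional field.
    """
    return response.get("meta", {}).get("project-status")


# Two made-up responses: an older index with a status, a newer one without.
with_status = {"meta": {"api-version": "1.4", "project-status": "archived"}}
without_status = {"meta": {"api-version": "1.5"}}

print(parse_api_version(with_status["meta"]), read_project_status(with_status))
print(parse_api_version(without_status["meta"]), read_project_status(without_status))
```

Under this reading, an index that never implements any marker can still claim 1.5 and a client loses nothing.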


Thanks, I appreciate you raising this. I’ll make efforts to contact each of these (plus Sonatype, etc.?) so they can raise concerns/issue feedback during the PEP process as well. I’m not super familiar with Microsoft’s offerings here, but maybe one of the CPython core devs who works there would know/have an appropriate contact.

Agreed on both counts – in effect everything is really just v1 with options, but I guess the index PEPs have established a precedent of bumping the version when a new set of optionals is added. The PEP that comes from this will definitely not make any markers mandatory!

I’ll make sure they see it. In general, they’re pretty happy to follow PyPI (too happy, in my opinion, when they really ought to be allowing things that PyPI forbids, like arbitrary platform names…).

FTR, the product is Azure Artifacts and it’s part of the free (for public projects) tier of Azure DevOps. We require it at work (it handles proxying to PyPI), so I’m always happy for it to have all the functionality the installers may need.

GitHub has no Python packages support, despite being based on the same codebase as Azure Artifacts, at the PSF’s request as I understand it (I haven’t spoken to the GitHub team, only people on the Python side).

And for what it’s worth, I incline to the direction Paul started in, that this might as well be optional PyPI-specific attributes initially, and let the standardisation result from other indexes copying or creating their own. PEPs aren’t a requirement to have a documented, stable API. They’re just a helpful process for managing multiple stakeholders with conflicting requirements.


I’m traveling this week and can’t get into the details just yet of this post, but I wanted to correct this.
Quarantine doesn’t perform the same as yank - the quarantined files are removed from the index responses and a journal entry is written to signal to mirrors that they can remove if they want to.

Clearing from quarantine puts them back in, and also signals to mirrors that a changelog has happened so they can react if they wish.


My main interest in archiving was from the publisher side, but I’d definitely see it as advantageous for dependency auditing tools to be able to flag when people are still depending on an archived project.

However, from a metadata format point of view, the meta field is NOT the right place to put this info. That field is only for “Read this first” metadata that tells the consuming tools how to read the rest of the metadata.

The only way I could see the meta field being involved is if instead of standardising PyPI’s project states directly, we instead defined the index server equivalent of the tool table we have in pyproject.toml:

  • a meta.api-server key with a string identifying the specific API server implementation
  • a server-extensions key in the main body of the response with any server-specific metadata (such as project states)

(I’m not actually proposing that, it’s just an example of a proposal that would legitimately have a claim to adding a new meta key)
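
Purely as an illustration of that shape, a consumer might read such a response like this. Neither `api-server` nor `server-extensions` exists in any spec, and the `"warehouse"` identifier is made up for the example:

```python
# Hypothetical response: `api-server` and `server-extensions` are not real keys.
response = {
    "meta": {"api-version": "1.3", "api-server": "warehouse"},
    "name": "holygrail",
    "files": [],
    "server-extensions": {"project-status": "archived"},
}


def extensions_for(response: dict, known_server: str) -> dict:
    """Return server-specific metadata only if the server implementation is one
    this client knows how to interpret; otherwise return nothing."""
    if response.get("meta", {}).get("api-server") != known_server:
        return {}
    return response.get("server-extensions", {})


print(extensions_for(response, "warehouse"))
print(extensions_for(response, "some-other-server"))
```

Here `meta` only tells the tool how to read the body, and the server-specific data itself stays outside `meta`.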


Agreed. As this is the project endpoint, project level data should be at the top level:

{
    "meta": {
        "api-version": "1.3"
    },
    "name": "holygrail",
    "status": "archived",
    "x-trusted-publisher": true,
    "files": []
}

<head>
    <meta name="pypi:repository-version" content="1.3">
    <meta name="pypi:project-status" content="archived">
    <meta name="pypi:x-trusted-publisher" content="true">
    <title>Links for holygrail</title>
</head>

There’s no need for all those project- prefixes, or the project-markers grouping. (I left the field as project-status in the HTML version as pypi:status feels a bit too generic. But having special cases like this is not ideal, so maybe that’s a bad idea…)

I’ll also point out that I’m pretty sure non-standard data like x-trusted-publisher is not allowed. Index-specific fields should be prefixed with an underscore, like PyPI’s _last-serial. TBH, I couldn’t find this in the spec, but I’m sure it was the intent to reserve all names that don’t start with an underscore for future standards.
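
A quick sketch of that convention, assuming (as I believe was intended, though I couldn't find it in the spec) that a leading underscore marks an index-specific key. The `partition_keys` helper and the serial number are made up for illustration:

```python
def partition_keys(mapping: dict) -> tuple:
    """Split a mapping into (reserved_for_standards, index_specific) by leading underscore."""
    standard = {k: v for k, v in mapping.items() if not k.startswith("_")}
    index_specific = {k: v for k, v in mapping.items() if k.startswith("_")}
    return standard, index_specific


# `_last-serial` is the PyPI-specific key mentioned above; the value is invented.
meta = {"api-version": "1.3", "_last-serial": 12345}
standard, index_specific = partition_keys(meta)
print(standard)
print(index_specific)
```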

Having said this, I’m not excessively keen on finding that we suddenly have a huge set of unstructured “project level” data. So I’d rather see some common themes captured as groups of data. That’s not what we have here, though. If we cut away everything but the core proposal, this is a PEP to add a single project-level field to the simple API. That seems a little excessive - would we not be better batching things up until we can have a PEP that covers a more substantial change? I don’t want to have the process driving things (in the sense that admin cost is what determines whether we can make a change) but equally I don’t think we should ignore the fact that every PEP is churn in our standards, and we owe it to the users to minimise the level of that churn. It’s important to remember that other index implementations are already slow to keep up with the various standards we have - the more we keep adding new standards, the more likely it is that they will simply start to not bother :slightly_frowning_face:

Perhaps we could keep it as a mapping to avoid the prefixes/unnecessary genericity? Something like:

{
    "meta": {
        "api-version": "1.3"
    },
    "name": "holygrail",
    "markers": {
        "status": "archived",
        "_trusted-publisher": true
    },
    "files": []
}

I think this clearly captures the underlying point (these are all markers of project state, not free-floating pieces of project-level data) and addresses your observation that it’s not ideal to have an open-ended allowance for keys at the project level.

I agree with this pretty strongly when the standards have immediate implications for consumers (e.g. big metadata changes, like the recent stuff with license representations). However, I think in this case both the admin and user costs are pretty low (since these are optional fields that unlock optional behaviors in API consumers). In principle, there should be no downside to indices choosing not to implement them (and we could even make their optionality a strong component of the PEP, e.g. “adding markers is strictly optional and NOT REQUIRED for index API compliance”).

(With that being said, my bias here is of course towards PyPI, since that’s what I have experience with. I’d love to hear from representatives of other index implementations to better understand whether this is burdensome/could be made less burdensome.)

Having package metadata exposed somewhere programmatic like this makes sense to me. I suspect this information will be interesting to many downstream users, either folks with their own internal mirrors or those that are redistributing packages (i.e. Linux distros). In general, having a place that is high-signal and explicit is much, much better than what is “state of the art” today (using “activity” as a proxy for maintenance status).

It would be good to talk to some of these folks to get them to weigh in on whether an exposed project maintenance status would actually be used. I know in the past we’ve received requests for the Trusted Publisher status, but we would prefer exposing the materials to verify for yourself instead of trusting that PyPI is verifying artifacts (and we’ve done that w/ provenance).

The “lifecycle states” are implemented this way in PyPI, but I can imagine an archived package being quarantined. Are we sure these two states are mutually exclusive?

This is 100% true in my case: after the feature was released I immediately archived ~20 projects. I can’t imagine doing “one final upload” to each of these projects to set a classifier.


Perhaps this is off-base, but can we avoid calling these “markers”?

The term already has meaning vis-a-vis dependency specifiers and I think the result will confuse users sooner rather than later.
Not understanding the difference, I would expect users to try to use these fields in dependency declarations and be confused about the results.

(I don’t think it should be confusing. But I think it will be.)

I don’t have strong feelings about the name beyond that. I’d tend towards the utterly generic “project attributes” or something like that.


Great point. I suppose it should really be statuses: list[str] for full generality. That’s not quite as nice of a model but I suppose that’s the cost of reality :slight_smile:
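
A sketch of what that would mean for consumers, assuming a client wants to tolerate both the single-string form from the original post and the list form suggested here. The `normalize_statuses` helper is hypothetical:

```python
def normalize_statuses(value) -> list:
    """Accept None, a single status string, or an iterable of status strings,
    and return a sorted, de-duplicated list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return sorted(set(value))


print(normalize_statuses("archived"))
print(normalize_statuses(["quarantined", "archived"]))
print(normalize_statuses(None))
```

With a list, "archived and quarantined" is representable without inventing a combined status value.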

Not at all, I appreciate you pointing this out!

I don’t have a strong opinion at all about what this gets called :slightly_smiling_face: – attributes sounds fine, although that could also be confusing since metadata are also attributes in a sense. Project flags, project notices, project index states?

@sirosen raised the terminology conflict point I was going to bring up.

Taking a step back from the specifics of bikeshed painting, I want to take another look at that tool table comparison, specifically these aspects of it:

  • PyPI as the central index server maintains additional state regarding projects that other index servers may, or may not, mirror
  • it would be convenient if PyPI could make that information available programmatically without needing to propose standardising the relevant properties across all index servers

Given such a mechanism, PyPI could initially publish server specific project metadata, and then only pursue wider standardisation if other implementations wanted to do more than merely replicate the PyPI fields, or if consuming tools wanted more certainty around the stability and availability of particularly useful fields.

For example:

  • Define a server key in JSON API responses (do we have a compelling reason to offer this new information via HTML?)
  • Partition the server namespace by reverse DNS (so PyPI keys would be under org.pypi)
  • Put any PyPI specific metadata in that new substructure
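
A rough sketch of what reading such a structure could look like. The `server` key, the `org.pypi` namespace, and the `status` field inside it are all hypothetical here, and the reader is written defensively since nothing would guarantee the section's presence or shape:

```python
# Hypothetical response: neither `server` nor its contents are specified anywhere.
response = {
    "meta": {"api-version": "1.3"},
    "name": "holygrail",
    "files": [],
    "server": {"org.pypi": {"status": "archived"}},
}


def server_metadata(response: dict, namespace: str) -> dict:
    """Return the server-specific section for one reverse-DNS namespace.

    Anything missing or malformed is treated as "no data" rather than an error.
    """
    server = response.get("server")
    if not isinstance(server, dict):
        return {}
    section = server.get(namespace)
    return section if isinstance(section, dict) else {}


print(server_metadata(response, "org.pypi").get("status"))
print(server_metadata(response, "com.example"))
```

A mirror could then pass `org.pypi` through, drop it, or add its own namespace alongside it without touching the standard keys.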

With today’s implementation, they are not - if an archived project is quarantined, it will no longer be archived; the project model’s lifecycle_status will change. This was chosen instead of making the table even wider with more statuses and more boolean checks.
There was some conversation on making that an FSM, but that didn’t get implemented yet - if you want to take a crack at it, please do! :wink:

I’ll be watching for any case under which an archived project is subject to quarantine to validate if this case ever happens, and we can figure that out if/when it does.


I like this idea a lot! Going back to your earlier comment: are you thinking this would live under meta i.e. meta.server, or would it live in the “main” part of the response? If I understand the PEP process correctly the latter still requires a PEP (even if the PEP is just “servers can use this key”), while the former might not?

Re: presence in the HTML response: I don’t have a compelling reason. I figured it might make it easier for pip to adopt (should users request it), but IIUC pip also supports PEP 691 so this is a moot point.
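
For what it's worth, reading the HTML form wouldn't be hard either. A minimal sketch using only the standard library, assuming a `pypi:project-status` meta tag like the one sketched upthread (hypothetical; only `pypi:repository-version` exists today):

```python
from html.parser import HTMLParser

# Made-up page head; the project-status tag is the hypothetical part.
PAGE = """
<head>
    <meta name="pypi:repository-version" content="1.3">
    <meta name="pypi:project-status" content="archived">
    <title>Links for holygrail</title>
</head>
"""


class MetaCollector(HTMLParser):
    """Collect <meta name="pypi:..."> tags into a dict keyed by the unprefixed name."""

    def __init__(self):
        super().__init__()
        self.pypi_meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith("pypi:"):
            self.pypi_meta[name[len("pypi:"):]] = attributes.get("content")


collector = MetaCollector()
collector.feed(PAGE)
print(collector.pypi_meta)
```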

I like the approach of having an Index-specific section in an API response that indexes have flexibility in what they return to clients. Agree that it should be optional, not required, and my preference is to not add more top-level keys to the response object than absolutely necessary.
An important aspect of this proposal could be declaring that indexes have the flexibility to evolve this section without a PEP. As @pf_moore calls out, changing standards involves significantly more effort than advertising changes in a subsection, so until we’ve gathered enough experience from PyPI implementation and usage, as well as other index usage, it should probably adopt a stance of “hey, anything could be in, or removed from, this key at any time, write your clients accordingly”.
This feels a little like the Open Source Vulnerability format’s database_specific field. The OSV format is in use over in pypa/advisory-database on GitHub, the advisory database for Python packages published on pypi.org.

This would allow some degree of experimentation within the confines of the existing API, without having to develop another “draft API” secondary endpoint/models/process.

I like the idea of using a reverse DNS namespace. We should acknowledge they aren’t forever-names, as PyPI-the-conceptual-service itself has gone through multiple iterations of DNS names. Using DNS names further ossifies the name into the ecosystem, vs using DNS names as a locator. (e.g. pypi.python.org serves ~50 million redirects a day to pypi.org). I haven’t dug into the whys and wherefores of that.
Yes, I know they are baked into clients, so they are pretty ossified today, but nothing in the API responses currently contains the string as an identifier.
