Pre-PEP discussion: Project status markers in the Index APIs

I’d prefer to go with the key “tool”. It’s arguably less accurate (is an index server a “tool”?) but it reuses the familiar reserved name from pyproject.toml.

Why wouldn’t we offer the data via HTML? I’m concerned that if we continue adding data to the JSON API without it being in the HTML API, we’re deprecating the HTML API by default, which was something PEP 691 explicitly said we weren’t doing. In PEP 700, I added some fields only to the JSON API, but at least I included a justification for that decision in the PEP (a justification I regret now, but that’s a separate point). I’d support deprecation of the HTML API as a PEP in its own right, but while we have it, I think the default should be to include new data in both APIs, and it should be necessary to justify excluding it, rather than needing a compelling reason to include it.

I’m a strong -1 on reverse DNS. Do we really want test.pypi.org to have a different metadata field than pypi.org? Do we want every instance of devpi or Artifactory to use its own namespace? Apart from the obvious redundancy, this would make using the data impossible, as consumers would have no idea what data was available based just on the namespace. Instead, I propose using the server implementation. I don’t want to be as strict as pyproject.toml (we want PyPI to use the key pypi, not warehouse, IMO) but a flat namespace with server implementations claiming keys sensibly[1] seems perfectly workable. Mirrors need special consideration - it’s OK for a mirror to serve data for another implementation as long as that data is identical to the data on the base (so mirrors can’t “override” data locally) - but otherwise only the server implementation that owns the namespace is allowed to publish data in that namespace.

It’s worth stating that this scheme is implementable now, simply by using a _tool namespace rather than tool. Standardising how the “tool” namespace is managed is worthwhile, but ultimately not necessary.
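
As a rough sketch (the key and field names here are purely illustrative, not settled spelling), an index could already serve something like the following today under the existing allowance for underscore-prefixed private keys, and drop the underscore if and when the namespace is standardised:

{
  // "_tool" relies on the existing private-key allowance; "pypi" and
  // "project-status" are placeholder names for this example only
  "_tool": {
    "pypi": {
      "project-status": "archived"
    }
  }
}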

One downside of not standardising the actual fields and their meanings is that I’d be reluctant for pip to warn on use of an “archived” project. Without a formal definition of what “archived” means and how it’s represented in standard metadata, we’d have to tie ourselves to implementation-defined details. And while PyPI is a special case, we thought for years that setuptools was a special case, and getting out of that assumption has been a massive undertaking :roll_eyes:

I’ll note though that as it stands, the pre-PEP here doesn’t actually define what “archived” means, even though it standardises how to represent that status. So I’d still be concerned about warning in pip - we’d be adding semantics that project maintainers might not want when they archive a project[2].


  1. i.e., using the name “everyone knows them by” ↩︎

  2. I could easily see a project wanting archival to mean “OK for existing projects, but don’t use it in new projects”, and a warning doesn’t represent that correctly ↩︎

2 Likes

I’m suggesting a new key in the main part of the response, since this is server-specific information about the same entity as the rest of the response, not information about the response itself[1].

The associated PEP could also still use PyPI’s lifecycle data as a motivating example; the solution would just be more general (and explicitly based on the tool table precedent of allowing things to be standardised after successful implementation-specific experimentation).

Sticking with "server" as the tentative bikeshed colour, that proposal might look something like:

{
  "server": {
    "org.pypi": {
      "lifecycle-status": ["archived"]
    }
  }
}

The PEP would be standardising:

  • the server key
  • the reverse DNS partitioning of the server-specific fields

While it would not be formally standardising the actual lifecycle status flags for PyPI, it’s worth investing some time in making sure it’s at least a plausible design for that purpose (I called it lifecycle-status since it seems less ambiguous than “project-status”, and made it a list so status flags like “archived” and “quarantined” don’t have to be mutually exclusive).


  1. Outside API versioning, the only other meta field I’m aware of is PEP 708’s “tracks” field, which genuinely indicates that the response being provided is meant to align with the project at the tracked URLs. ↩︎

2 Likes

I’m content to include it in both APIs for this PEP, although here is one potentially compelling reason against doing so: if statuses are lists, then we need to define an equivalent HTML representation for their JSON representation. Similarly, if statuses are domain-separated, then we need to define how that maps to a set of one or more <meta> tags. I don’t think either of these is hard to do, but they complicate consumption and implementation – particularly if we end up inventing custom syntax inside a meta HTML attribute.

Thanks for clarifying!


Just to recap, two (slightly) conflicting ideas have been suggested:

  1. If the PEP leaves the actual status flag/field semantics undefined, then isolating statuses by conceptual domain makes sense: one index’s archived is not necessarily another’s, so consuming tools should key on indices they understand. That conceptual domain can be a reverse DNS name or something more symbolic (e.g. pypi/testpypi like Trusted Publishing uses, to avoid the domain assumptions that @miketheman raised).
  2. On the other hand, the PEP could precisely define the status flag/field semantics, in which case (IMO) having statuses isolated by index/domain is less valuable: it’s not clear how a tool should consume conflicting status claims from different indices, or how mirrors should honor/forward statuses if the “upstream” index response contains multiple sets of statuses.

From an implementor’s standpoint, I lean slightly towards (2): a package is always downloaded from a specific index, and its statuses should be whatever that index says (if any). That also leaves us a pre-existing escape hatch for index-specific statuses via _somename: whatever, as pointed out by @pf_moore.

So, iterating on the bikeshed:

// I also like `tool` if people prefer that.
"server": {
  // lifecycle-status is list[LifecycleStatus], where each variant would
  // be defined precisely in the PEP.
  "lifecycle-status": ["archived"],
  // underscore-prefixed, so whatever the server wants (per existing allowances)
  "_trusted-publisher": true
}

For what it’s worth, while the index I maintain at work does not currently support this, I would definitely be interested in marking projects as archived if doing so would emit a warning at install time. I’m hesitant to directly yank/delete these projects’ releases, but would like to communicate that they are unmaintained.

3 Likes

Between mirrors, caching proxies, and supplementary indexes that add extra binary files to upstream releases, the index server being queried is not necessarily the index server that defined a given extension field.

Consider a corporate caching proxy, which might report something like the following:

{
  "server": {
    "org.pypi": {
      "lifecycle-status": ["archived"]
    },
    "com.my-org.pkgs": {
      "review-status": "deprecated",
      "approved-uses": ["internal", "public", "distribution"],
      "maintainers": ["employee@my-org.com"]
    }
  }
}

This example would be for an index server that lets an organisation track an internal review status, whether a package is approved for use internally, in public web services, or in software published for external distribution, and which internal maintainers can be contacted with questions. In this case, the specific package being queried has also been archived on PyPI.

This approach could be particularly useful for supplementary indexes, which don’t currently have a good way to report their own maintainer information in addition to the upstream project information.

Agreed! I think I phrased my point badly: the statuses/server-side metadata can come from a source that isn’t the immediate index being connected to, but at the end of the day everything is intermediated by that index. Because of that, I think it’s nice to have a “flat” structure; otherwise we could end up with representations like this:

"server": {
  "org.pypi": {
    "lifecycle-status": ["archived"]
  },
  "com.my-org.pkgs": {
    "lifecycle-status": ["quarantined"]
  }
}

In the above, it’s unclear what a consuming tool should do: should it merge the two lifecycle-status keys, or select the most “local” one, or something else?

In situations like the above, what I’d propose instead is:

"server": {
  // the index is responsible for merging these
  "lifecycle-status": ["archived", "quarantined"],
  // underscored to emphasize that these are index-specific, but we could also define their semantics in the PEP
  "_review-status": "deprecated",
  "_approved-uses": ["internal", "public", "distribution"],
  "_maintainers": ["employee@my-org.com"]
}

With that, the server being described is always the one being connected to. Clients that want to compare against an “upstream” set of status markers/states could then use the PEP 708 tracks field to see what the originating index says instead. As a full example:

{
  "meta": {
    "api-version": "1.2",
    "tracks": ["https://pypi.org/simple/holygrail/"]
  },
  "server": {
    "lifecycle-status": ["archived", "quarantined"],
    "_maintainers": ["employee@my-org.com"]
  },
  "name": "holygrail",
  "files": [ /* ... */ ]
}

IMO this composes nicely with what’s been already standardized, and avoids a potential source of ambiguity around duplicate/conflicting states in a single response. The downside is that it’s a bit more complex on retrieval (clients need to be PEP 708-aware if they want to go beyond what the mirror/immediate index says).

My interpretation leads to the opposite conclusion: that project status markers should not live in the meta dict. Project statuses are attributes of the project, which is “content of the response”. I think about it from a REST API viewpoint[1]. If I’m looking at the representation of a project using a JSON view, then project-status and x-trusted-publisher are both properties of the project, not of the response.
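
As a rough sketch of that view (reusing the field names from this thread purely for illustration; none of them are standardised):

{
  "meta": {
    "api-version": "1.2"
  },
  "name": "holygrail",
  // attributes of the project itself, sitting alongside name/files
  "project-status": ["archived"],
  "x-trusted-publisher": true,
  "files": []
}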

My programmatic yanking PR makes a similar choice.

GitLab does support PyPI package registries.

Another option is to start to formalize a “Python Index REST API”, which would be optional, and implemented first in PyPI. Alternative indexes wouldn’t need to implement it, but if they wanted to support the same API, there would be a standard they could build to, and a versioning scheme / feature scheme that would provide consistency.
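
Purely as a hypothetical illustration of what such a feature/versioning scheme might advertise (neither the “features” key nor the feature identifier exists in any current spec):

{
  "meta": {
    "api-version": "1.2",
    // hypothetical: optional capabilities this index implements
    "features": ["project-status"]
  }
}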

Agreed, but I would also look ahead to a time when at least some of these properties are standardized. Do you think we’ll only ever have index-specific attributes, or, if you think standardization may eventually happen, where would these properties “graduate” to?


  1. something that I would really like PyPI to build out more robustly ↩︎

2 Likes

This ties in to my concern about different indexes giving different semantics to a concept. It feels very much to me as if we’re trying to define a generic mechanism, but without any real examples that let us understand the constraints. “Project status” as a concept is too vague to help, IMO.

This discussion was originally focused on the “Archived” status on PyPI. I think that’s a good place to start. Is that status intended to be purely a PyPI implementation detail, or do we intend to standardise it somehow (much like we did with yanked releases)? Because if it’s the latter, maybe we should start with a PEP like PEP 592, but for “archived” status. That way we can pin down the semantics of what it means to archive a project, and what clients should do when they are asked to install a project that’s archived. And we’ll have a standard that ensures that all indexes implement archiving the same way.
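
Purely to illustrate what that might look like (not a proposal; the key name and value shape are invented here, loosely modelled on PEP 592’s per-file yanked key but applied at the project level):

{
  "meta": {
    "api-version": "1.2"
  },
  "name": "holygrail",
  // hypothetical: false, or a string giving the reason for archiving
  "archived": "development has moved elsewhere",
  "files": []
}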

Or we can decide that it’s not worth it, and just represent the archived status as an index-specific _archived: <bool> attribute, and keep the standards process out of it.

Conversely, “quarantined” seems like it’s something that doesn’t really fit well with this proposal, as a quarantined project simply isn’t visible in the index (if I understand @miketheman’s response from earlier) so there’s nowhere to even report a status of “quarantined”.

3 Likes

Clients shouldn’t be assuming that two custom server keys published by different servers are related just because they share a field name.

If a server wants to override one of the PyPI fields, it should just override the PyPI field (perhaps with its own metadata saying which fields have been modified). If it instead wants to publish its own field that coincidentally shares a name with a PyPI field, that’s what the namespace partitioning is for.
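
One purely hypothetical way of recording that (the overridden-fields name is invented for this example; the reverse DNS keys follow the earlier examples in this thread):

{
  "server": {
    "org.pypi": {
      // value rewritten by the proxy, not what pypi.org itself reports
      "lifecycle-status": ["archived", "quarantined"]
    },
    "com.my-org.pkgs": {
      // hypothetical provenance marker naming the upstream fields that were modified
      "overridden-fields": ["org.pypi.lifecycle-status"]
    }
  }
}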

It’s the partitioning by (tool) project name that lets each tool project make its own rules for its tool.* metadata table. While it doesn’t necessarily have to be reverse DNS for the index server use case, we would want some comparable separation mechanism here that indicates who was responsible for defining the semantics of the field, rather than where it was actually set when preparing the current response.

For example, a flatter alternative would be to allow arbitrary underscore-prefixed strings in responses, and recommend that server implementations group their custom keys under a single such key:

{
    "_pypi": {
        "lifecycle-status": ["archived"]
    },
    "_myproxy": {
        "review-status": "deprecated",
        "approved-uses": ["internal", "public", "distribution"],
        "maintainers": ["employee@my-org.com"]
    }
}

(the _myproxy string would then be defined by the caching proxy implementation, without necessarily varying per proxy installation)

I think my concern with this mirrors the one that @pf_moore raised: if a PEP here doesn’t actually standardize some of these keys, then it’s essentially a “no-op” from a standards perspective (since the living index standard already says that indices can insert underscore-prefixed keys).

(Which is maybe a good thing!)

Given the above, I’m now less sure that a PEP makes sense than when I started this thread :sweat_smile:. With all of the discussion around namespace separation and the observation that PyPI could just add an underscored key and call it a day, I’d love to hear thoughts from @dustin and @miketheman about the approach they’d prefer from PyPI’s side.

1 Like

Even if you don’t make a standards track PEP for this, it’s possible that an informational PEP along the lines of PEP 394 (The “python” Command on Unix-Like Systems) may make sense.

Such a PEP could:

  1. Provide a pointer to where PyPI documents its server-specific keys
  2. Remind private index server implementations that they can, and in some cases should, permit things that PyPI disallows (such as allowing projects with Private :: ... trove classifiers, dependencies on direct URL references, or @steve.dower’s example of arbitrary platform tags)

Edit: one potential title for such a PEP would just be “Recommendations for Python package index servers”

2 Likes