PEP 753: Uniform project URLs in core metadata

This is a discussion thread for PEP 753: Uniform project URLs in core metadata.

Summary: The PEP proposes deprecating the Home-page and Download-URL headers in favor of well-known labels within the already-standardized, multiple-use Project-URL field.

Draft text here: PEP 753 – Uniform project URLs in core metadata | peps.python.org

Previous, pre-PEP discussion: Core metadata: should `Home-page` and `Download-URL` be deprecated? - #17 by woodruffw

(P.S. thank you @hugovk for reminding me to create this new thread, rather than reusing the pre-PEP one!)

You already incorporated my comments, so +1 from me.

There’s a MUST in the producers section that I think would be clearer if relaxed to a SHOULD (the labelling section it references only contains SHOULD and MAY requirements, so this doesn’t actually affect the meaning).

It’s been two weeks since last activity, so I’m giving this a bump in the hopes of garnering some more feedback before requesting provisional acceptance :slightly_smiling_face: – my ego would love to believe that I can write uncontroversial PEPs, but I’d like to make sure it’s not just because it’s gone under the radar!

(+CC in particular: @sethmlarson @dustin @miketheman)

The PEP seems extremely straightforward and uncontroversial. As a PyPI maintainer and also as a previous Metadata PEP author, it has my full support.

I only have one note:

Package indices SHOULD NOT use the canonicalized labels belonging to the set of well-known labels directly as UI elements (instead replacing them with appropriately capitalized text labels)

This seems to imply that there is some canonical mapping from canonicalized labels to an “appropriately capitalized text label” (e.g. whatsnew -> What's New?) but that’s not included in the PEP. Perhaps that should be included in the “Well-known labels” section?

Thanks for calling that out! My thinking there was that the index might not want to be constrained in how it chooses to render an “appropriate” equivalent of the normalized label (e.g. for internationalization). But I can definitely add a non-normative list of text labels, along with a note that indices should internationalize them/apply them as they see fit :slightly_smiling_face:

Edit: PEP 753: Add suggested human-readable labels by woodruffw · Pull Request #3974 · python/peps · GitHub contains the suggested changes

Thanks for writing up the PEP, it looks good.

A comment about the well-known labels:

In addition to the canonicalization rules above, this PEP proposes a fixed (but extensible) set of “well-known” Project-URL labels, as well as equivalent aliases.

The following table lists these labels, in canonical form:

| Label | Description | Aliases |
| --- | --- | --- |
| homepage | The project’s home page | (none) |
| download | A download URL for the current distribution, equivalent to Download-URL | (none) |
| changelog | The project’s changelog | changes, releasenotes, whatsnew, history |
| documentation | The project’s online documentation | docs |
| issues | The project’s bug tracker | bugs, issue, bug, tracker, report |
| sponsor | Sponsoring information | funding, donate, donation |

And it says packagers and metadata producers should canonicalise them:

Packagers and metadata producers MAY choose to use these well-known labels to communicate specific URL intents to package indices and downstreams.

Packagers and metadata producers SHOULD produce the canonicalized version of the well-known labels in package metadata.

So the canonicalised versions will end up on PyPI, and not the alias the package author first chose.

I’ve made a list of the most-used project_urls in the top 8,000 PyPI projects, and grouped some variants. It shows the chosen canonical label is the most popular for most cases, and the aliases look like good choices.

  • Although funding (121) is much more common than sponsor (8) in joint 4th place.

  • And tracker (623) is more popular than issues (251) in 4th place, although personally I prefer issues :slight_smile: You may consider adding bugtracker (453) and issuetracker (448) as aliases; the existing aliases bug and report don’t show up in the list.

  • Perhaps a new label to include is source (1184) with aliases repository (1027), sourcecode (396), and maybe github (152).

Ah, I think the phrasing I used here is probably too imprecise: I was using “canonicalized” to refer to just the normalization function, and not the mapping of aliases back to their “top-level” label variant. In other words, PyPI may encounter both issues and bugs as “canonicalized” labels, and should render both with the same UI element (currently suggested as “Issue Tracker”).
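A rough sketch of that rendering behavior, in case it helps the discussion. Everything named here is hypothetical (`normalize_label`, `DISPLAY_NAMES`, `display_name`), the normalization rule (strip ASCII punctuation and whitespace, then lowercase) is an assumption about what the PEP's function does, and only two label groups from the draft table are included.

```python
import string

# Assumed normalization: remove ASCII punctuation and whitespace, lowercase.
_STRIP = str.maketrans("", "", string.punctuation + string.whitespace)


def normalize_label(label: str) -> str:
    return label.translate(_STRIP).lower()


# Two groups from the draft table. The display strings are the ones
# suggested in this thread, not a normative list.
DISPLAY_NAMES = {
    **dict.fromkeys(
        ("issues", "bugs", "issue", "bug", "tracker", "report"),
        "Issue Tracker",
    ),
    **dict.fromkeys(
        ("changelog", "changes", "releasenotes", "whatsnew", "history"),
        "Changelog",
    ),
}


def display_name(label: str) -> str:
    """Return the UI text for a label, falling back to the label as given."""
    return DISPLAY_NAMES.get(normalize_label(label), label)


print(display_name("issues"))    # Issue Tracker
print(display_name("Bugs"))      # Issue Tracker
print(display_name("Mastodon"))  # Mastodon
```

The metadata itself is untouched here: both issues and bugs stay as they are, and only the index's lookup treats them alike.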

Do you think I could improve the phrasing to make that clearer, or should the PEP actually restrict the metadata generation more normatively here? My thinking with allowing the aliases through is that it makes this PEP less disruptive and avoids the need for timely action within build backends, but I can see an argument for being stricter and having the index emit warnings.

(I’ll make the other changes you’ve suggested to the labels/aliases – thanks a ton for collecting this data!)

(You’re welcome for the data!)

Yes, I think rather I was imprecise and should have said “normalised” rather than “canonicalised”. For example, the package author may put “Bug tracker” and PyPI ends up showing “Issue tracker”, or they put “history” and PyPI shows “Changelog”.

What happens when a package includes multiple labels of the same category?

For example, Pillow includes both Changelog (a raw list of merged PRs) and Release notes (a human written list of highlights).

Today, they’re listed separately with the same icon (which the PEP isn’t aiming to change).

So that raises three points:

  • the PEP should be clear that the extent of the intended normalisation is removing punctuation and converting to lower case, not replacing aliases with a different value (this will likely be clearest as a new “MUST NOT” requirement rather than rewording the positive requirements, but we should check the PEP does say “normalize” everywhere, and never the stronger “canonicalize”)
  • multiple entries in the same category are permitted if it is useful to refer to multiple resources
  • each alias should be listed with its own human readable equivalent. Perhaps these could be given in parentheses after the normalised form in the updated table?

Thanks! I’ve updated the current PR to use the language of “normalization” throughout (instead of “canonicalization”), and clarify that aliases shouldn’t be transformed between (e.g. GitHub -> github is OK but github -> source is not).
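For concreteness, here is a minimal sketch of normalization in that narrow sense. It assumes "punctuation" means Python's `string.punctuation` and that whitespace is stripped as well; the function name is illustrative, and the PEP text is the authority on the exact rule.

```python
import string

# Assumption: strip ASCII punctuation and whitespace, then lowercase.
_STRIP = str.maketrans("", "", string.punctuation + string.whitespace)


def normalize_label(label: str) -> str:
    """Normalize a Project-URL label. Aliases are never swapped for each other."""
    return label.translate(_STRIP).lower()


print(normalize_label("GitHub"))       # github
print(normalize_label("What's New?"))  # whatsnew
```

Note there is no alias table anywhere in the function: github stays github and is never rewritten to source.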

Yep, per @ncoghlan they’d both continue to be rendered. However, the current proposal makes them both render as “Changelog”, since releasenotes and changelog are currently defined to have the same human-readable equivalent. Given that the two are used in practice to mean different things (sometimes), WDYT about having them be defined with separate table entries?

In that specific case I think separating the table entries makes sense (if people do use both, then the change log/changes/history link should be the comprehensive one and the release notes/what’s new one should be the more user friendly edited one).

In the general case, I’m wondering if publishing tools should be advised to normalise at most one entry in each label category, and leave any additional entries in the same category denormalised for as-supplied display by repository servers. Otherwise there’s no clear way to avoid ambiguity in UI code when converting pattern matched aliases back to a denormalised form.

Agreed, similarly some projects have separate URLs/platforms for defect tracking vs task management vs help requests while others consider all of those to just be “issues” with no clear distinction.

In re-reading the PEP, it makes sense to me specifically for PyPI internals to deprecate specific columns on a database record in favor of a relationship to URLs.

Worth resurfacing Fix how Warehouse stores metadata (per-file, not per-release) · Issue #8090 · pypi/warehouse · GitHub (and associated backfill job) as well - since there’s relevancy to both what we accept during upload, and what we store in the DB, and what we end up displaying to the end-user or emit via APIs - right now it’s “first one uploaded wins”.

But as long as the metadata spec and packaging library do the right thing, we should be able to deprecate those fields on the PyPI side.

It’s been a bit over a week since last activity here, and I think I’ve addressed all of the feedback since then (please correct me if I haven’t).

As such, I’m hereby requesting approval from @pf_moore as PEP delegate :slightly_smiling_face:

I have a couple of points of clarification which don’t appear to be addressed by the PEP, before I make a decision.

  1. Regarding the question about multiple labels of the same category, the PEP still isn’t sufficiently clear here. I’d like to see it stated explicitly that it is allowed for a project to have multiple URL labels that normalise the same, or normalise to aliases that refer to the same category. And that if this happens, indexes MUST/WILL/SHOULD render the labels for the different URLs identically (or is it OK to render “sourcecode” as “Source Code” and “github” as “Source Code - GitHub” in order to disambiguate?)
  2. I don’t see it mentioned anywhere, so maybe this is a new question I’ve just thought of, but it’s very much in line with the other concerns raised about the normalisation process. If I’m creating a project, and I define project URLs for “Issue Tracker” and “Pull Requests”, then these will be normalised in the project metadata to “issuetracker” and “pullrequests”. The former will get displayed by PyPI as “Issue Tracker”, as it’s a well-known alias, but the latter won’t - and worse, my original intended capitalisation and punctuation/spacing is lost, so PyPI has no way to respect my intention for the URL name. This is worse if I have “Issues” and “PRs”, which gets displayed as “Issue Tracker” and “prs”, losing not only the formatting, but also the fact that both URL names have a similar form. IMO, the project author’s intended spelling of the URL name should be preserved in the project metadata, so that it’s at least possible to recover it, if a tool or index should want to.
  3. You state that the list of well-known names will be maintained in the PyPI documentation. I disagree - this is a standard, not restricted to PyPI, and the list should be maintained in the packaging standards area on packaging.python.org, where end users will expect to find it. The PyPI documentation can link to the standard (and document any PyPI-specific variations, if necessary).

Thanks, I’ll clarify this – the intent was that multiple labels can normalize to the same label, but that normalized labels don’t cross alias boundaries (this latter part is already explicit in the PEP). In other words: Source and SOURCE both normalize to source (and both can be present), but GitHub and Github both normalize to github, and multiple can always be present.

The PEP currently has a MAY for the index’s rendering behavior:

Indices MAY choose to use the human-readable equivalents suggested above in their UI elements, if appropriate. Alternatively, indices MAY choose their own appropriate human-readable equivalents for UI elements.

Would you prefer that the language there be more normative, or more specific? For example, I could add an example saying something like “the index MAY choose to augment the human-readable equivalents, e.g. Source Code - GitHub for github.”

Thank you for calling this out! I agree completely; I’m going to amend the PEP to clarify that the backend tool should emit the normalized label form only if that normalized form corresponds to a well-known label.

In other words: Issue Tracker should be normalized to issuetracker, while PRs should not be normalized at all.
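Sketched as code, the amended producer behavior might look like the following. The `WELL_KNOWN` set is a hypothetical subset for illustration (issuetracker and source were only proposed earlier in this thread), and the normalization rule is an assumption; neither is the PEP's final list.

```python
import string

# Assumed normalization: remove ASCII punctuation and whitespace, lowercase.
_STRIP = str.maketrans("", "", string.punctuation + string.whitespace)

# Hypothetical subset of well-known labels and aliases, for illustration only.
WELL_KNOWN = frozenset(
    {"homepage", "download", "changelog", "documentation",
     "issues", "issuetracker", "source", "sponsor"}
)


def normalize_label(label: str) -> str:
    return label.translate(_STRIP).lower()


def emit_label(label: str) -> str:
    """Emit the normalized form only when it is well-known; else keep as written."""
    normalized = normalize_label(label)
    return normalized if normalized in WELL_KNOWN else label


print(emit_label("Issue Tracker"))  # issuetracker
print(emit_label("PRs"))            # PRs
```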

Sounds good, I’ll amend the PEP to reflect that.

I’m still missing something, I think. Or maybe it’s just something that needs to be made more explicit in the PEP.

  1. If I’m understanding you, a user can specify {source = "https://one/thing", SOURCE = "https://another.thing"} in pyproject.toml and the metadata would contain two URLs with the (normalised) label “source”, but with no indication of the fact that they were specified differently in pyproject.toml. Is that right?
  2. When multiple labels in the metadata are the same, or are aliases of each other, indexes may use the same UI element for all of them, or may choose different UI elements. That’s implicit in the fact that the PEP says MAY, but I think some more explicit guidance would help here - is the intention that the URLs are presented as equivalents, or as closely related but different? It matters because a user might want to distinguish between “a zip of the source” and “the project github repo”, and therefore would need to know whether “Source” and “Github” are sufficiently different to make that distinction clear[1].
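To make the first point concrete, a small sketch of how a tool could spot labels that collapse to the same normalized form. The helper names are made up, the normalization rule is assumed (strip punctuation and whitespace, lowercase), and the input stands in for an already-parsed `[project.urls]` table.

```python
import string
from collections import defaultdict

# Assumed normalization: remove ASCII punctuation and whitespace, lowercase.
_STRIP = str.maketrans("", "", string.punctuation + string.whitespace)


def normalize_label(label: str) -> str:
    return label.translate(_STRIP).lower()


def find_collisions(urls: dict[str, str]) -> dict[str, list[str]]:
    """Map each normalized label to the original labels that collapse onto it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for label in urls:
        groups[normalize_label(label)].append(label)
    # Keep only normalized labels shared by more than one original label.
    return {norm: labels for norm, labels in groups.items() if len(labels) > 1}


urls = {"source": "https://one/thing", "SOURCE": "https://another.thing"}
print(find_collisions(urls))  # {'source': ['source', 'SOURCE']}
```

Once only the normalized labels are written to the metadata, the distinction in the right-hand lists is gone, which is the loss being asked about.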

Thanks. I think that’s sufficient. I still have a nagging feeling there may be edge cases, but I can’t articulate why. I think it’s just my ingrained dislike of not recording the user’s stated preference for how to render the values they provide.

One other point. Assuming this PEP gets accepted, it’ll need to be implemented. Clearly PyPI will implement the rendering side of the PEP, but there’s also work to be done in build backends, specifically to stop emitting Home-page and Download-URL metadata, and to translate the relevant inputs into appropriate Project-URL data. There are also consumers like pip show (and uv pip show). How do you anticipate that work happening? Because there’s so little mandatory in the PEP, there’s a significant risk that it just gets mostly ignored by tools. It doesn’t matter that much, but there’s a potential credibility problem if PEPs get approved but then not implemented…
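As a sketch of the backend side of that work, something like the following could fold the legacy inputs into Project-URL entries. The function name, the "Homepage" and "Download" label spellings, and the choice to skip a legacy value when an equivalent label already exists are all assumptions made for illustration, not anything the PEP mandates.

```python
import string

# Assumed normalization: remove ASCII punctuation and whitespace, lowercase.
_STRIP = str.maketrans("", "", string.punctuation + string.whitespace)


def normalize_label(label: str) -> str:
    return label.translate(_STRIP).lower()


def fold_legacy_urls(home_page, download_url, project_urls):
    """Return Project-URL header lines with the legacy fields folded in.

    project_urls is a list of (label, url) pairs in the author's order.
    """
    entries = list(project_urls)
    present = {normalize_label(label) for label, _ in entries}
    # "Homepage" and "Download" are assumed spellings for the new labels.
    for label, url in (("Homepage", home_page), ("Download", download_url)):
        # Skip missing values and labels the author already supplied.
        if url and normalize_label(label) not in present:
            entries.append((label, url))
    return [f"Project-URL: {label}, {url}" for label, url in entries]


for line in fold_legacy_urls(
    "https://example.com", None, [("Docs", "https://example.com/docs")]
):
    print(line)
```

Consumers such as pip show would need the mirror image of this: look for a homepage-like Project-URL label before falling back to Home-page.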


[1] This problem is exacerbated because metadata is immutable, so by the time the developer finds out that PyPI represents the URLs in a way that doesn’t match their intent, it’s too late to change the metadata for that release.

Yep, that’s right (as currently specified). Are you thinking the two should be preserved as-is and that it’s up to the index instead to normalize for subsequent processing? I think that would work just as well and would arguably make the PEP simpler, in terms of removing the normative recommendation that metadata producers perform any transformations at all. OTOH it kicks the can down the road on actually taming the different ways backends can produce valid Project-URL fields :slightly_smiling_face:

Hmm, great point. I can add some guidance language emphasizing that indices probably will want to distinguish between different aliases in terms of visual presentation.

(Thinking more broadly: maybe aliases were the wrong abstraction here? It’s really more of a flat list of well-known labels, each of which has a reasonable human-readable equivalent.)

Understandable :slightly_smiling_face: – going back to my response to (1) above, I think this PEP would be equally “potent” in terms of index behavior if it stipulated normalization at only the index layer, for processing/rendering purposes. That would avoid the need to change the user’s stated preference at the metadata level, while accomplishing the goal at the index level.
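A sketch of what index-layer-only normalization could look like: the metadata keeps the author's label verbatim, and the index normalizes purely to decide how to treat the link. The icon names, the `ICON_FOR` mapping, and the `render` helper are hypothetical, and the normalization rule is assumed.

```python
import string

# Assumed normalization: remove ASCII punctuation and whitespace, lowercase.
_STRIP = str.maketrans("", "", string.punctuation + string.whitespace)

# Hypothetical lookup from normalized label to an icon name.
ICON_FOR = {
    "issues": "bug",
    "bugs": "bug",
    "changelog": "scroll",
    "source": "code",
    "github": "code",
}


def normalize_label(label: str) -> str:
    return label.translate(_STRIP).lower()


def render(label: str, url: str) -> dict[str, str]:
    """Choose an icon from the normalized label but display the author's text."""
    icon = ICON_FOR.get(normalize_label(label), "link")
    return {"icon": icon, "text": label, "href": url}


print(render("Issues", "https://example.com/issues"))
print(render("PRs", "https://example.com/pulls"))
```

With this shape, "Issues" and "PRs" both display exactly as the author wrote them, and only the icon depends on whether the label is recognized.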

In terms of that, I think the question is whether this PEP should be a lateral movement (“make the index more consistent in terms of how it processes URLs”) or an incremental step (“make the packaging ecosystem slightly more consistent/rigid about how it processes metadata”). I think it could go either way, so I’m curious what your preference is as someone who’s more involved in the client side of things.

I’m happy to put that work in, at least for the packaging, metadata, pip, etc. side of things. I’m also happy to liaise with/contribute on the uv side of things, assuming they want my help – maybe @konstin has opinions there?

(This also ties into the above though – if this PEP changes to more of an index-only direction, then clients need to do nothing.)

That’s a fair point. I think that, as long as you’ve considered the options I’m fine with what you’ve chosen - after all, it’s your PEP so you get to decide :slightly_smiling_face:

I’m in favour of incrementally improving the underlying metadata. It’s a bit harder, but improving the foundations pays off much better in the long term, IMO.

Thanks. As long as it’s not forgotten about, that’s the main thing IMO.

I think that uv pip show is currently a limited version that doesn’t include project URLs, so there may well be nothing to do yet on the uv side.

Pip does display project URLs, though, so there will almost certainly be work there, of a similar nature to what PyPI does, in fact. And actually, the fact that pip show pkg renders project URL metadata demonstrates that this PEP was never going to be just an index feature. Client presentation comes up elsewhere as well.

Reading @pfmoore’s questions, I was going to make this same suggestion - having the project’s suggested text rendering in the metadata seems both less ambiguous and genuinely more useful than having the normalised forms. The fact this is the status quo makes the option even more attractive.
