Increasing pip's/PyPI's metadata strictness (summit followup)

At the PyCon packaging minisummit, maintainers discussed:

Pip and PyPI already know, or will soon know, of cases where package metadata is incorrect; how firmly/strictly should metadata correctness be enforced?
There’s general agreement that strictness should be increased; the question is: how quickly and to what extent?

I turned those notes into packaging-problems issue #264 to provide a tracking issue covering the various TODOs necessary to plumb this through the parts of the toolchain. I’m sure I missed stuff; please add things, then I’ll turn bullets into checkboxes and we can make more concentrated progress.

(people mentioned: @pganssle @dstufft @dustin @EWDurbin @crwilcox @ncoghlan @pradyunsg)

If people are looking for opinions, mine are:

  • As strict as possible
  • As soon as possible

:smile:


Agree with @brettcannon - strict and soon.

Also, it would be helpful to have versioning_scheme and changelog_url in package metadata and on PyPI. PyPI should have clearer standards on what needs to be populated before listing.

A consistent framework to store and access this info would create a market signal / expectation that all of it is necessary info…

@surfaceowl is changelog_url really necessary? You can already add arbitrary URLs to the side panel on PyPI as project_urls. What would this be used for?

@agronholm – we already have a number of optional fields on core metadata that are helpful for downstream developers - I think changelogs would be similarly useful.

The main interest is to reduce the friction and work needed to reliably understand the summary of material changes between releases of python packages. Today - everyone does this differently, if at all.

Not everyone uses or reliably adheres to semantic versioning, and most package owners don’t publish a changelog at all. While Python itself and some package owners publish changelogs, some (like pip and Django) call them release notes, some generate their summaries with Sphinx, and others point you to the commit history, which creates a lot of unneeded work if there are no material changes. This problem is compounded when a dev has to update a large number of packages in a project.

Standard metadata for changelogs would:

  1. make it easier to find breaking and major changes
  2. save time when only minor changes - giving devs an easy way to skip digging for info
  3. create a common pathway to find the info - also a timesaver
  4. signal changelogs are important - improving consistency in the python ecosystem.

Relying only on arbitrary URLs, there would be no market signal to devs that changelogs are valuable, and they would not be populated. That seems to be the case today: many authors don’t add arbitrary URLs for changelogs, even though they can.

Using a standard name in the package metadata, while letting devs link to whatever they want (e.g. releases or changelogs), has the benefit of not forcing a particular style/naming choice.
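For reference, a link like this can already be published today via setuptools’ project_urls; something like the following in setup.cfg (the project name and URL here are made up, and “Changelog” is just a conventional key, not anything a standard enforces):

```ini
[metadata]
name = example-package
version = 1.0.0
project_urls =
    Changelog = https://example.com/example-package/changelog
```

The key would show up in the PyPI side panel, but nothing today signals that it should exist, which is the gap being discussed.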


I think that there are two topics intertwined in your reply: the changelog URL and changelog metadata. Are you proposing standardized changelogs in addition to the changelog_url field in packaging metadata? If so, are there other packaging ecosystems doing the same? Which ones?

Also, my 2 cents on the topic: as strict and as soon as possible.

Is the issue of tools (e.g. PEP 517 backends) generating bad METADATA files in the scope of this issue?

The reason I ask is that I’m wondering how pip should best behave in the scenario of this issue (“pip 19.1.1 fails to install pendulum 2.0.4”): https://github.com/pypa/pip/issues/6566
It appears as if the poetry backend is generating a METADATA file using an encoding different from the one setuptools’ pkg_resources uses to read it. So either the METADATA file is bad, or there is a bug somewhere else in the chain. If the former, maybe pip could check the integrity of the metadata slightly earlier in its process and provide a better message to the user (in particular, directing users more toward the true cause).

Also, along these lines, can someone point me to the specification saying how the METADATA file should be encoded? I couldn’t seem to find it, even though I saw that setuptools’ code uses UTF-8 to decode when running under Python 3.
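To illustrate where this bites: METADATA is an RFC 822-style headers file, and the round trip only works if writer and reader agree on the encoding. A minimal sketch, assuming UTF-8 (which setuptools uses on Python 3, but which doesn’t seem to be pinned down in any spec; the field values here are invented):

```python
from email.parser import HeaderParser

# Bytes as a backend might write them to METADATA, UTF-8 encoded
# ("péndulum-example" is a made-up name with a non-ASCII character).
raw = b"Metadata-Version: 2.1\nName: p\xc3\xa9ndulum-example\nVersion: 2.0.4\n"

# A reader that assumes a different encoding (e.g. latin-1 or the
# locale's codec) would either raise UnicodeDecodeError or silently
# mangle the name; decoding as UTF-8 recovers it intact.
text = raw.decode("utf-8")
msg = HeaderParser().parsestr(text)
print(msg["Name"])  # péndulum-example
```

If the spec stated UTF-8 explicitly, a mismatch like the pendulum one would be unambiguously a writer bug rather than a judgment call.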

I agree that we should be as strict as possible, as soon as possible – let’s figure out what that concretely means. :slight_smile:

I think the “strictness” enforcement should be more on the PyPI side, i.e. more actively refusing uploads, and pip should grow better error messaging for cases where we want to refuse metadata.

It’ll probably be best to have a shared implementation of these checks so that it is easier to keep tooling “in sync” on this. Ideally, twine check, pip install and PyPI will be backed by the same implementation for checking validity of the metadata.
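As a sketch of what such a shared check might look like (the function and its rules are illustrative only, not an existing API; a real implementation would lean on the packaging library rather than the simplified version regex here):

```python
import re
from email.parser import HeaderParser

# Fields every core-metadata file must carry; a real checker would
# vary this by Metadata-Version.
REQUIRED = ("Metadata-Version", "Name", "Version")

# Very rough PEP 440-ish pattern -- a real implementation would use
# packaging.version.Version instead of this simplification.
VERSION_RE = re.compile(r"^\d+(\.\d+)*((a|b|rc)\d+)?(\.post\d+)?(\.dev\d+)?$")

def check_metadata(text):
    """Return a list of error strings; an empty list means the metadata passed."""
    msg = HeaderParser().parsestr(text)
    errors = [f"missing required field: {field}"
              for field in REQUIRED if msg[field] is None]
    version = msg["Version"]
    if version is not None and not VERSION_RE.match(version):
        errors.append(f"invalid version: {version!r}")
    return errors

print(check_metadata("Metadata-Version: 2.1\nName: demo\nVersion: 1.0\n"))
print(check_metadata("Name: demo\nVersion: not-a-version\n"))
```

The point is less the specific rules than that twine check, pip install and the PyPI upload endpoint would all call the same function, so a package rejected by one is rejected by all.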


I looked at https://packaging.python.org/specifications/core-metadata/ and I don’t see any statement on the encoding of the file there – I have a feeling that the intent was for it to always be UTF-8, but that never actually made it into the standard.

I think it’ll be a good idea.


@surfaceowl @agronholm the conversation for enhancing the metadata is separate from being more strict in handling current metadata. I think it would be best to start a new thread for it. =)


I couldn’t find any statement either, even when looking at the PEPs themselves. The only relevant thing I could find was in RFC 822, which is referenced by a number of the PEPs. That RFC seems to say that only ASCII characters are allowed. I wonder if the PEPs should be clarified or amended regarding the encoding, and also say how non-ASCII characters are meant to be supported (given that RFC 822 by itself doesn’t seem to support them).