PEP 658: Static Distribution Metadata in the Simple Repository API

This PEP adds a data-dist-info-metadata attribute, which gives Simple Repository API servers a way to expose the METADATA file of a wheel (or of an sdist, if sdists are specified in the future to carry static distribution metadata).

The issue on pypa/warehouse that prompted this PEP also contains some discussion that should be relevant for people interested in designing this.

UPDATE: The rendered version is now available at PEP 658 -- Static Distribution Metadata in the Simple Repository API | Python.org

PEP: 658
Title: Static Distribution Metadata in the Simple Repository API
Author: Tzu-ping Chung <uranusjr@gmail.com>
Sponsor: Brett Cannon <brett@python.org>
PEP-Delegate: Donald Stufft <donald@stufft.io>
Discussions-To: https://discuss.python.org/t/8651
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 10-May-2021
Post-History: 10-May-2021
Resolution:


Abstract
========

This PEP proposes a way to expose the ``METADATA`` file from
distributions in the :pep:`503` "simple" repository API. A
``data-dist-info-metadata`` attribute is introduced on anchor tags to
indicate that the file from a given distribution can be independently
fetched.


Motivation
==========

Package management workflows made popular by recent tooling increase
the need to inspect distribution metadata without intending to install
the distribution, and to download multiple distributions of a project
to choose from based on their metadata. Tools therefore end up
discarding much of the data they download, which is inefficient and
results in a bad user experience.


Rationale
=========

Tools have been exploring methods to reduce the download size by
partially downloading wheels with HTTP range requests. This, however,
adds run-time requirements to the repository server. It also still
incurs overhead, since a separate request is needed to fetch the
wheel's file listing to find the correct offset of the metadata file.
It is therefore desirable to have the server extract the metadata file
in advance and serve it as an independent file, avoiding the
additional requests and ZIP inspection.
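
As a rough illustration of the extraction step (a sketch, not part of
this proposal), a repository could read ``METADATA`` out of a wheel
with Python's ``zipfile`` module. The helper names
``extract_metadata`` and ``build_demo_wheel`` and the ``demo`` project
are hypothetical:

```python
import io
import zipfile


def extract_metadata(wheel_bytes: bytes) -> bytes:
    """Return the raw METADATA file from a wheel's .dist-info directory."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        for name in zf.namelist():
            parts = name.split("/")
            # METADATA sits directly inside the top-level *.dist-info directory.
            if (
                len(parts) == 2
                and parts[0].endswith(".dist-info")
                and parts[1] == "METADATA"
            ):
                return zf.read(name)
    raise ValueError("wheel contains no .dist-info/METADATA file")


def build_demo_wheel() -> bytes:
    """Build a tiny in-memory archive shaped like a wheel (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("demo/__init__.py", "")
        zf.writestr(
            "demo-1.0.dist-info/METADATA",
            "Metadata-Version: 2.1\nName: demo\nVersion: 1.0\n",
        )
    return buf.getvalue()


if __name__ == "__main__":
    print(extract_metadata(build_demo_wheel()).decode("utf-8"))
```

The bytes returned here are what the repository would then serve as a
standalone file.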

The metadata file defined by the Core Metadata Specification
[core-metadata]_ will be served directly by repositories since it
contains the necessary information for common use cases. The metadata
served must be completely static, i.e. identical to the ``METADATA``
file in the ``.dist-info`` directory [dist-info]_ if the distribution
is installed. The repository can provide this for any distribution,
but it is expected to do so only for wheels [wheel]_ at the current
time, since an sdist [sdist]_ does not yet have a way to promise the
metadata will stay the same after it is built.

Since not all distributions have static metadata, an HTML attribute
on the distribution file's anchor link is needed to indicate whether a
client is able to choose the separately served metadata file instead.
The attribute is also used to provide the metadata file's hash, so
clients can verify the file after download. If the attribute is
missing from an anchor link, static metadata is not available for the
distribution, either because of the distribution's content, or lack of
repository support.


Specification
=============

In a simple repository's project page, each anchor tag pointing to a
distribution **MAY** have a ``data-dist-info-metadata`` attribute. The
presence of the attribute indicates the distribution represented by
the anchor tag **MUST** contain a Core Metadata file that will not be
modified when the distribution is processed and/or installed.

If a ``data-dist-info-metadata`` attribute is present, the repository
**MUST** serve the distribution's Core Metadata file alongside the
distribution with a ``.metadata`` appended to the distribution's file
name. For example, the Core Metadata of a distribution served at
``/files/distribution-1.0-py3-none-any.whl`` would be located at
``/files/distribution-1.0-py3-none-any.whl.metadata``. This is similar
to how :pep:`503` specifies the GPG signature file's location.
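
A minimal sketch of how a client might derive the metadata location
from an anchor's ``href``. The function name ``metadata_url`` and the
example host are assumptions, as is the choice to drop the URL
fragment (the ``#<hashname>=<hashvalue>`` fragment from :pep:`503`
describes the distribution file, not the metadata file):

```python
from urllib.parse import urlsplit


def metadata_url(distribution_url: str) -> str:
    """Append ".metadata" to the path component of a distribution URL."""
    parts = urlsplit(distribution_url)
    # Assumption: the fragment is dropped because it hashes the wheel itself.
    return parts._replace(path=parts.path + ".metadata", fragment="").geturl()


if __name__ == "__main__":
    print(metadata_url("https://example.org/files/example-1.0-py3-none-any.whl"))
```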

The repository **SHOULD** provide the hash of the Core Metadata file
as the ``data-dist-info-metadata`` attribute's value using the syntax
``<hashname>=<hashvalue>``, where ``<hashname>`` is the lowercase
name of the hash function used, and ``<hashvalue>`` is the hex-encoded
digest. The repository **MAY** use ``true`` as the attribute's value
if a hash is unavailable.
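
The attribute value could be handled by a client roughly as follows.
This is only a sketch: the function names, the allowlist of hash
functions, and the decision to raise on an unknown hash name are all
assumptions rather than requirements of this PEP:

```python
import hashlib
from typing import Optional, Tuple

# Assumption: a conservative allowlist chosen for this sketch.
SUPPORTED_HASHES = {"sha256", "sha384", "sha512"}


def parse_attribute(value: str) -> Optional[Tuple[str, str]]:
    """Split "<hashname>=<hashvalue>"; return None if no hash is given."""
    name, sep, digest = value.partition("=")
    if not sep or not name or not digest:
        return None  # e.g. the value "true"
    return name, digest


def verify_metadata(content: bytes, attribute_value: str) -> bool:
    """Return True if the content matches the hash, or no hash was given."""
    parsed = parse_attribute(attribute_value)
    if parsed is None:
        return True
    name, expected = parsed
    if name not in SUPPORTED_HASHES:
        raise ValueError(f"unsupported hash function: {name}")
    return hashlib.new(name, content).hexdigest() == expected.lower()
```

A client would call ``verify_metadata`` on the downloaded
``.metadata`` bytes before trusting them.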


Backwards Compatibility
=======================

If an anchor tag lacks the ``data-dist-info-metadata`` attribute,
tools are expected to revert to their current behaviour of downloading
the distribution to inspect the metadata.

Older tools not supporting the new ``data-dist-info-metadata``
attribute are expected to ignore the attribute and maintain their
current behaviour of downloading the distribution to inspect the
metadata. This is similar to how prior ``data-`` attribute additions
expect existing tools to operate.
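
The fallback behaviour can be sketched with the standard library's
``html.parser``. The class and function names and the sample file
names used with them are hypothetical, and a real client would also
need to resolve relative URLs:

```python
from html.parser import HTMLParser
from typing import List, Tuple


class AnchorCollector(HTMLParser):
    """Collect (href, has_metadata) for each anchor tag on a project page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, bool]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href:
            # Presence of the attribute is what matters, whatever its value.
            has_metadata = "data-dist-info-metadata" in attributes
            self.links.append((href, has_metadata))


def plan_fetches(project_page_html: str) -> List[Tuple[str, str]]:
    """Choose, per link, the metadata file or the full distribution."""
    collector = AnchorCollector()
    collector.feed(project_page_html)
    plan = []
    for href, has_metadata in collector.links:
        if has_metadata:
            # Strip any "#hash" fragment before appending the suffix.
            plan.append(("metadata", href.split("#", 1)[0] + ".metadata"))
        else:
            plan.append(("distribution", href))
    return plan
```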


Rejected Ideas
==============

Put metadata content on the project page
----------------------------------------

Since tools generally only need to dependency information from a
distribution in addition to what's already available on the project
page, it was proposed that repositories may directly include the
information on the project page, like the ``data-requires-python``
attribute specified in :pep:`503`.

This approach was abandoned since a distribution may contain
arbitrarily long lists of dependencies (including required and
optional), and it is unclear whether including the information for
every distribution in a project would result in net savings since the
information for most distributions generally ends up unneeded. By
serving the metadata separately, performance can be better estimated
since data usage will be more proportional to the number of
distributions inspected.


Expose more files in the distribution
-------------------------------------

It was proposed to provide the entire ``.dist-info`` directory as a
separate part, instead of only the metadata file. However, serving
multiple files in one entity through HTTP requires re-archiving them
separately after they are extracted from the original distribution
by the repository server, and there are no current use cases for files
other than ``METADATA`` when the distribution itself is not going to
be installed.

It should also be noted that the approach taken here does not
preclude other files from being introduced in the future, whether we
want to serve them together or individually.


Explicitly specify the metadata file's URL on the project page
--------------------------------------------------------------

An early version of this draft proposed putting the metadata file's
URL in the ``data-dist-info-metadata`` attribute. But people feel it
is better for discoverability to require the repository to serve the
metadata file at a determined location instead. The current approach
also has an additional benefit of making the project page smaller.


References
==========

.. [core-metadata] https://packaging.python.org/specifications/core-metadata/

.. [dist-info] https://packaging.python.org/specifications/recording-installed-packages/

.. [wheel] https://packaging.python.org/specifications/binary-distribution-format/

.. [sdist] https://packaging.python.org/specifications/source-distribution-format/


Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:

Can/should we allow/require the use of hash fragments (e.g. #sha256=...) on the URLs so that the metadata file can be verified to be at least as trustworthy as the index (since there’s a specific mention that they could be hosted separately)?

Good call, that’s definitely needed, I’ll add a SHOULD sentence for it.

The PEP is also currently looking for a sponsor to be merged. (I came here to post this originally but wanted to respond to the previous comment first.)

One positive of having the PEP set up a standard location for the METADATA file is that it keeps the /simple/ API responses smaller. I’m not sure offhand if that is a meaningful savings or not; it would be something like 30 bytes per file versus 128+ bytes per file on PyPI (128 bytes is the length of a random requests file URL I just grabbed; other files would have more or less, assuming we store metadata with a similar URL scheme as we do files). Add ~70 bytes to both for a sha256 hash.

I don’t feel super strongly one way or the other, I just wanted to mention that there is at least some benefit to doing a PEP defined location, whether that’s enough of a benefit or not I’ll let @uranusjr decide :slight_smile:

I lean toward a statically specified location for discoverability. It’d make life easier when diagnosing stuff and you need to look up the metadata of a wheel.

Instead of looking into a data key in the simple API and opening that, it’d be a case of appending “.metadata” to the URL and getting there. The former seems like too much work (3 steps), and the latter is much easier to do for a human (1 step).

Agreed. Like @dstufft I don’t have a strong opinion, but what @pradyunsg says makes sense, and I don’t think anyone’s said they have a specific need to store the metadata files elsewhere.

Instead of looking into a data key in the simple API and opening that,
it’d be a case of appending “.metadata” to the URL and getting there.
The former seems like too much work (3 steps), and the latter is much easier
to do for a human (1 step).

IMHO this can (and should) be taken care by the web UI. That being said,
personally I also feel that

I lean toward a statically specified location for discoverability.

But it is for a different reason: the simplicity of setting up
a simple API repository. At the moment, python -m http.server
or the like can serve it. Well, without data-requires-python
already, which kinda bugs me. Full disclosure: I’m working on
a downstream wheel repository on IPFS, whose HTTP gateway
conveniently complies with PEP 503.

How about exposing under a different API?
E.g. /complex/x/x-42-py4-none-any.whl/x-42.dist-info/METADATA

This will allow us to expose other files in the future
without worrying about PEP 503 backward compatibility.

This won’t work since you can’t have both x-42-py4-none-any.whl as a file and a directory. I would definitely be happiest if we can come up with something like this though, the .metadata suffix really bugs me for some reason.

That’s why I mentioned under a different API. If we’re to have
a *.metadata entry, then we need to have something to distinguish it
from the distribution package entries anyway, and I feel that makes
the simple API no longer simple.

There are many levels between http.server and a fully dynamic server. For example, an index can be built by pre-generating index.html (similar to how static site generators work) and serving the directories with nginx. x-42-py4-none-any.whl/x-42.dist-info/METADATA will only work on servers with fully dynamic route resolution, which IMO is going too far.
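
A rough sketch of such a pre-generated index, assuming a flat directory of wheels that is then served as static files. The function name build_static_index and the directory layout are hypothetical, and the sketch fails with StopIteration if a wheel has no METADATA file:

```python
import hashlib
import html
import zipfile
from pathlib import Path


def build_static_index(directory: Path) -> str:
    """Write <wheel>.metadata files and an index.html next to the wheels."""
    anchors = []
    for wheel in sorted(directory.glob("*.whl")):
        with zipfile.ZipFile(wheel) as zf:
            # Locate the METADATA file inside the wheel's .dist-info directory.
            name = next(
                n for n in zf.namelist() if n.endswith(".dist-info/METADATA")
            )
            metadata = zf.read(name)
        # Serve the metadata alongside the wheel with ".metadata" appended.
        (directory / (wheel.name + ".metadata")).write_bytes(metadata)
        digest = hashlib.sha256(metadata).hexdigest()
        file_name = html.escape(wheel.name)
        anchors.append(
            f'<a href="{file_name}" '
            f'data-dist-info-metadata="sha256={digest}">{file_name}</a><br>'
        )
    page = (
        "<!DOCTYPE html>\n<html><body>\n"
        + "\n".join(anchors)
        + "\n</body></html>\n"
    )
    (directory / "index.html").write_text(page, encoding="utf-8")
    return page
```

Everything this produces is a plain file, so a static server needs no route resolution beyond mapping paths to files.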

I’d argue that generating index.html for an extracted tree like
x-42-py4-none-any.whl/x-42.dist-info/METADATA can be simpler
and more deterministic to wrap one’s head around, e.g. the following
shell snippet will extract the wheel and generate an index.html
for every subdirectory:

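#!/bin/sh
# Usage: <script> WHEEL DEST
# Extract WHEEL into DEST, then write an index.html into every directory.
mkdir -p "$2"
unzip "$1" -d "$2"
# Read names line by line so paths containing spaces are not split.
find "$2" -type d | while IFS= read -r d
do
  index=$(mktemp)
  echo "<!DOCTYPE html>" > "$index"
  echo "<html>" >> "$index"
  ls -1 "$d" | while IFS= read -r f
  do
    echo "<a href=\"$f\">$f</a><br>" >> "$index"
  done
  echo "</html>" >> "$index"
  mv "$index" "$d/index.html"
done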

Of course while writing this I realized that it overwrites
any index.html that is part of the distribution, and this is quite
sidetracked from the PEP. Overall I’m happy with where it is going.
I was mainly anticipating that we would have a simpler URI for
the METADATA files instead of what is happening to distributions
on https://files.pythonhosted.org

As for exposing the entire archive’s content, while it seems nice I’m yet
to see any practical use without drastically changing how an installer
works. I even doubt if the long description in the metadata file
is needed in any case, but it’s simpler in the way this PEP is proposing.

I fully agree it is, but the problem is it won’t work, because x-42-py4-none-any.whl already needs to be a file (the wheel itself), so you can’t also put a file at x-42-py4-none-any.whl/x-42.dist-info/METADATA unless you resolve x-42-py4-none-any.whl dynamically, making the URLs impossible to serve with any static HTTP routing logic. I would jump on this scheme in a second if that’s doable.

I think I was unclear, I was proposing to have a separate index
for the metadata (and reserve the simple API solely for distributions),
jokingly prefixed by /complex in the previous post d-;

-1 on multiple pages per project. Being able to get all of the links for a project from a single GET request is a useful property of the existing API, and I don’t think this change justifies losing that.

I’ve pushed a commit to change the metadata file’s location to be pre-determined (using the .metadata suffix; although I’m still not particularly fond of this name).

Also, a reminder that the PEP still lacks a sponsor to be actually proposed.

Would .METADATA be better than .metadata, to more closely match the fact that the file it’s for is METADATA?

And, maybe, does that address the “it’s a weird name” concerns with the name?

I’ll sponsor it. Do make sure to update the CODEOWNERS file for the PEP to list me.

The draft has been merged, and the rendered version is now available at

Under Rejected Ideas, I think there’s a missing word in the sentence starting “Since tools generally only need to dependency information…”, as it doesn’t make sense currently.