Draft PEP: Recording provenance of installed packages

Hi all,

Based on the discussions in "pip installation reports" and "Pre-PEP: Recording provenance of installed packages", I am sharing a draft PEP that proposes storing the provenance of installed packages when packages are consumed from a Python package index (complementing direct_url.json as specified in PEP 610). The draft is accompanied by changes to pip that demonstrate the proposed functionality.

Please review the proposal (any comments are welcome). If there are no objections, I plan to submit this as a PEP on March 27th.

Thank you.

Draft PEP
PEP: 9999
Title: Recording provenance of installed packages
Author: Fridolín Pokorný <fridolin.pokorny at gmail.com>
Sponsor: Donald Stufft <donald@stufft.io>
PEP-Delegate: Paul Moore <p.f.moore@gmail.com>
Discussions-To: https://discuss.python.org/t/pep-705-recording-provenance-of-installed-packages/23340
Status: Draft
Type: Process
Content-Type: text/x-rst
Created: 09-Mar-2023
Post-History:

Abstract
========

This PEP describes a way to record the provenance of installed Python
distributions.  The record is created by an installer and is available to users
in the form of a JSON file, ``provenance_url.json``, in the ``.dist-info``
directory. The JSON file captures additional metadata that records the URL of a
Python distribution together with the hash of the installed distribution. This
proposal is built on top of :pep:`610`, following `its corresponding canonical
PyPA spec
<https://packaging.python.org/en/latest/specifications/direct-url/>`__, and
complements ``direct_url.json`` with a ``provenance_url.json`` file for packages
that are identified by a name and, optionally, a version.

Motivation
==========

Installing a Python package involves downloading a distribution from an index
and extracting its content to an appropriate place. After the installation
process is done, information about the distribution used as well as its source
is generally lost. Nevertheless, there are use cases for keeping records of
distributions used for installing packages and their provenance.

Python wheels can be built with different compiler flags or supporting
different wheel tags.  In both cases, users might get into a situation in which
multiple wheels might be considered by installers (possibly from different
package indexes) and immediately finding out which wheel file was actually used
during the installation might be helpful. This way, developers can use
information about wheels to debug issues making sure the desired wheel
was actually installed. Another use case is tools that report installed
software, such as SBOM (Software Bill of Materials) generators, which could
give more accurate reports.

The motivation described in this PEP is an extension to :pep:`610`.  Besides
stating information about packages installed using a direct URL, installers SHOULD
record information also for packages installed from Python package indexes when
identified by their name, and optionally their version.

Specification
=============

The ``provenance_url.json`` file SHOULD be created in the ``*.dist-info``
directory by installers when installing a distribution identified by its
name, and optionally its version specifier.

This file MUST NOT be created when installing a distribution from a requirement
specifying a direct URL reference (including a VCS URL).

At most one of the ``provenance_url.json`` and ``direct_url.json`` (from
:pep:`610`) files MAY be present in the ``*.dist-info`` directory.
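
As a non-normative illustration, a consumer could tell the two cases apart by
checking which file exists in a ``*.dist-info`` directory. The sketch below is
only an assumption about how such a check might look; the function name
``provenance_kind`` and its return values are hypothetical and not part of this
PEP:

.. code-block:: python

  from pathlib import Path


  def provenance_kind(dist_info):
      """Report which provenance file a ``*.dist-info`` directory carries."""
      dist_info = Path(dist_info)
      has_direct = (dist_info / "direct_url.json").is_file()
      has_provenance = (dist_info / "provenance_url.json").is_file()
      if has_direct and has_provenance:
          # At most one of the two files may be present.
          raise ValueError(f"both files present in {dist_info}")
      if has_direct:
          return "direct_url"
      if has_provenance:
          return "provenance_url"
      # Neither file: e.g. an installer that does not record provenance.
      return None

Returning ``None`` rather than raising for the "neither file" case reflects
that creating ``provenance_url.json`` is a SHOULD, so consumers need to
tolerate its absence.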

The ``provenance_url.json`` JSON file MUST be a dictionary, compliant with
:rfc:`8259` and UTF-8 encoded.

If present, it MUST contain exactly two keys. The first one is ``url``, with
type ``string``.  The second key MUST be ``archive_info`` with a value defined
below.

The ``url`` field MUST state a URL to the installed distribution. If a wheel is
built from a source distribution, the ``url`` field MUST point to the used
source distribution. On the other hand, when a wheel is installed, the
``url`` field MUST keep a URL of the installed wheel. Following :pep:`610`, the
``url`` field MUST be stripped of any sensitive authentication information, for
security reasons.

The user:password section of the URL MAY however be composed of environment
variables, matching the following regular expression::

    \$\{[A-Za-z0-9-_]+\}(:\$\{[A-Za-z0-9-_]+\})?

Additionally, the user:password section of the URL MAY be a well-known,
non-security sensitive string. A typical example is ``git`` in the case of a
URL such as ``ssh://git@gitlab.com``.
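
The two allowances above could be checked roughly as follows. This is a
non-normative sketch: the helper name ``userinfo_is_safe`` and the
``WELL_KNOWN_USERS`` set are assumptions made for illustration, and which
strings count as "well-known" is left to the installer:

.. code-block:: python

  import re
  from urllib.parse import urlsplit

  # The regular expression from this section, applied to the whole userinfo part.
  ENV_VAR_RE = re.compile(r"\$\{[A-Za-z0-9-_]+\}(:\$\{[A-Za-z0-9-_]+\})?")

  # Assumption: user names regarded as non-sensitive by this sketch.
  WELL_KNOWN_USERS = {"git"}


  def userinfo_is_safe(url):
      """Return True if the URL carries no sensitive user:password section."""
      netloc = urlsplit(url).netloc
      if "@" not in netloc:
          return True  # no user:password section at all
      userinfo = netloc.rpartition("@")[0]
      if userinfo in WELL_KNOWN_USERS:
          return True
      # Only environment variable placeholders such as ${USER}:${PASSWORD}.
      return ENV_VAR_RE.fullmatch(userinfo) is not None

For example, ``https://${USER}:${PASSWORD}@example.com/simple/`` and
``ssh://git@gitlab.com`` pass this check, while a URL with a literal password
does not and would need its credentials stripped before being recorded.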

The value of ``archive_info`` MUST be a dictionary with a single key
``hashes``.  The ``hashes`` key is a dictionary mapping a hash name to a
hex-encoded digest of the file referenced by the ``url`` field. Multiple hashes
can be included, and it is up to the consumer to decide what to do with
multiple hashes (it may validate all of them or a subset of them, or nothing at
all).
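
To make the layout concrete, the following non-normative sketch builds the
content of a ``provenance_url.json`` file from the bytes of a downloaded
archive. The function name ``build_provenance``, the URL, and the placeholder
bytes are assumptions for illustration only; a real installer would hash the
file it actually downloaded:

.. code-block:: python

  import hashlib
  import json


  def build_provenance(url, archive, algorithms=("sha256",)):
      """Return the ``provenance_url.json`` content for an archive given as bytes."""
      hashes = {name: hashlib.new(name, archive).hexdigest() for name in algorithms}
      return {"archive_info": {"hashes": hashes}, "url": url}


  record = build_provenance(
      "https://example.com/packages/app-1.0-py3-none-any.whl",  # hypothetical URL
      b"placeholder wheel content",  # stands in for the downloaded file
      algorithms=("sha256", "blake2s"),
  )
  # RFC 8259 compliant text, ready to be written UTF-8 encoded.
  print(json.dumps(record, indent=2, sort_keys=True))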

Each hash MUST be one of the single argument hashes provided by
``hashlib.algorithms_guaranteed``, except for the ``sha1`` and ``md5`` hashes.
At the time of writing this PEP, the only multi-argument hashes in that set,
and therefore also excluded, are ``shake_128`` and ``shake_256``:

.. code-block:: python

  >>> import hashlib
  >>> sorted(hashlib.algorithms_guaranteed - {"shake_128", "shake_256", "sha1", "md5"})
  ['blake2b', 'blake2s', 'sha224', 'sha256', 'sha384', 'sha3_224', 'sha3_256', 'sha3_384', 'sha3_512', 'sha512']

Each hash MUST be referenced by the canonical name of the hash, always lower case.

Hashes ``sha1`` and ``md5`` MUST NOT be present, respecting security
limitations of these hash algorithms. On the other hand, hash ``sha256`` SHOULD
be included.
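
A consumer or installer might enforce these naming rules along the following
lines. This is a sketch under assumptions, not a required implementation: the
names ``ALLOWED_HASHES`` and ``check_hash_names`` are hypothetical, and the
split into errors (MUST violations) and warnings (SHOULD violations) is one
possible design:

.. code-block:: python

  import hashlib

  ALLOWED_HASHES = frozenset(hashlib.algorithms_guaranteed) - {
      "shake_128",
      "shake_256",
      "sha1",
      "md5",
  }


  def check_hash_names(hashes):
      """Return ``(errors, warnings)`` for the keys of a ``hashes`` dictionary."""
      errors = [
          f"hash name not allowed: {name!r}"
          for name in sorted(hashes)
          if name not in ALLOWED_HASHES  # also rejects non-canonical spellings
      ]
      warnings = []
      if "sha256" not in hashes:
          # Including sha256 is a SHOULD, so this is advice rather than an error.
          warnings.append("sha256 is recommended but missing")
      return errors, warnings

Because the comparison is case-sensitive, a key such as ``SHA-256`` is reported
as an error even though it names an otherwise acceptable algorithm.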

Installers that cache installed distributions from an index SHOULD keep
information related to the cached distribution, so that the
``provenance_url.json`` file can be created even when installing distributions
from the installer's cache.

Examples
========

Examples of a valid provenance_url.json
---------------------------------------

A valid ``provenance_url.json`` stating multiple hashes:

.. code:: json

  {
    "archive_info": {
      "hashes": {
        "blake2s": "fffeaf3d0bd71dc960ca2113af890a2f2198f2466f8cd58ce4b77c1fc54601ff",
        "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f",
        "sha3_256": "c856930e0f707266d30e5b48c667a843d45e79bb30473c464e92dfa158285eab",
        "sha512": "6bad5536c30a0b2d5905318a1592948929fbac9baf3bcf2e7faeaf90f445f82bc2b656d0a89070d8a6a9395761f4793c83187bd640c64b2656a112b5be41f73d"
      }
    },
    "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
  }

A valid ``provenance_url.json`` stating a single hash entry:

.. code:: json

  {
    "archive_info": {
      "hashes": {
        "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f"
      }
    },
    "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
  }

A valid ``provenance_url.json`` stating a source distribution which was used to
build and install a wheel:

.. code:: json

  {
    "archive_info": {
      "hashes": {
        "sha256": "8bfe29f17c10e2f2e619de8033a07a224058d96b3bfe2ed61777596f7ffd7fa9"
      }
    },
    "url": "https://files.pythonhosted.org/packages/1d/43/ad8ae671de795ec2eafd86515ef9842ab68455009d864c058d0c3dcf680d/micropipenv-0.0.1.tar.gz"
  }

Examples of an invalid provenance_url.json
------------------------------------------

The following example includes the ``hash`` key in the ``archive_info``
dictionary as originally designed in :pep:`610` and the data structure
documented in [3]_. The ``hash`` key MUST NOT be present, to prevent any
possible confusion with ``hashes`` and to avoid the additional checks that
would be required to keep the hash values in sync.

.. code:: json

  {
    "archive_info": {
      "hash": "sha256=236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f",
      "hashes": {
        "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f"
      }
    },
    "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
  }

Another example demonstrates an invalid hash name. The referenced hash does not
correspond to the canonical hash name described in this PEP and the `Python docs
<https://docs.python.org/3/library/hashlib.html#hashlib.hash.name>`__.

.. code:: json

  {
    "archive_info": {
      "hashes": {
        "SHA-256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f"
      }
    },
    "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
  }
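
The key layout that the first invalid example violates could be checked with a
short, non-normative sketch such as the one below. The function name
``structural_problems`` and the wording of the messages are hypothetical; the
sketch looks only at the structure and leaves hash names and URL contents to
separate checks:

.. code-block:: python

  def structural_problems(document):
      """List violations of the key layout required for ``provenance_url.json``."""
      if not isinstance(document, dict):
          return ["top-level value is not a dictionary"]
      problems = []
      if set(document) != {"url", "archive_info"}:
          problems.append("expected exactly the keys 'url' and 'archive_info'")
      if not isinstance(document.get("url"), str):
          problems.append("'url' is not a string")
      archive_info = document.get("archive_info")
      if not isinstance(archive_info, dict) or set(archive_info) != {"hashes"}:
          # Catches, among others, a ``hash`` key next to ``hashes``.
          problems.append(
              "'archive_info' must be a dictionary with the single key 'hashes'"
          )
      elif not isinstance(archive_info["hashes"], dict):
          problems.append("'hashes' is not a dictionary")
      return problems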


Example pip commands and their effect on provenance_url.json and direct_url.json
--------------------------------------------------------------------------------

Commands that generate a ``direct_url.json`` file but do not generate a
``provenance_url.json`` file. These examples follow examples from :pep:`610`:

* ``pip install https://example.com/app-1.0.tgz``
* ``pip install https://example.com/app-1.0.whl``
* ``pip install "git+https://example.com/repo/app.git#egg=app&subdirectory=setup"``
* ``pip install ./app``
* ``pip install file:///home/user/app``
* ``pip install --editable "git+https://example.com/repo/app.git#egg=app&subdirectory=setup"`` (in which case, ``url`` will be the local directory where the git repository has been cloned to, and ``dir_info`` will be present with ``"editable": true`` and no ``vcs_info`` will be set)
* ``pip install -e ./app``

Commands that generate a ``provenance_url.json`` file but do not generate a
``direct_url.json`` file:

* ``pip install app``
* ``pip install app~=2.2.0``
* ``pip install app --no-index --find-links "https://example.com/"``

This behaviour can be tested using changes to pip introduced in [1]_.

Rejected Ideas
==============

Naming the file direct_url.json instead of provenance_url.json
--------------------------------------------------------------

To preserve backwards compatibility with :pep:`610`, the file cannot be named
``direct_url.json`` (from :pep:`610`):

  This file MUST NOT be created when installing a distribution from an other
  type of requirement (i.e. name plus version specifier).

The change might introduce backwards compatibility issues for consumers of
``direct_url.json`` who rely on its presence only when distributions are
installed using a direct URL reference.

Deprecate direct_url.json and use only provenance_url.json
----------------------------------------------------------

File ``direct_url.json`` is already well established in :pep:`610` and is
already used by installers. For example, ``pip`` uses ``direct_url.json`` to
report a direct URL reference on ``pip freeze``. Deprecating
``direct_url.json`` would require additional changes to the ``pip freeze``
implementation in pip (see [2]_) and could introduce backwards compatibility
issues for already existing ``direct_url.json`` consumers.

Keeping the hash key in the archive_info dictionary
---------------------------------------------------

:pep:`610` and `its corresponding canonical PyPA spec
<https://packaging.python.org/en/latest/specifications/direct-url/>`__ discuss
the possibility to state ``hash`` key alongside the ``hashes`` key in the
``archive_info`` dictionary. This PEP explicitly discards the ``hash`` key in
the ``provenance_url.json`` file and expects only ``hashes`` key to be present.
By doing so we eliminate possible redundancy in the file, possible confusion,
and any additional checks that would need to be done to make sure hashes are in
sync.

Making the hashes field optional
--------------------------------

:pep:`610` and `its corresponding canonical PyPA spec
<https://packaging.python.org/en/latest/specifications/direct-url/>`__
recommend stating the ``hashes`` field of the ``archive_info`` in the
``direct_url.json`` file but allow omitting it under certain circumstances,
following :rfc:`2119`:

  A hashes key SHOULD be present as a dictionary mapping a hash name to a hex
  encoded digest of the file.

This PEP enforces availability of the ``hashes`` field of the ``archive_info``
in the ``provenance_url.json`` file if ``provenance_url.json`` file is created:

  The value of ``archive_info`` MUST be a dictionary with a single key
  ``hashes``.

By doing so, consumers of the ``provenance_url.json`` file can always check
artifact digests when the file has been created by an installer.
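
Such a check could look like the following non-normative sketch. The function
name ``verify_archive`` is hypothetical, and the sketch assumes the hash names
have already been validated against the allowed set, since ``hashlib.new`` is
called with each recorded name:

.. code-block:: python

  import hashlib


  def verify_archive(archive, provenance):
      """Compare archive bytes with every digest recorded in the provenance data."""
      recorded = provenance["archive_info"]["hashes"]
      mismatched = [
          name
          for name, expected in recorded.items()
          if hashlib.new(name, archive).hexdigest() != expected
      ]
      # This sketch validates all recorded hashes; a consumer may instead
      # choose to check only a subset of them.
      return not mismatched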

Open Issues
===========

Availability of the provenance_url.json file in Conda
-----------------------------------------------------

We would like to get feedback on the ``provenance_url.json`` file by Conda
maintainers or developers. It is not clear whether Conda would like to adopt
the ``provenance_url.json`` file.

Using provenance_url.json in downstream installers
--------------------------------------------------

The proposed ``provenance_url.json`` file was meant to be adopted primarily by
Python installers. Other installers, such as apt or dnf, might record
provenance of the installed downstream Python distributions in their specific
way that can be specific to downstream package management. The proposed file is
not expected to be created by these downstream package installers and thus they
were intentionally left out of this PEP. However, any input by developers or
maintainers of these installers is valuable to possibly enrich the
``provenance_url.json`` file with information that would help in some way.

Backwards Compatibility
=======================

Since this PEP specifies a new file in the ``*.dist-info`` directory, there are
no backwards compatibility implications to consider in the ``provenance_url.json``
file itself. Also, this proposal does not make any changes to the
``direct_url.json`` described in :pep:`610` and `its corresponding canonical
PyPA spec
<https://packaging.python.org/en/latest/specifications/direct-url/>`__.

The content of the ``provenance_url.json`` file was designed to eventually
allow installers to reuse some of the logic supporting :pep:`610` when a
direct URL refers to a source archive or a wheel.

References
==========

The following changes were done to pip to support this PEP:

.. [1] `A patch to pip introducing provenance_url.json as discussed in this PEP
  <https://github.com/fridex/pip/pull/1/>`__

.. [2] `Changes to pip to support the decision for creating
  provenance_url.json instead of stating provenance in already existing
  direct_url.json <https://github.com/fridex/pip/pull/2/>`__

.. [3] `Direct URL Data Structure
  <https://packaging.python.org/en/latest/specifications/direct-url-data-structure/>`__

Acknowledgements
================

Thanks to Dustin Ingram, Brett Cannon, and Paul Moore for the initial
discussion in which this idea originated.

Thanks to Donald Stufft, Ofek Lev, and Trishank Kuppusamy for early feedback
and support to work on this PEP.

Thanks to Gregory P. Smith and Stéphane Bidoul for reviewing this PEP and
providing valuable suggestions.

Thanks to Stéphane Bidoul and Chris Jerdonek for :pep:`610`.

Last, but not least, thanks to Donald Stufft for sponsoring this PEP.

Copyright
=========

This document is placed in the public domain or under the CC0-1.0-Universal
license, whichever is more permissive.

A rendered version of the draft PEP can be found on GitHub.

Thanks for the PEP! I left review comments on your github PR.

Thank you for the review. All the comments should be addressed now. Please feel free to raise any other concerns or comments regarding the PEP.

I’m assuming I’ll be PEP delegate for this. Could we please ensure that any discussions on the PEP (other than purely editorial issues like typos) happen here, and not on the PR? I will not be reviewing the PR comments, and even if I did, I find it difficult to follow the discussion given that changes are getting force-pushed, meaning a lot of things are marked as “outdated”.

Also, you still don’t appear to be following the normal process. I see that @dstufft is the PEP sponsor, I suggest having a word with him to ensure that you’re following the correct process. I would have expected a draft PEP to come out of an initial discussion on Discourse, and the PEP to link to that initial discussion for background. This PEP appears to have sprung out of nowhere, with no initial discussion or even mention on Discourse. Also, you need to get the PEP committed and assigned a number before even starting to talk about submitting it for approval, and as was explained in the previous thread

I’ll add some further comments on the proposal itself in a follow-up message.

As I said, that’s not normally how the process works for packaging PEPs. There should be a discussion, and once there’s a clear consensus that the idea is good, then the PEP should be submitted. If no-one comments on the PEP, and you submit it as you propose, then I’ll probably reject it on the basis that no-one seems interested in it.

The initial thread on this PEP had no comments on the actual proposal, just on the process. This thread has so far only had your comments, and the note from @gpshead. The fact that no-one else has commented either for or against the proposal concerns me. Packaging proposals are never this quiet :slightly_smiling_face:

I reckon that folks are waiting on this merging in the PEPs repository before engaging in a heavier discussion on this.

That’s what I’m doing. I left some grammatical comments and one follow-up comment on something that initially caught my eye, but otherwise I’m waiting until it has a PEP number since it already has a sponsor.

Some comments on the proposal itself.

The motivation section feels weak to me - there’s a lot of examples of things that “might” be useful, but no actual examples where it will be used. If there’s no actual use cases, how do we know that the design is right? And how do we know that we’re not just adding complexity which might never get used? Do we have examples of people asking for this information, that can be linked to in the PEP?

It’s not at all clear to me how the “url” field will be filled in. What URL is appropriate for (for example) pip install ./app? It seems pointless recording the source directory, the user could rename or delete it. Or simply change the content - which makes the link useless for provenance purposes. And what should the installer do if (for example) it installs a cached copy of a wheel built from source? The cached wheel is identified purely by name/version, so there’s no guarantee that it came from the same location as would have been chosen if the cached file wasn’t there. Without good examples of real use cases, it’s impossible to guess what a “reasonable” answer is to these types of question.

You say there’s no backwards compatibility issue because it’s a new file. That’s not entirely accurate - it’s also important that the file is optional - otherwise, consumers would be impacted as they would have to be prepared to deal with “legacy” installations that have neither provenance_url.json nor direct_url.json. As it is, there’s no backward compatibility issue, but there is a usability issue - if the data available is by definition potentially incomplete, is that sufficient? Once again, without use cases, it’s impossible to say. (I imagine, for example, that filing SBOM paperwork that states that the information provided might be incomplete because there’s no requirement for the tools to have recorded everything, might be an issue…)

You’ve provided a reference implementation for pip (although I haven’t looked at it in any detail). What about other installers like installer? Or conda (my understanding is that conda tends to implement most of the “installed database of packages” standards, for compatibility if nothing else)? As a standard, the expectation is that all installers will implement this proposal. Have other installer projects confirmed that they are OK with this? For that matter, has pip? In the same way as the PEP, the pip PR seems to have appeared, but had no feedback or comment from the pip maintainers, so I don’t think the mere existence of the PR implies that pip is willing to implement this PEP. This is another place where getting supportive feedback is important to getting the proposal approved.

That’s fair, but on that basis, the idea of setting a deadline for when the PEP will be submitted for approval seems even less reasonable… (It was the deadline that prompted me to comment, otherwise I would also have ignored this until it was merged).

It could definitely be useful for SBOMs, and along those lines it would allow for pip freeze (or similar) to generate a very accurate lock file-like output for the current environment. For instance, if we ever get a lock file format there won’t be a way to take an environment that already exists and work backwards to help generate that initial lock file.

I don’t think there’s anything wrong with having a draft PEP to kick off discussion. PEP 1 certainly doesn’t require you to wait before writing a draft. It does suggest discussing first, but that suggestion’s purpose, as documented in PEP 1, is to save the author the time of writing a PEP if the idea is going to be rejected anyway, rather than being an inherent part of the process [1].

In this case, I think it would be silly to pretend there wasn’t a draft PEP already written simply to blindly follow some idealized process, the document was already written, it seems entirely reasonable to include it in the discussion about whether this idea should progress further or not.

Fridolin wasn’t saying he was going to submit it for approval, he said he was going to submit it as a PEP to get a number assigned.

I told Fridolin that I frequently express when I intend to do something when I plan to submit some PEP, because IME unless someone is extremely against something they’ll often times put it on their list to comment on later, and then get surprised when I submit something or do something before they thought I was going to.

Thus I recommended mentioning when he intended to submit for a PEP number, assuming of course that there wasn’t ongoing discussion at that point or that the discussion hadn’t changed those plans, so that nobody is surprised by it, and it tends to seem to push people to comment more often IME.

I haven’t had a chance to comment yet, but I think this proposal is a reasonable idea that improves the auditability of an existing installation to know where what came from. I have longer form thoughts on it that I don’t have time to get into at the moment, but I’m +1 on the overall idea.


  1. To be honest, a great many PEPs go straight to a PEP without much if any “pre PEP” discussion. It feels like we’re holding this particular PEP to a different standard than we hold any other PEP that I’ve ever been involved with.

The “deadline” (which wasn’t really a deadline, just Fridolin communicating up front when he intended to do something unless the current discussion altered that plan) was for submitting the PEP to get a PEP number assigned.

Same, FWIW.

OK. That explains the confusion. I suspect there won’t be much discussion before then, in that case. People have already pointed out that they prefer to wait for a PEP number.

I generally think this is a useful idea. It enables the “pip freeze with hashes” use case that is sometimes requested on the pip tracker.

I have not had time to review in depth, but one thing I would like to suggest is to reference the direct URL data structure instead of re-specifying it in the PEP with possible subtle differences that would make its generation and consumption harder for implementers. To enable that, I had submitted a PR to packaging.python.org to have the data structure specification in a standalone document so it is easier to reference independently of the direct_url.json / PEP 610 context.

Forgot to ask, I assumed you’ll be the PEP delegate, but just wanted to get your :stamp: on being that rather than just assume it!

Yep, that’s fine with me.

I have a related question (sorry for the ignorance, probably this is more related to PEP 610 itself).

Imagine a scenario where a package is built from a transient local directory that will be removed from the machine after the installation is completed (for example, a backend bootstrapping scenario[1]).

According to the text in the PEP, this process would only have to create a direct_url.json file but no provenance_url.json, right? Is the direct_url.json required even if the local directory will be deleted later and will not be available to anyone reading direct_url.json?

Will it be OK for a package to have neither direct_url.json nor provenance_url.json? (Will installers like pip infer anything in this circumstance and avoid updating the package?)


  1. I have the impression that this is similar to what can be done using flit_core.boostrap_install to bootstrap a Python “dev-ready” environment in a Linux distribution or other OS.

Just chiming in that I generally like the idea as well and am also waiting for this to be merged into the PEP repo as a draft before providing a detailed review.

This case should be covered by PEP-610. The proposed draft PEP is addressing what is missing in PEP-610, which is installing packages by their names, and optionally their version specifiers:

The provenance_url.json file SHOULD be created in the *.dist-info
directory by installers when installing a distribution identified by their
name, and optionally their version specifier.

This file MUST NOT be created when installing a distribution from a requirement
specifying a direct URL reference (including a VCS URL).

Only one of provenance_url.json and direct_url.json from :pep:610
files MAY be present in *.dist-info directory.

There is also section Example pip commands and their effect on provenance_url.json and direct_url.json. pip creates direct_url.json file for this case as of today - it follows what is stated in PEP-610:

$ pip install ./app
...
$ cat .venv/lib/python3.10/site-packages/foo-1.0.0.dist-info/direct_url.json | jq
{
  "dir_info": {},
  "url": "file:///Users/fridolin.pokorny/git/fridex/pip/app"
}
$ cat .venv/lib/python3.10/site-packages/foo-1.0.0.dist-info/provenance_url.json
cat: .venv/lib/python3.10/site-packages/foo-1.0.0.dist-info/provenance_url.json: No such file or directory

This should be addressed in this comment. I’ve adjusted the draft PEP to explicitly state this case.

Based on the draft PEP the proposed provenance_url.json:

The provenance_url.json file SHOULD be created in the *.dist-info
directory by installers when installing a distribution identified by their
name, and optionally their version specifier.

Is SHOULD sufficient here to avoid the mentioned compatibility issues and to keep adoption by other installers optional?


I will not be reviewing the PR comments, and even if I did, I find it difficult to follow the discussion given that changes are getting force-pushed, meaning a lot of things are marked as “outdated”.

I tried to keep the git history clean, hence the force pushing - I made changes between meetings, sometimes with typos or merge conflicts. If someone would like to review only commits, they should state the relevant diff and comments. BTW, GitHub marks comments as outdated because they were added to a line that no longer exists because of subsequent changes (example).