PEP 710 - Recording the provenance of installed packages

Hi all,

Following the discussion in Draft PEP: Recording provenance of installed packages, I am sharing PEP 710 with you.

PEP 710 can be viewed online; a copy is also attached below.

The PEP was written thanks to discussions in:

The PEP lists open questions, mostly around installers such as Conda and downstream package installers. It also includes a survey of existing installers that could be affected by the proposed change.

A PoC that implements the proposed changes in pip can be found in PR pypa/pip#11865. A small PoC that can reconstruct a Python environment from the proposed provenance_url.json and direct_url.json files can be found at github.com/fridex/pip-preserve. Here is a test to check whether pip freeze would break if the provenance were recorded in the already existing direct_url.json file.

Thanks to everyone who already participated in discussions, and looking forward to any follow up discussions and improvements to the PEP.

Thanks,
F.

PEP sponsor: @dstufft
PEP delegate: @pf_moore

PEP 710
PEP: 710
Title: Recording the provenance of installed packages
Author: Fridolín Pokorný <fridolin.pokorny at gmail.com>
Sponsor: Donald Stufft <donald@stufft.io>
PEP-Delegate: Paul Moore <p.f.moore@gmail.com>
Discussions-To: https://discuss.python.org/t/pep-710-recording-the-provenance-of-installed-packages/25428
Status: Draft
Type: Standards Track
Topic: Packaging
Content-Type: text/x-rst
Created: 27-Mar-2023
Post-History: `03-Dec-2021 <https://discuss.python.org/t/pip-installation-reports/12316>`__,
              `30-Jan-2023 <https://discuss.python.org/t/pre-pep-recording-provenance-of-installed-packages/23340>`__,
              `14-Mar-2023 <https://discuss.python.org/t/draft-pep-recording-provenance-of-installed-packages/24838>`__,
              `03-Apr-2023 <https://discuss.python.org/t/pep-710-recording-the-provenance-of-installed-packages/25428>`__,

Abstract
========

This PEP describes a way to record the provenance of installed Python distributions.
The record is created by an installer and is available to users in
the form of a JSON file ``provenance_url.json`` in the ``.dist-info`` directory.
The mentioned JSON file captures additional metadata to allow recording a URL to a
:term:`distribution package` together with the installed distribution hash. This
proposal is built on top of :pep:`610` following
:ref:`its corresponding canonical PyPA spec <packaging:direct-url>` and
complements ``direct_url.json`` with ``provenance_url.json`` for when packages
are identified by a name, and optionally a version.

Motivation
==========

Installing a Python :term:`Project` involves downloading a :term:`Distribution Package`
from a :term:`Package Index`
and extracting its content to an appropriate place. After the installation
process is done, information about the release artifact used as well as its source
is generally lost. However, there are use cases for keeping records of
distributions used for installing packages and their provenance.

Python wheels can be built with different compiler flags or supporting
different wheel tags.  In both cases, users might get into a situation in which
multiple wheels might be considered by installers (possibly from different
package indexes) and immediately finding out which wheel file was actually used
during the installation might be helpful. This way, developers can use
information about wheels to debug issues making sure the desired wheel was
actually installed. Another use case could be tools reporting software
installed, such as tools reporting an SBOM (Software Bill of Materials), that might
give more accurate reports. Yet another use case could be reconstruction of the
Python environment by pinning each installed package to a specific distribution
artifact consumed from a Python package index.

Rationale
=========

The motivation described in this PEP is an extension of that in :pep:`610`.
In addition to recording provenance information for packages installed using a direct URL,
installers should also do so for packages installed by name
(and optionally version) from Python package indexes.

The idea described in this PEP originated in a tool called `micropipenv`_
that is used to install
:term:`distribution packages <Distribution Package>` in containerized
environments (see the reported issue `thoth-station/micropipenv#206`_).
Currently, the assembled containerized application does not implicitly carry
information about the provenance of installed distribution packages
(unless these are installed from full URLs and recorded via ``direct_url.json``).
This requires container image suppliers to link
container images with the corresponding build process, its configuration and
the application source code for checking requirements files in cases when
software present in containerized environments needs to be audited.

The `subsequent discussion in the Discourse thread
<https://discuss.python.org/t/12316>`__ also brought up
pip's new ``--report`` option that can
`generate a detailed JSON report <pip_installation_report_>`__ about
the installation process. This option could help with the provenance problem
this PEP approaches. Nevertheless, this option needs to be *explicitly* passed
to pip to obtain the provenance information, and includes additional metadata that
might not be necessary for checking the provenance (such as Python version
requirements of each distribution package). Also, this option is
specific to pip as of the writing of this PEP.

Note the current :ref:`spec for recording installed packages
<packaging:recording-installed-packages>` defines a ``RECORD`` file that
records installed files, but not the distribution artifact from which these
files were obtained. Auditing installed artifacts can be performed
based on matching the entries listed in the ``RECORD`` file. However, this
technique requires a pre-computed database of files each artifact provides or a
comparison with the actual artifact content. Both approaches are relatively
expensive and time consuming operations which could be eliminated with the
proposed ``provenance_url.json`` file.

Recording provenance information for installed distribution packages,
both those obtained from direct URLs and by name/version from an index,
can simplify auditing Python environments in general, beyond just
the specific use case for containerized applications mentioned earlier.
A community project `pip-audit
<https://github.com/pypa/pip-audit>`__ raised their possible interest in
`pypa/pip-audit#170`_.

Specification
=============

The keywords “MUST”, “MUST NOT”, “REQUIRED”, “SHOULD”,
“SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL”
in this document are to be interpreted as described in :rfc:`2119`.

The ``provenance_url.json`` file SHOULD be created in the ``.dist-info``
directory by installers when installing a :term:`Distribution Package`
specified by name (and optionally by :term:`Version Specifier`).

This file MUST NOT be created when installing a distribution package from a requirement
specifying a direct URL reference (including a VCS URL).

Only one of the files ``provenance_url.json`` and ``direct_url.json`` (from :pep:`610`),
may be present in a given ``.dist-info`` directory; installers MUST NOT add both.
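
As a rough illustration of the mutual-exclusion rule above, a consumer could
check a ``.dist-info`` directory as follows. This is only a sketch: the helper
name ``find_origin_file`` and the directory name ``app-1.0.dist-info`` are
made up for the example and are not part of this PEP.

```python
import tempfile
from pathlib import Path
from typing import Optional

ORIGIN_FILES = ("direct_url.json", "provenance_url.json")


def find_origin_file(dist_info: Path) -> Optional[str]:
    """Return which origin file a .dist-info directory contains, if any."""
    present = [name for name in ORIGIN_FILES if (dist_info / name).is_file()]
    if len(present) > 1:
        # Installers MUST NOT add both files, so treat this as an error.
        raise ValueError(f"{dist_info} contains both {present[0]} and {present[1]}")
    return present[0] if present else None


# Demonstration on a throwaway directory (hypothetical distribution name).
with tempfile.TemporaryDirectory() as tmp:
    dist_info = Path(tmp) / "app-1.0.dist-info"
    dist_info.mkdir()
    before = find_origin_file(dist_info)
    (dist_info / "provenance_url.json").write_text("{}", encoding="utf-8")
    after = find_origin_file(dist_info)
```

Here ``before`` is ``None`` and ``after`` names the provenance file; a
directory holding both files raises an error instead of silently picking one.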

The ``provenance_url.json`` JSON file MUST be a dictionary, compliant with
:rfc:`8259` and UTF-8 encoded.

If present, it MUST contain exactly two keys. The first one is ``url``, with
type ``string``.  The second key MUST be ``archive_info`` with a value defined
below.

The value of the ``url`` key MUST be the URL from which the distribution package was downloaded. If a wheel is
built from a source distribution, the ``url`` value MUST be the URL from which
the source distribution was downloaded. If a wheel is downloaded and installed directly,
the ``url`` field MUST be the URL from which the wheel was downloaded.
As in the :ref:`direct URL origin specification<packaging:direct-url>`, the ``url`` value
MUST be stripped of any sensitive authentication information for security reasons.

The user:password section of the URL MAY however be composed of environment
variables, matching the following regular expression:

.. code-block:: text

    \$\{[A-Za-z0-9-_]+\}(:\$\{[A-Za-z0-9-_]+\})?

Additionally, the user:password section of the URL MAY be a well-known,
non-security sensitive string. A typical example is ``git`` in the case of a
URL such as ``ssh://git@gitlab.com``.
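
The two exceptions above could be checked along the following lines. This is
a sketch under stated assumptions, not an installer's actual logic: the
function name ``userinfo_is_acceptable`` and the ``WELL_KNOWN_USERINFO`` set
are hypothetical, and only ``git`` is taken from the text above.

```python
import re
from urllib.parse import urlsplit

# Same character set as the expression above, with the hyphen moved to the
# end of the class so that it is unambiguously a literal.
ENV_VAR_USERINFO = re.compile(r"\$\{[A-Za-z0-9_-]+\}(:\$\{[A-Za-z0-9_-]+\})?")

# Assumption: the PEP only gives "git" as an example of a well-known,
# non-sensitive user name; an installer may choose a different set.
WELL_KNOWN_USERINFO = frozenset({"git"})


def userinfo_is_acceptable(url: str) -> bool:
    """Return True if the URL carries no credentials that need stripping."""
    userinfo, separator, _host = urlsplit(url).netloc.rpartition("@")
    if not separator:
        return True  # no user:password section at all
    if userinfo in WELL_KNOWN_USERINFO:
        return True
    return ENV_VAR_USERINFO.fullmatch(userinfo) is not None
```

A URL for which this returns ``False`` would need its user:password section
removed before being written to ``provenance_url.json``.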

The value of ``archive_info`` MUST be a dictionary with a single key
``hashes``.  The value of ``hashes`` is a dictionary mapping hash function names to a
hex-encoded digest of the file referenced by the ``url`` value. Multiple hashes
can be included, and it is up to the consumer to decide what to do with
multiple hashes (it may validate all of them or a subset of them, or nothing at
all).
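
For illustration, the ``archive_info`` value could be produced with
``hashlib`` roughly as shown below. The helper name ``compute_hashes`` and the
stand-in payload are assumptions made for this sketch; an installer would
read the downloaded artifact instead.

```python
import hashlib
import io
from typing import BinaryIO, Dict, Iterable


def compute_hashes(
    stream: BinaryIO,
    algorithms: Iterable[str] = ("sha256",),
    chunk_size: int = 65536,
) -> Dict[str, str]:
    """Hash a binary stream once, feeding every requested algorithm."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for hasher in hashers.values():
            hasher.update(chunk)
    # hexdigest() yields the hex-encoded digest the specification asks for.
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


# Stand-in bytes; a real installer would open the downloaded file in "rb" mode.
payload = b"stand-in for the bytes of a downloaded artifact"
archive_info = {"hashes": compute_hashes(io.BytesIO(payload), ("sha256", "blake2s"))}
```

Reading the artifact in chunks keeps memory use bounded for large wheels while
still computing several digests in a single pass.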

Each hash MUST be one of the single argument hashes provided by
:data:`py3.11:hashlib.algorithms_guaranteed`, excluding ``sha1`` and ``md5`` which MUST NOT be used.
As of Python 3.11, with ``shake_128`` and ``shake_256`` excluded
for being multi-argument, the allowed set of hashes is:

.. code-block:: python

  >>> import hashlib
  >>> sorted(hashlib.algorithms_guaranteed - {"shake_128", "shake_256", "sha1", "md5"})
  ['blake2b', 'blake2s', 'sha224', 'sha256', 'sha384', 'sha3_224', 'sha3_256', 'sha3_384', 'sha3_512', 'sha512']

Each hash MUST be referenced by the canonical name of the hash, always lower case.

Hashes ``sha1`` and ``md5`` MUST NOT be present, due to the security
limitations of these hash algorithms. Conversely, hash ``sha256`` SHOULD
be included.
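
The structural rules in this section can be summarised by a small checker
such as the one below. It is a non-normative sketch: the function name
``validate_provenance`` and the wording of its messages are invented here,
and it accepts digests in either letter case because the text above only
requires hex encoding.

```python
import json
import re
from typing import List

# Allowed names, copied from the list above (as of Python 3.11).
ALLOWED_HASHES = frozenset({
    "blake2b", "blake2s", "sha224", "sha256", "sha384",
    "sha3_224", "sha3_256", "sha3_384", "sha3_512", "sha512",
})
HEX_DIGEST = re.compile(r"[0-9a-fA-F]+")


def validate_provenance(text: str) -> List[str]:
    """Return a list of problems; an empty list means these checks passed."""
    data = json.loads(text)
    if not isinstance(data, dict):
        return ["top-level value must be a dictionary"]
    problems = []
    if set(data) != {"url", "archive_info"}:
        problems.append("expected exactly the keys 'url' and 'archive_info'")
    if not isinstance(data.get("url"), str):
        problems.append("'url' must be a string")
    archive_info = data.get("archive_info")
    if not isinstance(archive_info, dict) or set(archive_info) != {"hashes"}:
        problems.append("'archive_info' must be a dictionary with the single key 'hashes'")
        return problems
    hashes = archive_info["hashes"]
    if not isinstance(hashes, dict):
        problems.append("'hashes' must be a dictionary")
        return problems
    for name, digest in hashes.items():
        if name not in ALLOWED_HASHES:
            problems.append(f"hash name {name!r} is not allowed")
        if not isinstance(digest, str) or not HEX_DIGEST.fullmatch(digest):
            problems.append(f"digest for {name!r} is not hex-encoded")
    return problems


# Hypothetical documents; the URL and digest are placeholders.
DIGEST = "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f"
URL = "https://example.com/app-1.0-py3-none-any.whl"
good = json.dumps({"archive_info": {"hashes": {"sha256": DIGEST}}, "url": URL})
bad_name = json.dumps({"archive_info": {"hashes": {"SHA-256": DIGEST}}, "url": URL})
extra_key = json.dumps(
    {"archive_info": {"hash": "sha256=" + DIGEST, "hashes": {"sha256": DIGEST}}, "url": URL}
)
```

The checker does not enforce the SHOULD-level recommendation to include
``sha256``; a stricter tool could report its absence as a warning.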

Installers that cache distribution packages from an index SHOULD keep
information related to the cached distribution artifact, so that
the ``provenance_url.json`` file can be created even when installing distribution packages
from the installer's cache.

Backwards Compatibility
=======================

Following the :ref:`packaging:recording-installed-packages` specification,
installers may keep additional installer-specific files in the ``.dist-info``
directory.  To make sure this PEP does not cause any backwards compatibility
issues, a :ref:`comprehensive survey of installers and libraries <710-tool-survey>`
found no current tools that are using a similarly-named file,
or other major feasibility concerns.

The :ref:`Wheel specification <packaging:binary-distribution-format>` lists files that can be
present in the ``.dist-info`` directory. None of these file names collide with
the proposed ``provenance_url.json`` file from this PEP.

Presence of provenance_url.json in installers and libraries
-----------------------------------------------------------

A comprehensive survey of the existing installers, libraries, and dependency
managers in the Python ecosystem analyzed the implications of adding support for
``provenance_url.json`` to each tool.
In summary, no major backwards compatibility issues, conflicts or feasibility blockers
were found as of the time of writing of this PEP. More details about the survey
can be found in the :ref:`710-tool-survey` section.

Compatibility with direct_url.json
----------------------------------

This proposal does not make any changes to the ``direct_url.json`` file
described in :pep:`610` and :ref:`its corresponding canonical PyPA spec
<direct-url>`.

The content of the ``provenance_url.json`` file was designed to eventually
allow installers to reuse some of the logic supporting ``direct_url.json`` when a
direct URL refers to a source archive or a wheel.

The main difference between the ``provenance_url.json`` and ``direct_url.json``
files is the set of mandatory keys and their values in the ``provenance_url.json`` file.
This helps make sure consumers of the ``provenance_url.json`` file can rely
on its content, if the file is present in the ``.dist-info`` directory.

Security Implications
=====================

One of the main security features of the ``provenance_url.json`` file is the
ability to audit installed artifacts in Python environments. Tools can check
which Python package indexes were used to install Python :term:`distribution
packages <Distribution Package>` as well as the hash digests of their release
artifacts.

As an example, we can take the recent compromised dependency chain in `the
PyTorch incident <https://pytorch.org/blog/compromised-nightly-dependency/>`__.
The PyTorch index provided a package named ``torchtriton``. An attacker
published ``torchtriton`` on PyPI, which ran a malicious binary. By checking
the URL of the installed Python distribution stated in the
``provenance_url.json`` file, tools can automatically check the source of the
installed Python distribution. In case of the PyTorch incident, the URL of
``torchtriton`` should point to the PyTorch index, not PyPI. Tools can help
identify such malicious Python distributions by checking the
installed Python distribution URL. A more exact check can also include the hash
of the installed Python distribution stated in the ``provenance_url.json``
file. Such checks on hashes can be helpful for mirrored Python package indexes
where Python distributions are not distinguishable by their source URLs, making
sure only desired Python package distributions are installed.
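
The URL check described above might look roughly like the following sketch.
The helper names, the ``expected`` mapping and the example URLs are all made
up for illustration. Reading the file relies on
``importlib.metadata.Distribution.read_text``, which returns ``None`` when
the file is absent.

```python
import json
from importlib import metadata
from typing import Dict, List
from urllib.parse import urlsplit


def installed_provenance_urls() -> Dict[str, str]:
    """Map installed distribution names to the URL in provenance_url.json."""
    urls = {}
    for dist in metadata.distributions():
        text = dist.read_text("provenance_url.json")  # None if not recorded
        if text is not None:
            urls[dist.metadata["Name"]] = json.loads(text)["url"]
    return urls


def unexpected_sources(urls: Dict[str, str], expected_hosts: Dict[str, str]) -> List[str]:
    """Return packages whose artifact host differs from the expected one."""
    return sorted(
        name
        for name, url in urls.items()
        if name in expected_hosts and urlsplit(url).hostname != expected_hosts[name]
    )


# Made-up records; real input would come from installed_provenance_urls().
urls = {
    "torchtriton": "https://files.pythonhosted.org/packages/torchtriton.whl",
    "torch": "https://download.pytorch.org/whl/torch.whl",
}
# Assumption: the policy pins both packages to a single index host.
expected = {"torchtriton": "download.pytorch.org", "torch": "download.pytorch.org"}
suspicious = unexpected_sources(urls, expected)
```

With these inputs only ``torchtriton`` is reported, since its recorded URL
points at a host other than the one the policy expects.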

A malicious actor can intentionally adjust the content of
``provenance_url.json`` to possibly hide provenance information of the
installed Python distribution. A security check which would uncover such
malicious activity is beyond the scope of this PEP as it would require monitoring
actions on the filesystem and eventually reviewing user or file permissions.

How to Teach This
=================

The ``provenance_url.json`` metadata file is intended for tools and is not
directly visible to end users.

Examples
========

Examples of a valid provenance_url.json
---------------------------------------

A valid ``provenance_url.json`` listing multiple hashes:

.. code-block:: json

  {
    "archive_info": {
      "hashes": {
        "blake2s": "fffeaf3d0bd71dc960ca2113af890a2f2198f2466f8cd58ce4b77c1fc54601ff",
        "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f",
        "sha3_256": "c856930e0f707266d30e5b48c667a843d45e79bb30473c464e92dfa158285eab",
        "sha512": "6bad5536c30a0b2d5905318a1592948929fbac9baf3bcf2e7faeaf90f445f82bc2b656d0a89070d8a6a9395761f4793c83187bd640c64b2656a112b5be41f73d"
      }
    },
    "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
  }

A valid ``provenance_url.json`` listing a single hash entry:

.. code-block:: json

  {
    "archive_info": {
      "hashes": {
        "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f"
      }
    },
    "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
  }

A valid ``provenance_url.json`` listing a source distribution which was used to
build and install a wheel:

.. code-block:: json

  {
    "archive_info": {
      "hashes": {
        "sha256": "8bfe29f17c10e2f2e619de8033a07a224058d96b3bfe2ed61777596f7ffd7fa9"
      }
    },
    "url": "https://files.pythonhosted.org/packages/1d/43/ad8ae671de795ec2eafd86515ef9842ab68455009d864c058d0c3dcf680d/micropipenv-0.0.1.tar.gz"
  }

Examples of an invalid provenance_url.json
------------------------------------------

The following example includes a ``hash`` key in the ``archive_info`` dictionary
as originally designed in :pep:`610` and the data structure documented in
:ref:`packaging:direct-url`.
The ``hash`` key MUST NOT be present, to prevent any possible confusion
with ``hashes`` and to avoid the additional checks that would be required to
keep the hash values in sync.

.. code-block:: json

  {
    "archive_info": {
      "hash": "sha256=236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f",
      "hashes": {
        "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f"
      }
    },
    "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
  }

Another example demonstrates an invalid hash name. The referenced hash name does not
correspond to the canonical hash names described in this PEP and
in the Python docs under :attr:`py3.11:hashlib.hash.name`.

.. code-block:: json

  {
    "archive_info": {
      "hashes": {
        "SHA-256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f"
      }
    },
    "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
  }


Example pip commands and their effect on provenance_url.json and direct_url.json
--------------------------------------------------------------------------------

These commands generate a ``direct_url.json`` file but do not generate a
``provenance_url.json`` file. These examples follow examples from :pep:`610`:

* ``pip install https://example.com/app-1.0.tgz``
* ``pip install https://example.com/app-1.0.whl``
* ``pip install "git+https://example.com/repo/app.git#egg=app&subdirectory=setup"``
* ``pip install ./app``
* ``pip install file:///home/user/app``
* ``pip install --editable "git+https://example.com/repo/app.git#egg=app&subdirectory=setup"`` (in which case, ``url`` will be the local directory where the git repository has been cloned to, and ``dir_info`` will be present with ``"editable": true`` and no ``vcs_info`` will be set)
* ``pip install -e ./app``

Commands that generate a ``provenance_url.json`` file but do not generate
a ``direct_url.json`` file:

* ``pip install app``
* ``pip install app~=2.2.0``
* ``pip install app --no-index --find-links "https://example.com/"``

This behaviour can be tested using changes to pip implemented in the PR
`pypa/pip#11865`_.

Reference Implementation
========================

A proof-of-concept for creating the ``provenance_url.json`` metadata file when
installing a Python :term:`Distribution Package` is available in the PR to pip
`pypa/pip#11865`_. It reuses the already available implementation for the
:ref:`direct URL data structure <packaging:direct-url-data-structure>` to provide
the ``provenance_url.json`` metadata file for cases when ``direct_url.json`` is not
created.

A prototype called `pip-preserve <pip_preserve_>`_ was developed to
demonstrate creation of ``requirements.txt`` files considering ``direct_url.json``
and ``provenance_url.json`` metadata files.  This tool mimics the ``pip
freeze`` functionality, but the listing of installed packages also includes
the hashes of the Python distribution artifacts.
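
To give an idea of the kind of output involved, a pinned requirement line
can be derived from a ``provenance_url.json`` document as sketched below.
This is not pip-preserve's implementation: the function name
``requirement_line`` is hypothetical, and limiting the output to
``sha256``, ``sha384`` and ``sha512`` is an assumption based on the
algorithms pip's hash-checking mode is documented to accept.

```python
from typing import Dict

# Assumption: only emit algorithms accepted by pip's hash-checking mode.
PIP_HASH_ALGORITHMS = ("sha256", "sha384", "sha512")


def requirement_line(name: str, version: str, provenance: Dict) -> str:
    """Build a 'name==version --hash=algo:digest' requirement line."""
    hashes = provenance["archive_info"]["hashes"]
    options = [
        f"--hash={algo}:{hashes[algo]}"
        for algo in PIP_HASH_ALGORITHMS
        if algo in hashes
    ]
    return " ".join([f"{name}=={version}", *options])


# Values taken from the multi-hash example earlier in this PEP (abridged).
provenance = {
    "archive_info": {
        "hashes": {
            "blake2s": "fffeaf3d0bd71dc960ca2113af890a2f2198f2466f8cd58ce4b77c1fc54601ff",
            "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f",
        }
    },
    "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl",
}
line = requirement_line("pip", "23.0.1", provenance)
```

The name and version are passed in separately because
``provenance_url.json`` itself records only the URL and the hashes; a tool
would read them from the distribution's ``METADATA`` file.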

Rejected Ideas
==============

Naming the file direct_url.json instead of provenance_url.json
--------------------------------------------------------------

To preserve backwards compatibility with the
:ref:`Direct URL Origin specification <packaging:direct-url>`,
the file cannot be named ``direct_url.json``, as per the text of that specification:

  This file MUST NOT be created when installing a distribution from an other
  type of requirement (i.e. name plus version specifier).

Such a change might introduce backwards compatibility issues for consumers of
``direct_url.json`` who rely on its presence only when distributions are
installed using a direct URL reference.

Deprecating direct_url.json and using only provenance_url.json
--------------------------------------------------------------

File ``direct_url.json`` is already well established with :pep:`610` being accepted and is
already used by installers. For example, ``pip`` uses ``direct_url.json`` to
report a direct URL reference on ``pip freeze``. Deprecating
``direct_url.json`` would require additional changes to the ``pip freeze``
implementation in pip (see PR `fridex/pip#2`_) and could introduce backwards compatibility
issues for already existing ``direct_url.json`` consumers.

Keeping the hash key in the archive_info dictionary
---------------------------------------------------

:pep:`610` and :ref:`its corresponding canonical PyPA spec <direct-url>` discuss
the possibility to include the ``hash`` key alongside the ``hashes`` key in the
``archive_info`` dictionary. This PEP explicitly does not include the ``hash`` key in
the ``provenance_url.json`` file and allows only the ``hashes`` key to be present.
By doing so we eliminate possible redundancy in the file, possible confusion,
and any additional checks that would need to be done to make sure the hashes are in
sync.

Making the hashes key optional
------------------------------

:pep:`610` and :ref:`its corresponding canonical PyPA spec <direct-url>`
recommend including the ``hashes`` key of the ``archive_info`` in the
``direct_url.json`` file but it is not required (per the :rfc:`2119` language):

  A hashes key SHOULD be present as a dictionary mapping a hash name to a hex
  encoded digest of the file.

This PEP requires the ``hashes`` key be included in ``archive_info``
in the ``provenance_url.json`` file if that file is created; per this PEP:

  The value of ``archive_info`` MUST be a dictionary with a single key
  ``hashes``.

By doing so, consumers of ``provenance_url.json`` can check
artifact digests when the ``provenance_url.json`` file is created by installers.

Open Issues
===========

Availability of the provenance_url.json file in Conda
-----------------------------------------------------

We would like to get feedback on the ``provenance_url.json`` file from the Conda
maintainers. It is not clear whether Conda would like to adopt the
``provenance_url.json`` file. Conda already stores provenance related
information (similar to the provenance information proposed in this PEP) in
JSON files located in the ``conda-meta`` directory `following its actions
during installation
<https://conda.io/projects/conda/en/latest/dev-guide/deep-dives/install.html>`__.

Using provenance_url.json in downstream installers
--------------------------------------------------

The proposed ``provenance_url.json`` file was meant to be adopted primarily by
Python installers. Other installers, such as APT or DNF, might record the
provenance of the installed downstream Python distributions in their own
way specific to downstream package management. The proposed file is
not expected to be created by these downstream package installers and thus they
were intentionally left out of this PEP. However, any input by developers or
maintainers of these installers is valuable to possibly enrich the
``provenance_url.json`` file with information that would help in some way.

.. _710-tool-survey:

Appendix: Survey of installers and libraries
============================================

pip
---

The function from pip's internal API responsible for installing wheels, named
`_install_wheel
<https://github.com/pypa/pip/blob/10d9cbc601e5cadc45163452b1bc463d8ad2c1f7/src/pip/_internal/operations/install/wheel.py#L432>`__,
does not store any ``provenance_url.json`` file in the ``.dist-info``
directory. Additionally, a prototype introducing the mentioned file to pip in
`pypa/pip#11865`_ demonstrates incorporating logic for handling the
``provenance_url.json`` file in pip's source code.

As pip is used by some of the tools mentioned below to install Python package
distributions, the findings for pip apply to these tools as well, since pip
does not allow parametrizing the creation of files in the ``.dist-info``
directory in its internal API. Most of the tools mentioned below that use pip
invoke it as a subprocess, which has no effect on the eventual presence of the
``provenance_url.json`` file in the ``.dist-info`` directory.

distlib
-------

`distlib`_ implements low-level functionality to manipulate the
``.dist-info`` directory. The database of installed distributions does not use
any file named ``provenance_url.json``, based on `distlib's source code
<https://github.com/pypa/distlib/blob/05375908c1b2d6b0e74bdeb574569d3609db9f56/distlib/database.py#L39-L40>`__.

Pipenv
------

`Pipenv`_ uses pip `to install Python package distributions
<https://github.com/pypa/pipenv/blob/babd428d8ee3c5caeb818d746f715c02f338839b/pipenv/routines/install.py#L262>`__.
No additional logic was identified that would cause backwards
compatibility issues when introducing the ``provenance_url.json`` file in the
``.dist-info`` directory.

installer
---------

`installer`_ does not create a ``provenance_url.json`` file explicitly.
Nevertheless, as per the :ref:`Recording Installed Projects <packaging:recording-installed-packages>`
specification, installer allows passing the ``additional_metadata`` argument to
create a file in the ``.dist-info`` directory - see `the source code
<https://github.com/pypa/installer/blob/f89b5d93a643ef5e9858a6e3f450c83a57bbe1f1/src/installer/_core.py#L67>`__.
To avoid any backwards compatibility issues, any library or tool using
installer must not request creating the ``provenance_url.json`` file using the
mentioned ``additional_metadata`` argument.

Poetry
------

The installation logic in `Poetry`_ depends on the
``installer.modern-installation`` configuration option (`see docs
<https://python-poetry.org/docs/configuration#installermodern-installation>`__).

For cases when the ``installer.modern-installation`` configuration option is set
to ``false``, Poetry uses `pip for installing Python package distributions
<https://github.com/python-poetry/poetry/blob/2b15ce10f02b0c6347fe2f12ae902488edeaaf7c/src/poetry/installation/executor.py#L543-L544>`__.

On the other hand, when the ``installer.modern-installation`` configuration option is
set to ``true``, Poetry uses `installer to install Python package distributions
<https://github.com/python-poetry/poetry/blob/2b15ce10f02b0c6347fe2f12ae902488edeaaf7c/src/poetry/installation/wheel_installer.py#L99-L109>`__.
As can be seen from the linked sources, no additional metadata file named
``provenance_url.json`` is passed that would cause compatibility
issues with this PEP.

Conda
-----

`Conda`_ does not create any ``provenance_url.json`` file
`when Python package distributions are installed
<https://github.com/conda/conda/blob/86e83925e17c68233ac659633bdc4d76b05a245a/conda/common/pkg_formats/python.py#L370-L390>`__.

Hatch
-----

`Hatch`_ uses pip `to install project dependencies
<https://github.com/pypa/hatch/blob/dd6e9545a355a0b5b58e065b489c1ef087e3bcaf/src/hatch/env/system.py#L28-L29>`__.

micropipenv
-----------

As `micropipenv`_ is a wrapper on top of pip, it uses
pip to install Python distributions, for both `lock files
<https://github.com/thoth-station/micropipenv/blob/8176862ec96df23e152938659d6f45645246e398/micropipenv.py#L393>`__
as well as `for requirements files
<https://github.com/thoth-station/micropipenv/blob/8176862ec96df23e152938659d6f45645246e398/micropipenv.py#L977>`__.

Thamos
------

`Thamos`_ uses micropipenv `to install Python package
distributions
<https://github.com/thoth-station/thamos/blob/234351025c77cfe28b0df07f7ee017469b57d3f4/thamos/lib.py#L1290>`__,
hence any findings for micropipenv apply for Thamos.

PDM
---

`PDM`_ uses installer `to install binary distributions
<https://github.com/pdm-project/pdm/blob/d39a8e5b36c37093ea31e666d0e55fe21b38c16b/src/pdm/installers/installers.py#L241>`__.
The only additional metadata file it eventually creates in the ``.dist-info``
directory is `the REFER_TO file
<https://github.com/pdm-project/pdm/blob/d39a8e5b36c37093ea31e666d0e55fe21b38c16b/src/pdm/installers/installers.py#L197>`__.

References
==========

.. _pypa/pip#11865: https://github.com/pypa/pip/pull/11865

.. _fridex/pip#2: https://github.com/fridex/pip/pull/2/

.. _pip_preserve: https://pypi.org/project/pip-preserve/

.. _thoth-station/micropipenv#206: https://github.com/thoth-station/micropipenv/issues/206

.. _pypa/pip-audit#170: https://github.com/pypa/pip-audit/issues/170

.. _pip_installation_report: https://pip.pypa.io/en/stable/reference/installation-report/

.. _distlib: https://distlib.readthedocs.io/

.. _Pipenv: https://pipenv.pypa.io/

.. _installer: https://github.com/pypa/installer

.. _Poetry: https://python-poetry.org/

.. _Conda: https://docs.conda.io/

.. _Hatch: https://hatch.pypa.io/

.. _micropipenv: https://github.com/thoth-station/micropipenv

.. _Thamos: https://github.com/thoth-station/thamos/

.. _PDM: https://pdm.fming.dev/

Acknowledgements
================

Thanks to Dustin Ingram, Brett Cannon, and Paul Moore for the initial discussion in
which this idea originated.

Thanks to Donald Stufft, Ofek Lev, and Trishank Kuppusamy for early feedback
and support to work on this PEP.

Thanks to Gregory P. Smith, Stéphane Bidoul, and C.A.M. Gerlach for
reviewing this PEP and providing valuable suggestions.

Thanks to Stéphane Bidoul and Chris Jerdonek for :pep:`610`.

Last, but not least, thanks to Donald Stufft for sponsoring this PEP.

Copyright
=========

This document is placed in the public domain or under the CC0-1.0-Universal
license, whichever is more permissive.

Not to forget, a comment was also raised about whether this PEP should depend more on PEP-610/691 and the direct URL structure; see the relevant discussion.

Is there any interest in this proposal?

If no one voices support for this in the next 2-3 weeks, that might be a good argument to mark this PEP as “Deferred” noting a lack of interest.

I’m interested as I think this would allow one to generate a lock file from what’s installed.

My sincere apologies for being inactive here - I’m not very active this month in general. If the community finds the proposed PEP valuable, feel free to let me know what is needed from my end to support the proposal and make some progress here. Thanks!

I am also interested in this PEP. Thanks for creating this PEP @fridex, it goes hand-in-hand with a PEP on SBOMs that I would like to put together (reach out if you are interested in participating in this effort as well).

From my perspective, it would be great to be able to say that an installer that implements both PEP 610 and 710 would always have a corresponding direct_url.json or provenance_url.json file for each installed package depending on how it was installed. Towards that end:

  • What are your thoughts on changing the SHOULD to a MUST when it comes to installers which implement caching?
  • Is there any other situation where an installer could install a distribution without resulting in direct_url.json or provenance_url.json? If not, that may be worth calling out in one or both PEPs to weed out any edge cases that haven’t been thought of (if any) and provide more confidence for tools building on the outputs of these two standards.

Another thing I thought of was potentially recording the index where the final installable URL was sourced from, for example “https://pypi.org/simple”, since it’s possible for indices to mirror PyPI’s hosted files via redirects. I’m unsure how much that matters from an SBOM POV since a package installed directly versus indirectly from PyPI would result in the same thing getting installed and referenced in an SBOM (ie pkg:pypi/urllib3@2.0.3) but captures the installer intent which may prove useful for other tooling (ie audit that all installed packages are from the internal mirror).

Oh hey, speaking for the conda maintainers, I apologize for having missed this call, just wanted to mention that I’ll review the PEP more closely.

I’m not clear on what conda would gain by supporting PEP 710 itself, but conda-lock might be well-suited with its experimental support for mixed conda/pip environments (aka pip in conda environments). I’ll try to get conda-lock’s author interested in this PEP draft.

I’m interested, feel free to suggest the best channel to sync.

If there would be no objection, we could use MUST. If I remember correctly, we wanted to be backwards compatible (I can try to look up the relevant conversation if needed). The PEP states:

This proposal is built on top of PEP 610 following its corresponding canonical PyPA spec and complements direct_url.json with provenance_url.json for when packages are identified by a name, and optionally a version.
…
Only one of the files provenance_url.json and direct_url.json (from PEP 610), may be present in a given .dist-info directory; installers MUST NOT add both.

which could also answer:

If it’s worth being more explicit about the presence of at least one of these files, we can add it. I cannot think of any case where neither direct_url.json nor provenance_url.json would be present. Please raise any concerns here.
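A minimal sketch of how a tool could check this expectation against a real environment. It assumes an installer that implements both PEP 610 and PEP 710; the `classify` and `scan_environment` names are made up for illustration, and only `importlib.metadata` from the standard library is used.

```python
from importlib.metadata import distributions


def classify(has_direct: bool, has_provenance: bool) -> str:
    """Map the presence of the two files to a status label."""
    if has_direct and has_provenance:
        # The PEP text quoted above says installers MUST NOT add both.
        return "invalid: both files present"
    if has_direct:
        return "direct_url.json"
    if has_provenance:
        return "provenance_url.json"
    return "missing: no provenance recorded"


def scan_environment() -> dict[str, str]:
    """Return {distribution name: status} for the current environment."""
    report = {}
    for dist in distributions():
        # read_text() returns None when the file is absent from .dist-info.
        has_direct = dist.read_text("direct_url.json") is not None
        has_provenance = dist.read_text("provenance_url.json") is not None
        name = dist.metadata.get("Name") or "<unknown>"
        report[name] = classify(has_direct, has_provenance)
    return report


if __name__ == "__main__":
    for name, status in sorted(scan_environment().items()):
        print(f"{name}: {status}")
```

With today's installers most entries would be reported as missing, since provenance_url.json is only a proposal at this point.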

Interesting idea - I’m not that sure about using PURLs though (in general, not just in the Python ecosystem). PURL lacks proper provenance information (e.g. it cannot distinguish a patched version of urllib3 from a PyPI urllib3 release). That’s also why checking artifact hashes could be a more reliable way to determine where packages were pulled from than checking URLs and redirects (which could change).
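As a rough illustration of the hash-based check, the sketch below compares a local artifact against a mapping shaped like the `archive_info.hashes` object in provenance_url.json. The `matching_hashes` helper is hypothetical and not part of any existing tool; it relies only on `hashlib`.

```python
import hashlib
from pathlib import Path


def matching_hashes(artifact: Path, recorded: dict[str, str]) -> dict[str, bool]:
    """Recompute each recorded digest and report whether it matches."""
    data = artifact.read_bytes()
    results = {}
    for algorithm, expected in recorded.items():
        # Skip algorithms this Python build does not guarantee, and the
        # variable-length SHAKE family, whose hexdigest() needs a length.
        if algorithm not in hashlib.algorithms_guaranteed:
            continue
        if algorithm.startswith("shake"):
            continue
        digest = hashlib.new(algorithm, data).hexdigest()
        results[algorithm] = digest == expected.lower()
    return results
```

A caller would typically treat the artifact as verified only if the result is non-empty and every value is true.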

Thanks!

Interesting idea - I’m not that sure about using PURLs though (in general, not just in the Python ecosystem). PURL lacks proper provenance information (e.g. it cannot distinguish a patched version of urllib3 from a PyPI urllib3 release). That’s also why checking artifact hashes could be a more reliable way to determine where packages were pulled from than checking URLs and redirects (which could change).

The PURL use was incidental; it only showed which artifact got installed. Concretely, I meant something like this for the provenance_url.json document (under the "index_url" key, name not final):

{
  "archive_info": {
    "hashes": {
      "blake2s": "fffeaf3d0bd71dc960ca2113af890a2f2198f2466f8cd58ce4b77c1fc54601ff",
      "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f",
      "sha3_256": "c856930e0f707266d30e5b48c667a843d45e79bb30473c464e92dfa158285eab",
      "sha512": "6bad5536c30a0b2d5905318a1592948929fbac9baf3bcf2e7faeaf90f445f82bc2b656d0a89070d8a6a9395761f4793c83187bd640c64b2656a112b5be41f73d"
    }
  },
  "index_url": "https://pypi.org/simple",
  "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
}
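To make the audit use case concrete, here is a small sketch that checks whether a document like the one above came from an expected index. The "index_url" key is only a suggestion in this thread (name not final), and the mirror URL and the `from_allowed_index` helper are hypothetical.

```python
import json

# Hypothetical internal mirror; replace with the index you expect.
ALLOWED_INDEX = "https://mirror.example.internal/simple"


def from_allowed_index(provenance_json: str, allowed: str = ALLOWED_INDEX) -> bool:
    """Return True if the recorded index matches the allowed one."""
    document = json.loads(provenance_json)
    index_url = document.get("index_url")
    if index_url is None:
        # No index recorded (e.g. written by an installer without this field).
        return False
    # Compare while ignoring a trailing slash.
    return index_url.rstrip("/") == allowed.rstrip("/")
```

Treating a missing key as a failure is a conservative choice; an auditing tool might prefer to report it as "unknown" instead.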

I’m registering additional interest in this proposal, as a maintainer of pip-audit – this will (potentially) give us additional context to analyze and return to users during environmental audits.

(My interest is modulo not having reviewed the actual PEP language itself, but I will shortly!)


I just did a scan of this PEP, and this looks very useful to me! In its current form, this would allow us to make some nice progressive enhancements to the information produced by pip-audit, as well as make more conservative choices during auditing (e.g., allowing a user to skip a dependency if it indicates that it isn’t from PyPI).

I’ll second this comment by @sethmlarson:

This would obviously take a long time to propagate due to pip’s long tail (among other installing clients), but IMO is worth it.


To create a proof-of-concept for what will become possible with PEP 710-compliant installers wrt SBOM generation, I’ve created an experimental project which consumes the provenance_url.json files and generates an SPDX SBOM document: GitHub - sethmlarson/pip-sbom: Generate Software Bill-of-Materials (SBOMs) for Python environments from distribution metadata


That’s awesome! Would you mind if we add a link to it in the PEP?

Oh, I see. If there are no objections, we could add it. I don’t have a Python environment handy right now, but I wonder whether pip preserves the index URL internally and also makes it available in the cache, in the origin.json file. That is something we could potentially tackle. Thanks!

Thanks for the review and feedback!

That’s awesome! Would you mind if we add a link to it in the PEP?

You certainly can do so.


Before doing an in-depth review of the PEP I would still prefer to see a text that references the Direct URL data structure specification, as mentioned in the previous thread and above.

By doing this, producers and consumers can confidently use the same code base for direct_url.json and provenance_url.json (especially since there are ideas of having the Direct URL handling code in a library such as packaging or importlib.metadata). This would also simplify the PEP text significantly, I think.

Regarding an index_url field, I also think it is a very useful idea. The implementation in pip might be more difficult though; this needs investigating. The PEP would also need to specify what to do when the source artifact is obtained neither from an index nor from a direct URL (e.g. via pip’s --find-links option).

For extensibility and compatibility with the direct URL data structure, I would also investigate whether provenance_url.json could instead be structured like so:

{
  "direct_url": {
    "archive_info": {
      "hashes": {
        "blake2s": "fffeaf3d0bd71dc960ca2113af890a2f2198f2466f8cd58ce4b77c1fc54601ff",
        "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f",
        "sha3_256": "c856930e0f707266d30e5b48c667a843d45e79bb30473c464e92dfa158285eab",
        "sha512": "6bad5536c30a0b2d5905318a1592948929fbac9baf3bcf2e7faeaf90f445f82bc2b656d0a89070d8a6a9395761f4793c83187bd640c64b2656a112b5be41f73d"
      }
    },
    "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
  },
  "index_url": "https://pypi.org/simple"
}
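While the layout is still undecided, a consumer could read both shapes through one function. The sketch below is only an illustration: `normalize_provenance` is a made-up helper, and both the nested "direct_url" key and the "index_url" key are proposals from this thread rather than part of the PEP.

```python
from typing import Any


def normalize_provenance(document: dict[str, Any]) -> dict[str, Any]:
    """Extract url, hashes and index_url from either proposed layout."""
    nested = document.get("direct_url")
    # Nested layout: the direct URL structure lives under "direct_url".
    # Flat layout: "url" and "archive_info" sit at the top level.
    source = nested if isinstance(nested, dict) else document
    return {
        "url": source.get("url"),
        "hashes": source.get("archive_info", {}).get("hashes", {}),
        "index_url": document.get("index_url"),
    }
```

In both layouts "index_url" stays at the top level, which keeps the embedded part identical to what direct_url.json handling code already expects.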

If we go the route of using an embedded direct_url format, would it make sense to put additional restrictions on the direct_url.url value, such as “must be http[s]://” or “must not be a VCS URL”, given that provenance_url.json only covers resolved package installs? I’m not sure what’s possible to serve as a package index or mirror according to the standards.
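A sketch of what such restrictions could look like as a validation step. The rules here are only the ones floated above, not agreed wording, and the `violations` helper is hypothetical. The "vcs_info" and "dir_info" keys come from the direct URL data structure of PEP 610.

```python
from urllib.parse import urlsplit


def violations(direct_url: dict) -> list[str]:
    """List reasons why an embedded direct_url entry would be rejected."""
    problems = []
    scheme = urlsplit(direct_url.get("url", "")).scheme.lower()
    if scheme not in {"http", "https"}:
        problems.append(f"unsupported URL scheme: {scheme or '<none>'}")
    if "vcs_info" in direct_url:
        problems.append("vcs_info present: VCS references are not allowed")
    if "dir_info" in direct_url:
        problems.append("dir_info present: local directories are not allowed")
    if "archive_info" not in direct_url:
        problems.append("archive_info missing: no hashes recorded")
    return problems
```

An empty list means the entry passes; whether file:// URLs (for example from --find-links) should be accepted is left open here, as it is in the discussion.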


Given that the primary interest in this is folks wanting SBOMs to be better, would it make sense to extend this to allow for build environment information? I’m specifically thinking of the information about what packages are installed in the isolated builds that pip does.


That sounds like a great enhancement! That information would also be useful in the direct_url.json situation too. Maybe a separate proposal for capturing that build environment info would be best?


Yea, I agree. I had a passing thought based on looking at this, and admittedly it’s a thing that we can/should do independently.

Would it make sense to come up with a prototype for a programmatic interface for PEP-610 and PEP-710 as discussed here? While I understand it might be a good idea to reuse parts of PEP-610, I’m not sure how dependent we should be on that PEP. Also, if you have any pointers on how the text could be simplified, I would be thankful. I tried to keep the text as minimal as possible while keeping the necessary context, so that the PEP can be read as a standalone document with related links.

Okay, we can add it to the PEP if there are no objections. I also think it could be valuable.

If there are no cases that would block this, we can add such restrictions. Please do raise any concerns if you can think of an exception (I’m not aware of any).

If there are no objections, we can keep it out of this PEP.
