Draft PEP: Recording the origin of distributions installed from direct URL references

The rationale for the optional branch, tag, and git_ref keys is best explained in the section “Additional origin metadata available for VCS URLs” that Chris added. In a nutshell, they record additional information the installer discovered about requested_revision.

The Specification section says that branch and tag are applicable to all VCS. I think branch is repeated in the Git section to provide additional information, but that does not preclude branch from being used for other VCS.

So, IIUC, for Git for example, requested_revision is resolved into (likely exactly one of) tag, branch, or git_ref. Is this correct?

My concern is how tools (other than pip) would be able to utilise these fields. The proposed format is easy to write, but more difficult to validate and parse consistently. For example, what should happen if both branch and tag are present? In practice this means all tools would need to match pip’s behaviour, and that makes other implementations more difficult.

A better format IMO would be something like

  • resolved_revision: Store whatever is discovered at install time.
  • resolved_type: Provide context for resolved_revision. We can then define a definite list of possible values (and what each of them means for each VCS) when we implement this in pip.
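
To make the parsing difference concrete, here is a minimal sketch in Python. The function names and the ALLOWED_TYPES set are assumptions for illustration only; the draft does not define a list of allowed values, and treating the three separate keys as mutually exclusive is likewise an assumption.

```python
from typing import Optional, Tuple

# Hypothetical list of allowed types; the real list is left to be defined.
ALLOWED_TYPES = {"branch", "tag", "git_ref"}


def parse_separate_keys(info: dict) -> Optional[Tuple[str, str]]:
    """Parse separate optional keys, rejecting ambiguous combinations."""
    present = [(k, info[k]) for k in ("branch", "tag", "git_ref") if k in info]
    if len(present) > 1:
        # Every consumer has to decide what to do here; this sketch refuses.
        raise ValueError(f"ambiguous keys: {[k for k, _ in present]}")
    return present[0] if present else None


def parse_type_value(info: dict) -> Optional[Tuple[str, str]]:
    """Parse the resolved_type + resolved_revision pair."""
    if "resolved_revision" not in info:
        return None
    kind = info.get("resolved_type")
    if kind not in ALLOWED_TYPES:
        raise ValueError(f"unknown resolved_type: {kind!r}")
    return kind, info["resolved_revision"]
```

With the pair format the ambiguous case cannot be written down at all, whereas with separate keys each tool needs its own rule for it.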

There is a sentence that says “If branch is present, tag MUST not be present.” I’m not sure if Chris had cases in mind where git_ref could be combined with branch or tag.

The very important part is the mandatory commit_id (which was resolved_commit_id in a previous version). It is the one that enables the freeze use case.

branch, tag and git_ref are optional and are merely there to communicate how the installer decided to convert requested_revision to commit_id. For instance, if no revision was requested, branch would be the default branch that was selected. Or, if @HEAD was requested in the case of Git, branch would be the branch that HEAD corresponded to at installation time.

Regarding use cases exploiting direct_url.json, we have freeze for which pip will use commit_id only (because that’s what requirements.txt supports). Tools freezing to PEP 440 format would use commit_id and tag if available. There is currently no use case for branch that I know of, but I can imagine it is an enabler for new workflows.

Nevertheless, there is no obligation for installers to generate them, and tools exploiting direct_url.json must work correctly if they are absent (unless they assume a specific installer was used).
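
As a rough illustration of the freeze use case, here is a sketch that builds a pinned requirement from commit_id alone. It assumes a flat layout with top-level "url", "vcs" and "commit_id" keys, and that "url" is stored without the VCS prefix; the "vcs" key name and the freeze_line helper are assumptions, not something the draft or pip defines.

```python
def freeze_line(name: str, direct_url: dict) -> str:
    """Build a pinned direct reference from the mandatory commit_id only."""
    # Assumed flat layout: "url", "vcs" and "commit_id" at the top level.
    url = direct_url["url"]
    vcs = direct_url["vcs"]
    commit_id = direct_url["commit_id"]
    # The optional keys (branch, tag, git_ref) are deliberately not read,
    # so the result is the same whether or not the installer wrote them.
    return f"{name} @ {vcs}+{url}@{commit_id}"
```

Because only the mandatory keys are read, the function behaves identically for installers that omit the optional ones.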

Assuming commit_id (or resolved_commit_id) is preserved, why not.
Alternatively, would it be ok for you if we write more prominently that tag, branch and git_ref are mutually exclusive?

To be honest, I had concerns similar to yours, and while I have got used to the current spec for tag, branch and git_ref, I really wish @cjerdonek was available to explain that part.

I’d still prefer the type + value format since it is simpler to parse, but mutually exclusive keys would definitely work as well. And of course, commit_id must be preserved, since it is essential (as is requested_revision, for that matter) for replicating what happened at install time, no objections here.

Hi,

A gentle bump for this thread. PEP 610 is up for review (thanks @uranusjr for the review so far). It is also in need of a core dev sponsor to help move the process forward - this should not require much work as all the editorial work has been done and no controversial comment has been raised so far.

I take the chance to summarize why it is important, mainly for people who need to work with VCS requirements (of the form project @ git+https://...).

After installing such requirements, one has no way to discover that a distribution was installed from a VCS URL by inspecting the database of installed distributions. This is a problem for, e.g., pip freeze. I also suspect it requires other tools (such as poetry) to duplicate code (VCS checkout, etc.) for something that is otherwise done perfectly by pip.

The workaround people have been using is to use --editable installs for VCS requirements. This has several important drawbacks:

  • it distorts the original intent of editable installs (i.e. development installs)
  • it’s currently not supported for projects using PEP 517
  • it does not allow caching

:pray:

A few questions:

  1. if I run pip install . in several projects it would not be very useful to have "url": "." in the direct_url.json. One approach could be to canonicalize the url and store it in this field. Another could be to store the requested ref and context separately e.g. {"url": ".", "url-context": "/home/user/pkg"}. Has this already been discussed?
  2. Given that, must we specify “relative ref” path interpretation expectations for tools or separate the action somehow? Currently these seem under-specified (related discussion here).
  3. I didn’t see any encoding mentioned for direct_url.json. RFC 8259 has a few words about it in 8.1 (i.e. MUST be encoded using UTF-8). Is that enough or do we want to mention it explicitly?

@chrahunt regarding encoding, I’ll clarify that it’s UTF-8, for the avoidance of doubt.
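
For consumers, that would mean decoding the file explicitly rather than relying on the platform default. A minimal sketch, where load_direct_url is a hypothetical helper name and the file is assumed to sit directly in the .dist-info directory:

```python
import json
from pathlib import Path
from typing import Optional


def load_direct_url(dist_info_dir: str) -> Optional[dict]:
    """Return the parsed direct_url.json, or None if the file is absent."""
    path = Path(dist_info_dir) / "direct_url.json"
    if not path.is_file():
        return None
    # Decode explicitly as UTF-8 rather than relying on the locale default.
    return json.loads(path.read_text(encoding="utf-8"))
```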

Regarding relative URLs, I’d wait for the outcome of the discussion in your original thread to decide anything here.

The pip install . case (in general when providing top level requirements as local paths) can be solved with canonicalized URLs.

I’m concerned with the use of relative paths in install_requires. If allowed, that sounds quite complex, allowing e.g. sdists inside sdists, and such references cannot be converted to URLs in frozen requirements or a lockfile.

(Arriving from https://github.com/pypa/packaging-problems/issues/256#issuecomment-572453850)

I’m happy to sponsor this PEP (I didn’t realise you were having trouble getting in touch with Chris).

For local installs, I think we should store them as absolute file URLs, without implicitly resolving symlinks.
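
A minimal sketch of what “absolute, but symlinks left alone” could look like in Python. The helper name local_path_to_url is hypothetical, and this is only one way to do it, not what any installer currently does:

```python
import os
from pathlib import Path


def local_path_to_url(path: str) -> str:
    """Return an absolute file: URL for path without resolving symlinks."""
    # os.path.abspath makes the path absolute and collapses "." and ".."
    # lexically; unlike Path.resolve() it does not follow symlinks.
    return Path(os.path.abspath(path)).as_uri()
```

So pip install . run from a symlinked checkout would record the symlink's own path, not its target.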

For the VCS resolution information, perhaps it would be clearer to store that in a “vcs_info” subdictionary, rather than having it as top-level with the other keys? The extracted VCS type could also move down into that.

Similarly, the hash information could move into an “archive_info” subdict, applicable when an sdist or wheel is supplied directly. I don’t think this information should be present when the URL is a local directory.

Finally, coming from the editable mode discussion thread, I believe that information could also go in this file as a “dir_info” subdict, with an “editable” key. The permitted values would match the “root is purelib” setting in the wheel metadata file.

@ncoghlan That is excellent news. Thank you!

The subdictionaries approach looks good.

Ok with absolute URLs. Could you elaborate a bit on your reasoning about symlinks?

Ok with flagging editable installs in dir_info. In that case the URL would be understood as pointing to the project location, i.e. the local directory where pyproject.toml or setup.py resides.
I’m not sure about “root is purelib”, though. Can you elaborate on that part?

I’d also add a _version field (as in version of the spec) to allow for evolution.

Regarding the process, shall I do a PR to python/peps with the various improvements following the comments collected here?

If only the resolved path is stored, then there’s no way to reconstruct the actual installation command that was executed, and tools would potentially need extra options to say whether or not to resolve local symlinks when generating the metadata.

By contrast, if the unresolved form is stored at the top level, then developers can choose to store the resolved form by running pip install $(readlink $PATH_TO_INSTALL) instead of pip install $PATH_TO_INSTALL.

That said, it may also make sense to store the fully resolved path in both archive_info and dir_info, so tools can detect when symlinks have changed since a package was installed.

Ignore that part and just make the editable key a JSON boolean (I forgot that the wheel metadata was defined as email header style key-value pairs, so everything’s a string, even boolean fields).
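
Putting the subdictionary idea and the boolean editable key together, the three shapes might look roughly like this. All values are placeholders, the "hash" key name and its "sha256=..." form are assumptions, and requiring exactly one of the three subdictionaries is also an assumption made for this sketch.

```python
import json

# Placeholder values throughout; only the overall shape is the point.
vcs_example = {
    "url": "https://example.com/project.git",
    "vcs_info": {
        "vcs": "git",
        "requested_revision": "main",
        "commit_id": "0123456789abcdef0123456789abcdef01234567",
    },
}

archive_example = {
    "url": "https://example.com/project-1.0.tar.gz",
    # "hash" is an assumed key name and "<hex digest>" is a placeholder.
    "archive_info": {"hash": "sha256=<hex digest>"},
}

dir_example = {
    "url": "file:///home/user/project",
    "dir_info": {"editable": True},  # serialized as the JSON boolean true
}


def origin_kind(direct_url: dict) -> str:
    """Return which of the three subdictionaries is present."""
    kinds = [k for k in ("vcs_info", "archive_info", "dir_info") if k in direct_url]
    if len(kinds) != 1:
        raise ValueError(f"expected exactly one info key, found {kinds}")
    return kinds[0]


print(json.dumps(dir_example))
```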

Don’t add that, for the same reasons we didn’t add it to pyproject.toml: https://www.python.org/dev/peps/pep-0518/#a-semantic-version-key

We should strive really hard to avoid incompatible changes in our metadata, and instead evolve them through new optional fields that have sensible defaults. If we ever run into a problem that forces a change to an existing field, then we can add the metadata version field then, with an implied default of “1.0”, and the new field opting in to “2.0”.

Yep, that’s a good way to do it.

I opened a PR with improvements following the comments collected so far.

The PR was merged. @uranusjr @chrahunt @ncoghlan let me know if I correctly handled your comments and concerns.

Structurally, I think the key thing remaining to do is to nominate a long-lived URL under “packaging.python.org/specifications” where this spec will live, and change the references to “amend this PEP” to instead say “write a PEP to amend this specification”.

At a design level, you identified something I had missed with editable installs, which is that they may be specified in more detail at installation time than just “install from this existing local path”.

Relatedly, for local directories that happen to be VCS directories, it may be useful to take a “snapshot” of the VCS state at the time of installation.

However, I think there’s enough additional complexity in trying to track that as part of the “dir_info” struct that it makes sense to defer tackling that to a possible future iteration (e.g. for editable installs, snapshots of discovered VCS info would likely be better created by “pip freeze” itself, whereas for regular installs, the VCS or archive info from the time of installation would be more relevant. For local directories that happen to be VCS directories, it would likely make sense to extract relevant remote repo information).

For now, I think we can safely duck all those questions by saying “The initial iteration of the PEP does not attempt to make environments that include editable installs or installs from local directories reproducible, but it does attempt to make them readily identifiable”.

A quick reality check here - with that qualification, does the PEP (still) address the requirement that originally started this whole discussion (back in this topic)? It’s not obvious to me that it does (because reproducibility seems to be relevant to that use case).

There’s been a lot of water under the bridge since that original post, and it’s worth checking that we’re still solving the problem we started with.

It does. There were two possible approaches to achieve the original goal (namely, making pip freeze work in the presence of VCS requirements): 1/ make editable work with PEP 517, or 2/ not use editable for that purpose and write a new spec. Along the way came a better understanding that using editable for the purpose of making pip freeze work correctly with VCS requirements is a distortion of what editable installs are meant for (namely installing a local directory in development mode). So what we have here is a much cleaner and more robust approach that does not require editables at all for that purpose.

The latest version of PEP 610 also achieves another, independent goal, which is making the project location of “modern” [1] editable installs identifiable.

[1] By “modern” I mean editable installs that also provide .dist-info metadata, which is the only assumption we are making about them so far, irrespective of the mechanism used to install them.

I can add that to the spec. Although, as you noted, discovering the location of the local project directory is sufficient for known use cases of editable installs.

I added the last suggestions from @ncoghlan at https://github.com/python/peps/pull/1282 and https://github.com/pypa/packaging.python.org/pull/690.

For people interested in testing this, a proposed implementation in pip is available at https://github.com/pypa/pip/pull/7612.

While testing the implementation I found one minor issue with the stripping of user:password information from URLs. In the case of a requirement such as packaging @ git+ssh://git@github.com/pypa/packaging stripping the git@ part does not improve security and makes pip freeze produce a non-working requirement, providing a sub-optimal UX.

So I propose to add the following language in the PEP:

Additionally, the user:password section of the URL MAY be composed of well-known, non security sensitive strings, such as git in the case of URLs such as ssh://git@gitlab.com.
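
A sketch of how an installer might apply this when writing the URL. The WELL_KNOWN_USERS allow-list and the redact_credentials name are assumptions; the proposed wording only gives git as an example and does not enumerate the allowed strings.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed allow-list; the proposed wording only mentions "git" as an example.
WELL_KNOWN_USERS = {"git"}


def redact_credentials(url: str) -> str:
    """Strip user:password from url unless it is a well-known user name."""
    parts = urlsplit(url)
    userinfo, sep, hostport = parts.netloc.rpartition("@")
    if not sep:
        return url  # no credentials present
    if userinfo in WELL_KNOWN_USERS:
        return url  # e.g. ssh://git@gitlab.com, nothing sensitive to hide
    # Anything else (including "user:password") is dropped.
    return urlunsplit(parts._replace(netloc=hostport))
```

With this, git+ssh://git@github.com/pypa/packaging round-trips unchanged, so a frozen requirement built from it still works.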

I made a last PR including my previous remark and a small correction.

To the best of my knowledge all review comments have been handled.

@ncoghlan, all, do you think it is ready to move forward?

Aye, I think so, which means the next step would be to ask @pf_moore as the default BDFL-Delegate for packaging PEPs if he’d like to handle this PEP himself, or if he’d prefer to appoint another volunteer to handle it.