PEP 610: Recording the origin of distributions installed from direct URL references

cjerdonek · May 18, 2019, 9:40pm

Okay, here is my more detailed feedback on the PEP after going through the draft in more detail. Thanks for your patience, @sbidoul. Overall I like the PEP (so my comments are mostly minor). I think it will be very useful and fills an important gap. I also offered to @sbidoul to be the sponsor, and he agreed.

My comments:

This proposal defines additional metadata, to be added to the installed
distribution by the installation front end, which records the direct
reference for use by consumers which introspect the database of installed
packages (see PEP 376).

Going along with the idea of making sure that each packaging PEP has good human-friendly names for each concept being introduced (e.g. so people don’t have to use the PEP number, cf. the comment / thread here), I think it would be good to choose explicit names for perhaps (1) the metadata as a whole being defined, and (2) the types of origins that the metadata can include. This way people can talk more easily about this metadata. If this is done, in addition to including them in the abstract, it might also be good to include one of these names in the title.

url MUST be stripped of any sensitive authentication information,
for security reasons. The user:password section of the URL MAY however
be composed of environment variables, matching the following regular
expression::

On the authentication information, would this mean the environment variable names have to match the names that were used in the original invocation? Also, if authentication info was present in the URL but environment variables weren’t used, should there still be some indicator that authentication info was used / needed? (There is also the separate possibility that auth info was required and used, but not via the URL, e.g. from a user prompt. I’m not sure if that should also be reflected somehow.)
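A minimal sketch of the stripping rule quoted above may help frame these questions. The regular expression itself is not reproduced in this post, so `ENV_VAR_RE` below is an assumed pattern (`${NAME}`, optionally followed by `:${NAME}`), and `sanitize_url` is a hypothetical helper rather than anything pip implements.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed pattern: "${USER}" or "${USER}:${PASSWORD}". The draft's actual
# regular expression is not quoted in this thread.
ENV_VAR_RE = re.compile(r"\$\{[A-Za-z0-9_-]+\}(:\$\{[A-Za-z0-9_-]+\})?")


def sanitize_url(url: str) -> str:
    """Drop the user:password part unless it only references env variables."""
    parts = urlsplit(url)
    userinfo, sep, hostinfo = parts.netloc.rpartition("@")
    if not sep:
        return url  # no credentials in the URL
    if ENV_VAR_RE.fullmatch(userinfo):
        return url  # placeholders only, nothing sensitive to strip
    return urlunsplit(parts._replace(netloc=hostinfo))
```

Note that this sketch leaves no trace that credentials were present once they are stripped, which is exactly the gap the second question points at.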

A vcs field MUST be present, containing the name of the VCS (i.e. one
of git, hg, bzr, svn). Other VCS SHOULD be registered by amending this PEP.

I think it would be good to include a subsection under the “Specification” section enumerating the registered VCS’s. The subsection can be called something like “Registered version control systems.” For each VCS, I think it should include the full name of the VCS (e.g. “Mercurial” as opposed to hg), a link to the VCS’s website, the VCS’s command name, and the string key that should be used for the vcs field. These sections can also include any additional info or fields specific to that particular VCS.
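As a rough illustration of what such a registry could carry per VCS, here is a sketch covering the four systems named in the draft. The `RegisteredVcs` class and its field names are hypothetical; website links are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredVcs:
    full_name: str  # human-readable name, e.g. "Mercurial"
    command: str    # command-line executable
    key: str        # string stored in the "vcs" field


# The four systems listed in the quoted draft text.
REGISTERED_VCS = {
    v.key: v
    for v in (
        RegisteredVcs("Git", "git", "git"),
        RegisteredVcs("Mercurial", "hg", "hg"),
        RegisteredVcs("Bazaar", "bzr", "bzr"),
        RegisteredVcs("Subversion", "svn", "svn"),
    )
}


def is_registered(vcs_key: str) -> bool:
    """Tell whether a "vcs" value names one of the registered systems."""
    return vcs_key in REGISTERED_VCS
```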

I would also clarify explicitly whether it’s okay to use VCS’s that aren’t registered, as well as whether it’s okay to use fields that aren’t named in the PEP (for registered VCS’s or unregistered VCS’s). Like, maybe only for unregistered VCS’s, it would be okay to use new fields in case the VCS requires concepts not contemplated by the PEP.

A resolved_commit_id field MUST be present, containing the exact
commit/revision number that was installed.

It might be worth clarifying the type of this value. Should it always be a string, or should it be an integer in the case where the commit ID is a number?

This may require a little extra work / research, but I think it would be worth including a very brief informational subsection on each VCS. The purpose could be to help explain how the fields in the PEP map to the terminology and behavior for that VCS. For example, for Mercurial it could say that resolved_commit_id should be the Mercurial changeset ID rather than a Mercurial revision number. For Subversion, it could say that commit hashes aren’t available, and that the revisions are integers, etc. This would help because not everyone knows about all VCS’s. It would also help for documenting the extent to which each VCS supports requesting individual revisions, and how it does so.

One suggestion I made before that I’ll elaborate on here is that I think a Git-specific resolved-ref field should also be defined (as “MAY” be present). This would be a string that records the Git ref that the given revision string matches, if any (or some value like the empty string if the requested revision was confirmed not to correspond to a ref). This is useful because it would let one know if the requested revision is one of a branch, a tag, or a ref, and which one (based on whether the ref starts with refs/heads/, refs/tags/, or something else, respectively). pip currently implements this detection logic here, but currently it only uses it to know if a branch was requested.
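The classification described above could look roughly like this. `classify_git_ref` is a hypothetical helper, and the returned labels are illustrative only; it assumes the empty string means the revision was confirmed not to match any ref.

```python
def classify_git_ref(resolved_ref: str) -> str:
    """Return "branch", "tag", "other" or "none" for a full Git ref name."""
    if not resolved_ref:
        return "none"  # revision confirmed not to correspond to a ref
    if resolved_ref.startswith("refs/heads/"):
        return "branch"
    if resolved_ref.startswith("refs/tags/"):
        return "tag"
    return "other"  # some other kind of ref
```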

One standards-driven reason for including this information is that it would let a front-end use the @<tag>#<commit-hash> form described in PEP 440 as opposed to just the @<commit-hash> form:

To handle version control systems that do not support including commit or tag references directly in the URL, that information may be appended to the end of the URL using the @<commit-hash> or the @<tag>#<commit-hash> notation.

This is useful because one can’t tell from a commit hash alone what is being installed, but a tag (like a version number) does give this information (and a tag is semantically immutable).
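To make the two notations concrete, here is a small sketch that builds either form. `vcs_reference` is a hypothetical helper and the example URL is made up.

```python
from typing import Optional


def vcs_reference(url: str, commit_hash: str, tag: Optional[str] = None) -> str:
    """Append @<tag>#<commit-hash> when a tag is known, else @<commit-hash>."""
    if tag:
        return f"{url}@{tag}#{commit_hash}"
    return f"{url}@{commit_hash}"


# With a recorded tag the reference stays readable; without one,
# only the opaque hash is left.
print(vcs_reference("git+https://example.com/app.git", "abc123", tag="1.0"))
print(vcs_reference("git+https://example.com/app.git", "abc123"))
```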

Commands that generate a direct_url.json:
…
* pip install ./app
* pip install file:///home/user/app

I haven’t thought through what I think the right answer should be here, but I think the behavior should be clarified for installing from directories that also happen to be VCS repositories (but aren’t invoked as such). For example, in this case should the VCS info be recorded? Also, what if the commit ID can be detected, but the working tree is “dirty” or contains extra files? My instinct is that VCS origin info should only be included in the metadata if the directory is being installed via a genuine VCS URL as opposed to simply a path, so that e.g. things like dirty working trees wouldn’t affect the install.

sbidoul · May 19, 2019, 11:26am

@cjerdonek thanks a lot for sponsoring this!

I’m not sure what to do with that part. Do you have concrete suggestions?
Shall I add a sentence in the abstract such as “This PEP and the metadata it specifies should be referred to informally as Direct URL Origin”?

Yes. I’ll clarify that.

I don’t think that’s necessary. More precisely, I don’t see how tools would use it. I’d say we can introduce it later if/when a use case pops up.

Ok, I’ll do that.

Good point, I’ll clarify it must be a string.

I still have doubts about this one, mainly because I don’t visualize the use cases.
Producing a PEP 440 url is feasible by combining the requested revision with the resolved_commit_id as @<revision>#<resolved_commit_id>. It’s true PEP 440 mentions <tag> yet I don’t think it meant to exclude branches or any other kind of revision specifier. If revision was itself provided as a commit id, resolving it to a ref is possibly ambiguous.

In which case would a tool use resolved-ref over revision?

I agree. I’ll make that explicit in the spec.

sbidoul · May 19, 2019, 1:01pm

New version at 5546deec, and original post updated.

ncoghlan · June 17, 2019, 12:49pm

I see @cjerdonek has handled the PEP sponsorship question (hooray for the increasing number of PyPA folks that are also core developers).

That said, the question does highlight the fact that when we added PEP sponsorship to PEP 1, we didn’t consider how it might interact with standing delegations, so it’s currently silent on that topic.

cjerdonek · August 13, 2019, 6:00am

Hi folks, as the sponsor of Stéphane’s (@sbidoul’s) PEP, I just want to say I’m now deeming it ready for submission, and Stéphane will be submitting a PR to the PEP’s repo shortly (I’m guessing within a couple days). He’ll be posting to this thread once he’s done so. I believe his branch is current with what he’ll be submitting: https://github.com/sbidoul/peps/tree/source_url-sbi

I want to thank Stéphane for his patience because it took me some time to get back to him on a number of occasions. I was communicating privately with him on some changes, which is what was happening in the meantime. The changes since then I would say are still on the minor side, which is part of why I’m deeming it ready now. Another reason is that there weren’t any outstanding concerns or objections that I can remember from before.

Finally, we both agreed to add myself as a co-author after working on new wording, which is why I’ll be listed in the Author field rather than the Sponsor field. Thanks, Stéphane, for your continued work on this!

sbidoul · August 13, 2019, 12:10pm

Thanks Chris for your support and work on this matter.

I submitted the PR at https://github.com/python/peps/pull/1145

sbidoul · November 15, 2019, 6:00pm

Hi packaging experts,

The writing and editorial work is done, the draft PEP was assigned number 610, and is ready for review.

Unfortunately it would seem that the PEP is in need of a new sponsor to finalize the process, as @cjerdonek seems to have been unavailable for a while now (unless he chimes in, of course; I certainly appreciate the work Chris has done on this so far). So this post also serves as a call for a sponsor. This should not require much work (if any at all) at this stage.

Looking forward to reading any feedback/comment/questions.

uranusjr · November 15, 2019, 7:26pm

What is the use case for the tag key? The PEP did not mention one. From what I can tell, there are two necessary fields in the VCS case, requested_revision and commit_id. The former records the ref the user used to specify the requirement, while the latter records the actual revision that requested_revision was resolved into at install time. A special case is made for Git, where branch is used to record the branch name when no branch is specified (BTW why wasn’t the same mentioned for Mercurial?), but tag doesn’t have the same default/implicitly requested scenario, and would match requested_revision and become superfluous in all scenarios I can think of.

sbidoul · November 15, 2019, 9:40pm

The rationale for the optional branch, tag, and git_ref keys is best explained in the section “Additional origin metadata available for VCS URLs” that Chris added. In a nutshell, they record additional information the installer discovered about requested_revision.

The Specification section says that branch and tag are applicable to all VCSs. I think branch is repeated in the Git section to provide additional information, but that does not preclude branch from being used for other VCSs.

uranusjr · November 16, 2019, 7:01am

So, IIUC, for Git for example, requested_revision is resolved into (likely exactly one of) tag, branch, or git_ref. Is this correct?

My concern is how tools (other than pip) would be able to utilise these fields. The proposed format is easy to write, but more difficult to validate and parse consistently. For example, what should happen if both branch and tag are present? In practice this means all tools would need to match pip’s behaviour, and that makes other implementations more difficult.

A better format IMO would be something like

resolved_revision: Store whatever is discovered at install time.
resolved_type: Provide context for resolved_revision. We can then define a definite list of possible values (and what each of them mean for each VCS) when we implement this in pip.
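To illustrate the parsing difference, here is a sketch of a reader for each layout. Both function names are hypothetical, and raising an error on conflicting keys is just one possible policy; the draft does not say what a consumer should do in that case.

```python
from typing import Optional, Tuple

EXCLUSIVE_KEYS = ("branch", "tag", "git_ref")


def resolved_from_keys(info: dict) -> Optional[Tuple[str, str]]:
    """Separate-keys layout: the reader must check that at most one key is set."""
    present = [(key, info[key]) for key in EXCLUSIVE_KEYS if key in info]
    if len(present) > 1:
        # Assumed policy: reject the record rather than guess.
        raise ValueError(f"conflicting keys: {[key for key, _ in present]}")
    return present[0] if present else None


def resolved_from_pair(info: dict) -> Optional[Tuple[str, str]]:
    """Type + value layout: a single lookup, nothing to cross-check."""
    if "resolved_revision" not in info:
        return None
    return info["resolved_type"], info["resolved_revision"]
```

Both return the same `(type, value)` pair for well-formed input; only the first has an invalid state to handle.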

sbidoul · November 16, 2019, 10:10am

There is a sentence that says “If branch is present, tag MUST not be present.” I’m not sure if Chris had cases in mind where git_ref could be combined with branch or tag.

The very important part is the mandatory commit_id (which was resolved_commit_id in a previous version). It is the one that enables the freeze use case.

branch, tag and git_ref are optional and are merely there to communicate how the installer decided to convert requested_revision to commit_id. For instance if no revision was requested branch would be the default branch that was selected. Or if @HEAD was requested in the case of Git, which branch it corresponded to at installation time.

Regarding use cases exploiting direct_url.json, we have freeze for which pip will use commit_id only (because that’s what requirements.txt supports). Tools freezing to PEP 440 format would use commit_id and tag if available. There is currently no use case for branch that I know of, but I can imagine it is an enabler for new workflows.

Nevertheless there is no obligation for installers to generate them, and tools exploiting direct_url.json must work correctly if they are absent (unless they assume a specific installer was used).
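A minimal sketch of the freeze use case, relying only on the mandatory commit_id and ignoring the optional keys. It assumes the flat layout discussed at this point in the thread (top-level vcs, url and commit_id) and that the frozen line takes the `name @ vcs+url@commit` shape; `freeze_line` and the example URL are hypothetical.

```python
import json


def freeze_line(name: str, direct_url_json: str) -> str:
    """Build a pinned requirement line from the contents of direct_url.json."""
    data = json.loads(direct_url_json)
    # Only the mandatory fields are read, so the result is the same
    # whether or not branch, tag or git_ref were recorded.
    return f"{name} @ {data['vcs']}+{data['url']}@{data['commit_id']}"


record = json.dumps(
    {
        "url": "https://example.com/app.git",
        "vcs": "git",
        "requested_revision": "main",
        "commit_id": "abc123",
        "branch": "main",  # optional, ignored by freeze_line
    }
)
print(freeze_line("app", record))
```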

Assuming commit_id (or resolved_commit_id) is preserved, why not.
Alternatively, would it be ok for you if we write more prominently that tag, branch and git_ref are mutually exclusive?

To be honest I had similar concerns as yours and while I came to get used to the current spec for tag, branch and git_ref, I really wish @cjerdonek was available to explain that part.

uranusjr · November 16, 2019, 3:30pm

I’d still prefer the type + value format since it is simpler to parse, but mutually exclusive keys would definitely work as well. And of course, commit_id must be preserved, since it is essential (also requested_revision for that matter) for replicating what happened at install time, no objections here.

sbidoul · December 7, 2019, 12:29pm

Hi,

A gentle bump for this thread. PEP 610 is up for review (thanks @uranusjr for the review so far). It is also in need of a core dev sponsor to help move the process forward. This should not require much work, as all the editorial work has been done and no controversial comment has been raised so far.

I take the chance to summarize why it is important, mainly for people who need to work with VCS requirements (of the form project @ git+https://...).

After installing such requirements, one has no way to discover that a distribution was installed from a VCS url by inspecting the database of installed distributions. This is a problem for, e.g. pip freeze. I also suspect it requires other tools (such as poetry) to duplicate code (VCS checkout, etc) for something that is otherwise done perfectly by pip.
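For illustration, this is roughly how a tool could discover the origin once the file is recorded in the installed distribution's metadata directory. It is a sketch using the standard library's `importlib.metadata`; `direct_url_of` is a hypothetical helper.

```python
import json
from importlib import metadata
from typing import Optional


def direct_url_of(dist: metadata.Distribution) -> Optional[dict]:
    """Return the parsed direct_url.json of a distribution, or None if absent."""
    text = dist.read_text("direct_url.json")  # None when the file is missing
    return json.loads(text) if text is not None else None
```

A caller would pass an installed distribution, e.g. `direct_url_of(metadata.distribution("some-project"))`, and treat `None` as "not installed from a direct URL, or installed by a tool that does not record it".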

The workaround people have been using is to use --editable installs for VCS requirements. This has several important drawbacks:

* it distorts the original intent of editable installs (i.e. development installs)
* it’s currently not supported for projects using PEP 517
* it does not allow caching

chrahunt · December 9, 2019, 5:28pm

A few questions:

* If I run pip install . in several projects it would not be very useful to have "url": "." in the direct_url.json. One approach could be to canonicalize the url and store it in this field. Another could be to store the requested ref and context separately e.g. {"url": ".", "url-context": "/home/user/pkg"}. Has this already been discussed?
* Given that, must we specify “relative ref” path interpretation expectations for tools or separate the action somehow? Currently these seem under-specified (related discussion here).
* I didn’t see any encoding mentioned for direct_url.json. RFC 8259 has a few words about it in 8.1 (i.e. MUST be encoded using UTF-8). Is that enough or do we want to mention it explicitly?

sbidoul · December 10, 2019, 9:18pm

@chrahunt regarding encoding, I’ll clarify that it’s UTF-8, for the avoidance of doubt.

Regarding relative URLs, I’d wait for the outcome of the discussion in your original thread to decide anything here.

The pip install . case (in general when providing top level requirements as local paths) can be solved with canonicalized URLs.

I’m concerned with the use of relative paths in install_requires. If allowed, that sounds quite complex, allowing e.g. sdists inside sdists, and such references cannot be converted to URLs in frozen requirements or a lockfile.

ncoghlan · January 9, 2020, 11:45pm

(Arriving from https://github.com/pypa/packaging-problems/issues/256#issuecomment-572453850)

I’m happy to sponsor this PEP (I didn’t realise you were having trouble getting in touch with Chris).

For local installs, I think we should store them as absolute file URLs, without implicitly resolving symlinks.
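One way to get an absolute file URL without following symlinks, as a sketch: `local_path_to_url` is a hypothetical helper built on the standard library.

```python
import os
from pathlib import Path


def local_path_to_url(path: str) -> str:
    """Turn a local path such as "." into an absolute file:// URL."""
    # os.path.abspath makes the path absolute and normalizes it lexically;
    # unlike os.path.realpath it does not follow symlinks.
    return Path(os.path.abspath(path)).as_uri()


print(local_path_to_url("."))
```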

For the VCS resolution information, perhaps it would be clearer to store that in a “vcs_info” subdictionary, rather than having it as top-level with the other keys? The extracted VCS type could also move down into that.

Similarly, the hash information could move into an “archive_info” subdict, applicable when an sdist or wheel is supplied directly. I don’t think this information should be present when the URL is a local directory.

Finally, coming from the editable mode discussion thread, I believe that information could also go in this file as a “dir_info” subdict, with an “editable” key. The permitted values would match the “root is purelib” setting in the wheel metadata file.

sbidoul · January 10, 2020, 2:07pm

@ncoghlan That is excellent news. Thank you!

The subdictionaries approach looks good.

Ok with absolute URLs. Could you elaborate a bit on your reasoning about symlinks?

Ok with flagging editable installs in dir_info. In that case the URL would be understood as pointing to the project location, i.e. the local directory where pyproject.toml or setup.py resides.
I’m not sure about “root is purelib”, though. Can you elaborate on that part?

I’d also add a _version field (as in version of the spec) to allow for evolution.

Regarding the process, shall I do a PR to python/peps with the various improvements following the comments collected here?

ncoghlan · January 12, 2020, 4:27am

If only the resolved path is stored, then there’s no way to reconstruct the actual installation command that was executed, and tools would potentially need extra options to say whether or not to resolve local symlinks when generating the metadata.

By contrast, if the unresolved form is stored at the top level, then developers can choose to store the resolved form by running pip install $(readlink $PATH_TO_INSTALL) instead of pip install $PATH_TO_INSTALL.

That said, it may also make sense to store the fully resolved path in both archive_info and dir_info, so tools can detect when symlinks have changed since a package was installed.

Ignore that part and just make the editable key a JSON boolean (I forgot that the wheel metadata was defined as email header style key-value pairs, so everything’s a string, even boolean fields)
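Putting the subdictionary idea and the boolean editable key together, the records might look something like the sketch below. The inner key names reuse those mentioned earlier in the thread and are assumptions about the final layout; archive_info is omitted because its keys have not been spelled out here. The URLs are made up.

```python
import json

# VCS install: resolution details move under "vcs_info".
vcs_example = {
    "url": "https://example.com/app.git",
    "vcs_info": {
        "vcs": "git",
        "requested_revision": "main",
        "commit_id": "abc123",
    },
}

# Local directory install: "editable" is a real JSON boolean, not a string.
dir_example = {
    "url": "file:///home/user/app",
    "dir_info": {"editable": True},
}


def serialize(record: dict) -> bytes:
    """Encode a record as UTF-8 JSON, as it would be written to direct_url.json."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


print(serialize(dir_example).decode("utf-8"))
```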

Don’t add that, for the same reasons we didn’t add it to pyproject.toml (see PEP 518, “Specifying Minimum Build System Requirements for Python Projects”).

We should strive really hard to avoid incompatible changes in our metadata, and instead evolve them through new optional fields that have sensible defaults. If we ever run into a problem that forces a change to an existing field, then we can add the metadata version field then, with an implied default of “1.0”, and the new field opting in to “2.0”
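In reader code, that strategy amounts to treating a missing version field as the original format. A tiny sketch, where the field name `metadata_version` is purely hypothetical since no such field is defined:

```python
def spec_version(record: dict) -> str:
    """Return the record's format version, defaulting to the implied "1.0"."""
    # "metadata_version" is a made-up name for a field that would only be
    # introduced if an incompatible change ever became necessary.
    return record.get("metadata_version", "1.0")
```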

Yep, that’s a good way to do it.

sbidoul · January 15, 2020, 10:49pm

I opened a PR with improvements following the comments collected so far.

sbidoul · January 17, 2020, 10:49am

The PR was merged. @uranusjr @chrahunt @ncoghlan let me know if I correctly handled your comments and concerns.