Draft PEP: Recording the origin of distributions installed from direct URL references

I have integrated all remarks so far at https://github.com/sbidoul/peps/blob/source_url-sbi/pep-9999.rst .

I haven’t had time to provide detailed feedback on this proposal. But rather than wait longer, I’ll provide some feedback now – though not very detailed.

Since each VCS has its own options and peculiarities and ways of specifying and doing things, I think that – except for things like url and subdirectory that apply obviously to all VCS’s – more specific fields should be defined separately for each VCS.

For Git, instead of ref which should normally only be a name that starts with “refs/”, I think it would make sense to have a field called something like revision or requested-revision for the requested revision / “commit-ish” / “committish” (example values include things like “master”, “1.0”, “refs/pull/6447/head”, a full or partial sha, etc.), and a separate field called something like resolved-ref for what ref the requested revision resolved to, if any (e.g. “refs/heads/master”, “refs/tags/1.0”, “refs/pull/6447/head”, etc.). The latter tells you in particular whether the requested revision was a branch or tag or some other type of ref, based on whether it begins with “refs/heads/” or “refs/tags/”. pip already does this resolution. I haven’t thought through whether it would make sense, but maybe the spec should also spell out what happens if the reference is ambiguous (i.e. if the revision corresponds to more than one ref), and also be able to indicate if the requested revision doesn’t correspond to any ref.

To provide further evidence that VCS-specific fields make sense, a future possible field (but not for this PEP) that applies to Git but wouldn’t to other VCS’s is a list of what submodules were fetched prior to installing: https://github.com/pypa/pip/issues/6374

I also feel like commit-hash should be renamed to something like commit-id, and then say separately for each VCS what that field means / what value the field should take on. To me it seems like the more important and useful property of this field is that it be the way to unambiguously identify a commit for that VCS. This would let us support things like central SVN repositories where a revision number is sufficient to unambiguously identify a commit, and that’s the best you can do for that VCS (as there is no notion of a commit hash).

By the way, it looks like the phrase “Draft PEP” / “Proposed PEP” which I suggested before was dropped from the title of this thread.

No time to do anything like a proper review, but I wanted to comment on the following point:

Rather than referring to the capabilities of existing installers, it would be better to link this proposal back to existing standards.

In this case the relevant standard is PEP 440, which you mention in terms of the direct reference spec, but which also makes the comments:

Distributions are identified by a public version identifier which supports all defined version comparison operations

(in “Version Scheme”) and

Some automated tools may permit the use of a direct reference as an alternative to a normal version specifier. A direct reference consists of the specifier @ and an explicit URL.

(in “Direct References”).

Combining these two, which essentially define exactly how a user can specify a project to an installation tool it makes sense to me to describe the context of this proposal without referring to specific tools like pip or their features, something along the lines of:

Following PEP 440, a distribution can be identified by a name and either a version, or a direct reference (add appropriate links or explanations here). The name and version are captured in the project metadata, but currently there is no way to obtain details of the URL used when the distribution was identified by a direct reference. This proposal defines additional metadata, to be added to the installed distribution by the installation front end, which records the direct reference URL for use by consumers which introspect the database of installed packages (see PEP 376).

With this framing, pip freeze simply becomes the motivating example of a consumer that needs that information stored in the database of installed packages.

I’m fine with the wording you propose which is indeed a nice way to frame the proposal.

Would you prefer to see your paragraph in the abstract, and keep the motivation section using examples as it is now? Or rather shrink the motivation section to remove or tone down references to capabilities of existing installers and the freeze use case?

ref is indeed meant to encode the reference/revision that was requested, while commit-hash is meant to be the resolved ref. I’ll see to clarify the wording in that respect (see proposal below).

The use of ref was inspired mainly

  • by PEP 440 direct references (and pip URL references) which do not distinguish between tags and branches in their URL format
  • Ruby Bundler which uses ref
  • Pipfile

So it is sufficient to encode all types of direct references known today.

About submodules, this can indeed be covered later by future specs which can extend direct_url.json.

I would not rename ref to requested-revision because, like url, it is implied by the spec that it is the requested one.

The spec does not mandate any specific use of the new metadata so I suppose this part can be left out of this PEP?

commit-id is indeed more generic and covers SVN.

Tools would then need to combine the knowledge of VCS type plus the presence of commit-id to decide if the reference refers to an immutable version of the code, not just the presence of commit-hash. That is fine with me.

So I could to update the spec to replace commit-hash by resolved-commit-id and say that VCS that support hash based commit references MUST use it in that field.

Assuming the updated text above (resolved-commit-id), do you have specific use cases in mind for an additional resolved-ref field?

@cjerdonek I updated the Specification section to talk about resolved-commit-id. Let me know your thoughts.

@pf_moore I used your text in the abstract. So far I left the Motivation section untouched, as concrete examples known today. Let me know if you think this section needs updating too (or shrinking as it may be too obvious?).

Enjoy pycon.

Hi,

I would like to progress with implementation of pip#609.

Are there any additional comments on this specification, or suggestions on the best way to move forward?

You need a sponsor to get this registered as an actual PEP, and then it needs to be reviewed, and ultimately signed off by an appropriate BDFL-Delegate (which would likely be me, as this is a packaging interoperability PEP).

You should keep the PEP text in the initial post here up to date with the master version - at the moment it seems like it’s drifted. And ideally, a formatted version posted here would be easier to read.

At the moment, you’ve had a couple of comments on the proposal, and nothing particularly negative, but there’s been no really positive support, either, so I’d question how useful people actually find this. The pip issue has been round since 2012, so it’s not exactly a showstopper…

Also, I’m a little unclear as to how much this will “lock in” VCS-specific URL formats into the interop specs - pip uses its own notation (git+https://...., plus #egg= fragments) and I’ve no idea how standard these are or whether other tools could want to use a different approach.

So IMO, there’s still a chunk of process that needs to be followed, as well as some outstanding discussion that needs to be had, before this is ready to be accepted.

I think one reason it has not been implemented yet is that people can work around it by bending editable installs backwards so pip freeze works reliably with vcs urls, where they might not need --editable in their requirement files if pip#609 was implemented.

The other reason is implementing it requires updating the database of installed distributions, which in turn requires a PEP, which is a difficult process (especially for someone like me who is not that well connected in the python community to find sponsors, and just trying to contribute as upstream as possible in the hope to be useful to the widest community)… Comparatively, my previous contributions to pypa (pip, setuptools_scm) were a breeze :slight_smile:

In the Specification section, I’ve been particularly careful to not introduce such lock in. I’ve examined PEP 440 and pipfile formats in addition to pip’s native format and made sure the json spec is generic and can be used to generate them all. The git+https://... format is not part of the spec, as I split the vcs type and url in different fields.

I’ll update the original post. My initial reasoning for not doing it was to keep the conversation consistent with the original post. [edit: done]

I think the only python core dev (besides you) who commented on the spec so far are @brettcannon and @ncoghlan. Apologies for the mention, Nick, since you seemed rather positive on the draft, would you accept to sponsor it? Or could another pip maintainer sponsor it given the narrow scope, despite not being a python core dev? Maybe @pradyunsg since you commented on pip#609 that it was still a valid issue?

I haven’t had time to reply to your response to my comments as you didn’t incorporate my suggestions. I would like to, and I will try to do so soon.

I’m also a core dev who commented on the draft. I’m also the pip maintainer who has been doing the most work on the area of the code affected by this PEP. I would even say it’s been the focus of my pip work. I’ve made probably dozens of commits – fixing bugs, features, adding tests, reviewing patches, and refactoring, which is still continuing.

Ooops, sincere apologies, I don’t how I did not notice you are a core dev.
So you are indeed a good candidate to sponsor this if you think it’s a good idea.

I did incorporate part of your suggestions in this commit.
I commented on the rest in post #11.

Thanks for your time on this matter, looking forward to reading your further comments.

I’m going to start out by replying only to the part about choosing a name different from ref – in part because of time, but also because this is a long message. Other portions like adding resolved-ref I will do in a separate message.

My original comment was more about asking not to use the word “ref” rather than to add the prefix “requested”. requested-revision was just one of the first alternatives that came to mind; revision was the other:

The reason not to use “ref” is that “ref” is a term that has a specific, different meaning in Git, and I think there are plenty of alternatives to choose from that don’t have this problem.

In Git, a ref is a string that begins with the prefix “refs/” and is a particular type of reference. From Git’s glossary:

ref - A name that begins with refs/ (e.g. refs/heads/master) that points to an object name or another ref (the latter is called a symbolic ref).

In Git at least, the most appropriate word for the concept being referred to by this PEP would be “commit-ish” or “committish.” Here is a link to Git’s glossary entry. Git’s documentation also tends to use “revision” and <rev> when speaking about ways to refer to a particular commit. For example, Git’s gitrevisions documentation has a section on “Specifying Revisions”.

Also, pip’s VCS code uses the variable name rev (for revision). This is why I think something like revision (or requested-revision, to distinguish from an actual revision) would be a lot better, and I don’t see any downside. committish would also be fine I think.

pip supports installing any committish, and not just revisions that resolve to refs.

As a baseline, here is an example of pip-installing from a branch (branch “azure-pipelines”, which resolves to ref “refs/heads/azure-pipelines”):

$ pip install git+https://github.com/python-attrs/attrs.git@azure-pipelines#egg=attrs
Collecting attrs from git+https://github.com/python-attrs/attrs.git@azure-pipelines#egg=attrs
  Cloning https://github.com/python-attrs/attrs.git (to revision azure-pipelines) to /private/var/folders/q9/j0_5hxt88v5592006s6dd3n80000gs/T/pip-install-q5aqzlw7/attrs
  Running command git clone -q https://github.com/python-attrs/attrs.git /private/var/folders/q9/j0_5hxt88v5592006s6dd3n80000gs/T/pip-install-q5aqzlw7/attrs
  Running command git checkout -b azure-pipelines --track origin/azure-pipelines
  Switched to a new branch 'azure-pipelines'
...
Successfully installed attrs-19.2.0.dev0

Here is installing from a tag (tag “17.2.0”, which resolves to ref “refs/tags/17.2.0”):

$ pip install git+https://github.com/python-attrs/attrs.git@17.2.0#egg=attrs
Collecting attrs from git+https://github.com/python-attrs/attrs.git@17.2.0#egg=attrs
  Cloning https://github.com/python-attrs/attrs.git (to revision 17.2.0) to /private/var/folders/q9/j0_5hxt88v5592006s6dd3n80000gs/T/pip-install-wcf_4dct/attrs
  Running command git clone -q https://github.com/python-attrs/attrs.git /private/var/folders/q9/j0_5hxt88v5592006s6dd3n80000gs/T/pip-install-wcf_4dct/attrs
  Running command git checkout -q 8f95ef7467f33efd3f7f0193a6b9bf714195eaf6
...
Successfully installed attrs-17.2.0

Here is installing from the committish “17.2.0~2”, which means two commits before the above tag and is not a ref and does not correspond to a ref (notice the log message, “WARNING: Did not find branch or tag ‘d0806d9d2fa’, assuming revision or ref”):

$ pip install "git+https://github.com/python-attrs/attrs.git@17.2.0~2#egg=attrs"
Collecting attrs from git+https://github.com/python-attrs/attrs.git@17.2.0~2#egg=attrs
  Cloning https://github.com/python-attrs/attrs.git (to revision 17.2.0~2) to /private/var/folders/q9/j0_5hxt88v5592006s6dd3n80000gs/T/pip-install-259feqcb/attrs
  Running command git clone -q https://github.com/python-attrs/attrs.git /private/var/folders/q9/j0_5hxt88v5592006s6dd3n80000gs/T/pip-install-259feqcb/attrs
  WARNING: Did not find branch or tag '17.2.0~2', assuming revision or ref.
  Running command git checkout -q '17.2.0~2'
...
Successfully installed attrs-17.2.0.dev0

Finally, here is an example of installing from the committish “d0806d9d2fa”, which is an abbreviated sha and is not a ref and does not correspond to a ref:

$ pip install git+https://github.com/python-attrs/attrs.git@d0806d9d2fa#egg=attrs
Collecting attrs from git+https://github.com/python-attrs/attrs.git@d0806d9d2fa#egg=attrs
  Cloning https://github.com/python-attrs/attrs.git (to revision d0806d9d2fa) to /private/var/folders/q9/j0_5hxt88v5592006s6dd3n80000gs/T/pip-install-lmge3z1d/attrs
  Running command git clone -q https://github.com/python-attrs/attrs.git /private/var/folders/q9/j0_5hxt88v5592006s6dd3n80000gs/T/pip-install-lmge3z1d/attrs
  WARNING: Did not find branch or tag 'd0806d9d2fa', assuming revision or ref.
  Running command git checkout -q d0806d9d2fa
...
Successfully installed attrs-19.2.0.dev0

pip can also even install from an actual ref.

Regarding Pipfile, I talked with @techalchemy about this at PyCon, and he said that if he could do it over again, he wouldn’t use the word “ref” for the same reasons I mentioned above – that these aren’t refs but rather something more general including things like branch names, tags, commit SHA’s, abbreviated commit SHA’s, etc.

PS - I’m open to considering sponsoring as I support the idea in concept. I just need to review what that entails first because as I recall that could impose some restrictions on my involvement later.

@cjerdonek thanks for the detailed explanation. revision is fine with me. I updated the draft.

TIL pip supports installing from git refs. That’s a great improvement (is it recent?), very useful to install a GitHub pull request.

Great, thanks!

It’s from a couple years ago and was first added in pip 10.0 (so somewhat recent but not too recent): https://github.com/pypa/pip/pull/4429
And indeed, the motivation was for checking out PR’s more easily. I think this feature might not be documented, which is one reason more people might not know about it. (It would be good to document this.)

Okay, here is my more detailed feedback on the PEP after going through the draft in more detail. Thanks for your patience, @sbidoul. Overall I like the PEP (so my comments are mostly minor). I think it will be very useful and fills an important gap. I also offered to @sbidoul to be the sponsor, and he agreed.

My comments:

This proposal defines additional metadata, to be added to the installed
distribution by the installation front end, which records the direct
reference for use by consumers which introspect the database of installed
packages (see PEP 376).

Going along with the idea of making sure that each packaging PEP has good human-friendly names for each concept being introduced (e.g. so people don’t have to use the PEP number, cf. the comment / thread here), I think it would be good to choose explicit names for perhaps (1) the metadata as a whole being defined, and (2) the types of origins that the metadata can include. This way people can talk more easily about this metadata. If this is done, in addition to including them in the abstract, it might also be good to include one of these names in the title.

url MUST be stripped of any sensitive authentication information,
for security reasons. The user:password section of the URL MAY however
be composed of environment variables, matching the following regular
expression::

On the authentication information, would this mean the environment variable names have to match the names that were used in the original invocation? Also, if authentication info was present in the URL but environment variables weren’t used, should there still be some indicator that authentication info was used / needed? (There is also the separate possibility that auth info was required and used, but not via the URL, e.g. from a user prompt. I’m not sure if that should also be reflected somehow.)

A vcs field MUST be present, containing the name of the VCS (i.e. one
of git, hg, bzr, svn). Other VCS SHOULD be registered by amending this PEP.

I think it would be good to include a subsection under the “Specification” section enumerating the registered VCS’s. The subsection can be called something like “Registered version control systems.” For each VCS, I think it should include the full name of the VCS (e.g. “Mercurial” as opposed to hg), a link to the VCS’s website, the VCS’s command name, and the string key that should be used for the vcs field. These sections can also include any additional info or fields specific to that particular VCS.

I would also clarify explicitly whether it’s okay to use VCS’s that aren’t registered, as well as whether it’s okay to use fields that aren’t named in the PEP (for registered VCS’s or unregistered VCS’s). Like, maybe only for unregistered VCS’s, it would be okay to use new fields in case the VCS requires concepts not contemplated by the PEP.

A resolved_commit_id field MUST be present, containing the exact
commit/revision number that was installed.

It might be worth clarifying the type of this value. Should it always be a string, or should it be an integer in the case where the commit ID is a number?

This may require a little extra work / research, but I think it would be worth including a very brief informational subsection on each VCS. The purpose could be to help explain how the fields in the PEP map to the terminology and behavior for that VCS. For example, for Mercurial it could say that resolved_commit_id should be the Mercurial changeset ID rather than a Mercurial revision number. For Subversion, it could say that commit hashes aren’t available, and that the revisions are integers, etc. This would help because not everyone knows about all VCS’s. It would also help for documenting the extent to which each VCS supports requesting individual revisions, and how it does so.

One suggestion I made before that I’ll elaborate on here is that I think a Git-specific resolved-ref field should also be defined (as “MAY” be present). This would be a string that records the Git ref that the given revision string matches, if any (or some value like the empty string if the requested revision was confirmed not to correspond to a ref). This is useful because it would let one know if the requested revision is one of a branch, a tag, or a ref, and which one (based on whether the ref starts with refs/heads/, refs/tags/, or something else, respectively). pip currently implements this detection logic here, but currently it only uses it to know if a branch was requested.

One standards-driven reason for including this information is that it would let a front-end use the @<tag>#<commit-hash> form described in PEP 440 as opposed to just the @<commit-hash> form:

To handle version control systems that do not support including commit or tag references directly in the URL, that information may be appended to the end of the URL using the @<commit-hash> or the @<tag>#<commit-hash> notation.

This is useful because one can’t tell from a commit hash alone what is being installed, but a tag (like a version number) does give this information (and a tag is semantically immutable).

Commands that generate a direct_url.json:

* pip install ./app
* pip instal file:///home/user/app

I haven’t thought through what I think the right answer should be here, but I think the behavior should be clarified for installing from directories that also happen to be VCS repositories (but aren’t invoked as such). For example, in this case should the VCS info be recorded? Also, what if the commit ID can be detected, but the working tree is “dirty” or contains extra files? My instinct is that VCS origin info should only be included in the metadata if the directory is being installed via a genuine VCS url as opposed simply to a path, so that e.g. things like dirty working trees wouldn’t affect the install.

@cjerdonek thanks a lot for sponsoring this!

I’m not sure what to do with that part. Do you have concrete suggestions?
Shall I add a sentence in the abstract such as “This PEP and the metadata it specifies should be referred to informally as Direct URL Origin”.

Yes. I’ll clarify that.

I don’t think that’s necessary. More precisely, I don’t see how tools would use it. I’d say we can introduce it later if/when a use case pops up.

Ok, I’ll do that.

Good point, I’ll clarify it must be a string.

I still have doubts about this one, mainly because I don’t visualize the use cases.
Producing a PEP 440 url is feasible by combining the requested revision with the resolved_commit_id as @<revision>#<resolved_commit_id>. It’s true PEP 440 mentions <tag> yet I don’t think it meant to exclude branches or any other kind of revision specifier. If revision was itself provided as a commit id, resolving it to a ref is possibly ambiguous.

In which case would a tool use resolved-ref over revision?

I agree. I’ll see to make that explicit in the spec.

New version at 5546deec, and original post updated.

I see @cjerdonek has handled the PEP sponsorship question (hooray for the increasing number of PyPA folks that are also core developers).

That said, the question does highlight the fact that when we added PEP sponsorship to PEP 1, didn’t consider how it might interact with standing delegations, so it’s currently silent on that topic.

1 Like

Hi folks, as the sponsor of Stéphane’s (@sbidoul’s) PEP, I just want to say I’m now deeming it ready for submission, and Stéphane will be submitting a PR to the PEP’s repo shortly (I’m guessing within a couple days). He’ll be posting to this thread once he’s done so. I believe his branch is current with what he’ll be submitting: https://github.com/sbidoul/peps/tree/source_url-sbi

I want to thank Stéphane for his patience because it took me some time to get back to him on a number of occasions. I was communicating privately with him on some changes, which is what was happening in the meantime. The changes since then I would say are still on the minor side, which is part of why I’m deeming it ready now. Another reason is that there weren’t any outstanding concerns or objections that I can remember from before.

Finally, we both agreed to add myself as a co-author after working on new wording, which is why I’ll be listed in the Author field rather than the Sponsor field. Thanks, Stéphane, for your continued work on this!

1 Like

Thanks Chris for you support and work on this matter.

I submitted the PR at https://github.com/python/peps/pull/1145

2 Likes