PEP 625: File name of a Source Distribution

I hope this is the case, but may have missed that. What were the better ideas we came up with on this, exactly? All previous discussions on sdist format seem to have gone nowhere; one main motivation behind this PEP is precisely that the extension change would happen anyway exactly like this, and having it first with a “Metadata Format 0” that specifies nothing is already an improvement even with the one dash rule applied.

Having a new name adds value now, as it would allow pip to implement that “if it’s .sdist, use the simple rules and rely on them” code immediately, offering a means for improving behaviour for direct URLs and local files. Deferring until we have a defined sdist format just delays when pip can do this, at no significant benefit for pip (because pip doesn’t really care about metadata other than name, version and dependencies, and it’s likely that we’d end up defining everything but dependencies if past experience is anything to go by…) Other tools would benefit from a full sdist standard, presumably, so I’m not arguing that we don’t standardise. As a pip maintainer, though, I don’t care that much about the second part; I do care about the filename.

It was a hypothetical future, the point at which we might want to consider switching the extension.

It adds a huge amount of churn though. That added value doesn’t seem worth it, but maybe I’ve just never encountered the problem among all the teams at work that I support.

What is this referring to?

Personally, this is exactly why I stopped trying to push the metadata standardization topic. If we can’t even agree on a file name then trying to standardize metadata feels pointless. Plus if we can agree how we expect tools today to handle old sdists that’s also a tangible benefit now.

And that last case should come with standardizing the metadata. Toss in moving to the zip format and it becomes easier to access that data.

tl;dr: I don’t think that documenting status quo can block us from any improvements down the line, so we should start with that and make changes in later revisions.


Catching up on this thread, looks like most of the discussion is around how we’d change source distributions (rebranding / file names / archive types) etc and… discussions of the trade-offs of those changes.

I think that clearly shows that we want to make improvements here, while being mindful of their impact. OTOH, right now, we don’t have a single place/design document to look at for what we’re trying to improve. I think we should be writing a “source distributions 1.0” specification such that the current behaviors of pip / setuptools / twine / warehouse etc are standardized as (almost?) fully compliant with it. All the other changes we want to make can happen in a 1.1 or 2.0 spec.

Whether we like it or not, the tooling has to support existing source distributions in the ecosystem. And if we want to make drastic changes, we have to properly consider the backward compatibility concerns etc, which can become a 2.0 spec. Having a clear spec to be backwards compatible with, instead of implementation-defined behaviors, will lower the effort needed to make changes as well.

Doing this should also help us avoid getting derailed(?) right now – in trying to do a better-job-than-status-quo while status-quo is not clearly described in a single place – and also give us a good base to build upon for those improvements.


My proposed heuristic.


Are you saying that there’s a chance we shouldn’t change the extension when introducing the new format? If so, what’s the scenario? This would need to be something better than “I don’t know since we haven’t designed that yet, but it might”. It is impossible to argue against speculation, and it is the speculating side’s responsibility to prove it, not the receiving end of it.

The proof is also very much needed for the whole sdist discussion; otherwise it will always run in circles: every partial design is shut down because we may want it done otherwise if everything is fully designed, without any indication of what that full design actually is. This is frustrating since the comment is always so easily made without any responsibility to prove it true, while the other side has no way to object.

Assuming we will have a new extension when the sdist format is done, all churn required now will still be required. Unless you think a new extension would be a bad idea (again, proof required), the work relating to it is unavoidable, and does not change whether we do it now or later.


This is the current specification in full:

  • The file name is {name}-{version}.{ext}.
    • {name} may or may not be normalised.
    • {version} may or may not be normalised. It may or may not be compliant with PEP 440.
    • {ext} is either tar.gz or zip.
  • It is an archive. There are files in it.
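
To illustrate why this is so loose, here is a minimal sketch (not taken from pip or any other tool) that lists every name/version split the rule above permits. The helper name `candidate_splits` and the sample file name are hypothetical.

```python
# Hypothetical helper: enumerate every (name, version) pair that the loose
# "{name}-{version}.{ext}" rule allows for a given file name.
SDIST_EXTENSIONS = (".tar.gz", ".zip")


def candidate_splits(filename):
    for ext in SDIST_EXTENSIONS:
        if filename.endswith(ext):
            stem = filename[: -len(ext)]
            break
    else:
        return []  # not one of the two archive extensions listed above

    parts = stem.split("-")
    # Neither field has to be normalised, so any dash could be the separator.
    return [
        ("-".join(parts[:i]), "-".join(parts[i:]))
        for i in range(1, len(parts))
    ]


print(candidate_splits("my-package-1.0.tar.gz"))
```

With more than one dash in the stem, the list has more than one entry, and nothing in the file name says which one the author meant.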

Honestly I think PEP 517’s Source distribution section is enough as a specification of the current situation.

I’m saying the new format is the speculative part. When we have THAT proposal, we should consider changing the extension.

All we want right now is for names to be valid, which can either be satisfied by fixing the tools that generate names, or by introducing a change that makes all existing names invalid (as well as every existing tool that generates or consumes names).

That’s not hyperbole, that’s the situation. Here’s the hyperbole: users hate arbitrary change, and breaking all of their existing/legacy/legitimate workflows so that pip can be a little more efficient on an edge case (for which there is an easy workaround - rename the file manually) is an arbitrary change that will make them continue to hate us.


Who is “we” here? The PEP authors want to be able to take a filename with no additional information¹, and determine what data we can infer from that filename without doing any IO.

Currently, the answer is “nothing, unless the extension is .whl”. If we know the file came from an index we can reasonably assume .tar.gz is a sdist with a known naming convention², but if we don’t know that, .tar.gz tells us nothing.
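
As a rough sketch of that decision, assuming a hypothetical function `infer_from_filename` and a hypothetical `from_index` flag (neither is pip's actual code):

```python
# Hypothetical sketch: what can be inferred from a file name with no IO.
def infer_from_filename(filename, from_index=False):
    if filename.endswith(".whl"):
        # The wheel extension is the only one that is unambiguous on its own.
        return "wheel"
    if from_index and filename.endswith(".tar.gz"):
        # Only reasonable because an index is assumed to serve sdists.
        return "sdist"
    # A bare .tar.gz from a local path or direct URL tells us nothing.
    return None


print(infer_from_filename("demo-1.0-py3-none-any.whl"))
print(infer_from_filename("demo-1.0.tar.gz"))
print(infer_from_filename("demo-1.0.tar.gz", from_index=True))
```

The third case is the one that relies on context outside the file name, which is exactly the information that is missing for direct URLs and local files.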

Yes, creating a new filename convention is change. Yes, it has an impact. Yes, it’s potentially complex, if we decide to retrospectively change existing sdists on PyPI³. And yes, there will be disruption.

But that change is needed at some point, regardless. Assuming that we’re ever going to standardise sdists, we’ll need to do this. Why can’t we do it now? What is the blocker? Once we have a new naming convention, internal changes to what we store can be managed relatively simply by versioning the contents, so it’s unlikely we end up with multiple major upheavals - any more than a new metadata spec is a major issue for users.

I guess I’m not seeing why people are so opposed to doing this “now” rather than “later”. And what has to happen in order for us to decide it is time to do it.

To be specific, Avoid generating metadata in `pip download --no-deps ...` · Issue #1884 · pypa/pip · GitHub is an example of a bug report directly caused by the fact that pip currently needs to do a build step to confirm metadata (name and version). And I’m a bit tired of saying that we can’t fix it until “we” standardise sdists - specifically of being seen as one of the people responsible for the fact that it’s taking so long to do so. That’s why I’m backing this PEP.

If we’re doing hyperbole, pip’s users already hate us because everything is slow and involves incomprehensible build steps when there’s no clear reason for them. They aren’t interested in odd edge cases, they just expect stuff that seems simple to be simple. Until they hit one of those edge cases, and then they scream at us that it’s essential that pip doesn’t break just because {insert some weird legacy case here} - and we have no standard to point at to say they shouldn’t do that.

I don’t think hyperbole helps much here :slightly_frowning_face:

¹ Specifically, without necessarily knowing it came from a package index.
² But we have code in pip that cross-checks later by doing a build step. Maybe that could be removed, but there’s a risk in doing that and it would be a lot easier to justify if there were a standard to back us up.
³ That’s not essential, it’s just something that’s a potential question to discuss. But as I say, it’s complex, so it’s OK if we skip this bit.


To make pip slightly more efficient, I think the idea discussed in "specifying static metadata that can be trusted" is less disruptive for users and does not require a change in name:

  • if the name & version are marked as reliable metadata, as will be the case with “modern” builder, trust them
  • otherwise, build the metadata to check; users might complain that it is slow (like it already is) but we’ll be able to suggest that they switch to a “modern” builder.

But yes, you’ll still need to download the sdist to check it (which would not be needed with the name change), but I find this less impacting than a name change.
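
A minimal sketch of that fallback, assuming a hypothetical `reliable` field that lists the trusted metadata keys and a hypothetical `run_build` callable standing in for the slow build step (neither exists in any current standard):

```python
# Hypothetical sketch: trust static name/version when marked reliable,
# otherwise fall back to the (slow) metadata build.
def name_and_version(static_metadata, run_build):
    reliable = static_metadata.get("reliable", ())
    if "name" in reliable and "version" in reliable:
        return static_metadata["name"], static_metadata["version"]
    built = run_build()  # only reached for sdists from older builders
    return built["name"], built["version"]


def fake_build():
    # Stand-in for invoking the build backend.
    return {"name": "demo", "version": "1.0"}


modern = {"name": "demo", "version": "1.0", "reliable": ["name", "version"]}
legacy = {"name": "demo", "version": "unknown"}
print(name_and_version(modern, fake_build))
print(name_and_version(legacy, fake_build))
```

Both paths need the sdist contents to be available, so the download cost mentioned above still applies.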


I could certainly live with that (I still think that ultimately we’ll need a name change, but doing that would mean we could manage a bit longer without).

But I got the impression that discussion was even more badly stalled than this one?

Honestly, I’m losing interest in even trying to progress on improving sdists. It feels as if we’d be better just making an implementation decision in pip and standardising after the fact, once we’ve demonstrated the world didn’t fall apart when we unilaterally declared some behaviours as “not supported” - because the “what if” questions are paralyzing progress. (The downside being that going that route prevents implementing anything that needs backend support).


Sorry, I have had less time for packaging things than ever lately so I haven’t caught up with the rest of the thread, but I just wanted to chime in and say that I did not realize that we were targeting a situation where the files weren’t local on disk for these sorts of “it makes it easier to do dependency resolution without opening up the file” improvements. This is very helpful information, and I agree it is very costly to require downloading the file.

In this case, I think a decent middle-ground would be something like this proposed approach, where metadata is exposed as attributes via the API. For “local-folder-as-index” situations, you can introspect the sdist to see if it’s reliable and PyPI can introspect the sdist on upload to determine whether the name is “trustable” (or even to expose the name and versions).

I suspect many PyPI mirrors would not immediately update to expose this information (since many of them only added Python-Requires support years later, if at all), but I think “this breaks when your file version has dashes in it and you are using an index with an out of date standard” would probably be no worse than the current status quo, and the common case would be a significant improvement and end users would start seeing benefits even with no action taken by library maintainers other than updating their version of their build tools. Of course, that would require a coordinated effort between backend, frontend and PyPI maintainers, but as I’ve said here and elsewhere, I think we’re the people who get the most benefit out of this improvement, and that’s much easier than an ecosystem-wide change.

We’re also not targeting getting dependency information (the focus on the filename is specifically limited to project name and version).

The key here is whether tools are allowed to ignore the PEP 517 hooks to get project metadata and assume that a name and version derived from the filename is correct. This is a win if the only metadata you need is the name and version.

The secondary problem is how tools can recognise when the algorithm for determining name and version is applicable. “When we already know it’s a sdist” is as far as we’ve got with that, and that means “when it came from PyPI and has extension .tar.gz”. (We could say “when it comes from a PEP 503 index” if we’re willing to say that “PEP 503 indexes MUST NOT serve files with a .tar.gz extension unless they are sdists”, but that’s not currently part of the PEP). This is a win if you’re processing local files, direct URLs or arbitrary URLs (all of which pip allows, but other tools may not).

Either way, this would only ever be a performance and maintainability issue, as it would be about allowing tools to omit checking and error handling code that would otherwise be needed to validate the data derived from heuristics, and recover (or fail) gracefully from that mismatch.

But it’s a performance issue that pip users do actually complain about.

@dstufft had a similar proposal in pypa/warehouse#8254. But this only circles the discussion back to where it began. To be able to expose sdist metadata, sdist metadata needs to be reliable, which means an sdist metadata spec needs to be designed, which means we need to standardise sdist. And that discussion (Sdist idea: specifying static metadata that can be trusted) is stalled, so nothing depending on it can happen.

The proposal in PEP 625 takes the one part of the sdist format (the file name) that needs to happen eventually and is not blocked by anything, and does that first. Sure, what it can achieve can also be achieved in other ways if we get to finish the sdist format discussion and everything that currently gets blocked by it, but after that happens, we are very likely going to introduce that file name change anyway, so getting the name change right now would provide the benefits before the discussions finish (or even progress), without affecting what we can do later on.

What I’ve read from every objection seems to say “we don’t need to do this now because we can do something else.” But we are not doing that something else now, and this still needs to happen after we do that something else. So why is that other thing relevant? That’s what I don’t get.


I think if the entire value proposition is “this should be easy and uncontroversial” and it turns out to be very controversial, then this proposal doesn’t have value.

I would disagree both with the characterization of the extension change that it is necessary even for standardized sdists and with the idea that even if it’s going to happen anyway that it should happen ahead of the rest of the standardization efforts. Even assuming it’s going to happen anyway, it’s actually much more useful as an indicator that a given sdist is a standardized sdist than a situation where we have a mixture of some non-standardized sdists with standardized names and standardized sdists using that name.

Toss in the algorithm we can all agree to in parsing pre-existing sdists to get the name and version out of it and I’m sold. We can then add that algorithm to ‘packaging’ and be done with it.

I personally think the sdist metadata topic is stalled because this topic is also stalled.

I think that’s fair.

My proposal

To break out of this rut we seem to be in, here’s my proposal of what should happen:

  1. The PEP as outlined by PEP 625: File name of a Source Distribution - #92 by uranusjr happens as sdist 1.0, with all tools expected to follow what PEP 517 says going forward, and we document the algorithm for parsing old names that don’t follow PEP 517 so we can truly say which old sdists are not PEP-compliant (and that algorithm goes into ‘packaging’ so it’s shared)
  2. We figure out exactly what minimum metadata we would want every sdist to have if we were designing this today (both from a pip and PyPI perspective as they would be the consumers of this metadata)
  3. We figure out how we want to store that metadata
  4. We figure out a versioning scheme for sdists in case we need to change stuff again in the future
  5. Anything else we would want sdists to have if we were designing them today? (E.g. pyproject.toml or setup.py must be present to build the sdist, but I suggest we keep it minimal so we get this done)
  6. We write a PEP for sdists 2.0 which encompasses the answers to points 2 onwards which can include any file name changes like .sdist so we have the clear motivation on any file name change

Does that seem reasonable?


So just to be clear, there’s no way to tell that an arbitrary file is a sdist prior to sdist 2.0? That would mean that tools have to treat a .tar.gz file as an archived source tree (to use PEP 517 terms) and not a sdist. I’m fine with that, to the extent that it’s a reasonable way forward. It doesn’t offer any practical value to pip that I can see - but that’s OK from a pure standardisation POV.

I’d like to have an additional constraint that consumers may assume that any .tar.gz file served by an index is definitely a sdist (1.0 or legacy) and may ignore files whose names can’t be parsed with the “legacy” algorithm, or which are served from a project page whose project name doesn’t match the file’s project name. That at least allows tools reading package indexes to assume a consistent view of what they see. But if that’s too contentious, then we can just go with what you said.
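
One way a consumer could apply that constraint is sketched below. The name normalisation is the one defined in PEP 503; everything else, including the function name `sdists_for_project` and the choice to try every dash as the separator, is an assumption for illustration and not the agreed “legacy” algorithm (which this thread has not yet defined).

```python
import re


def normalize(name):
    # Name normalisation as defined in PEP 503.
    return re.sub(r"[-_.]+", "-", name).lower()


def sdists_for_project(project, filenames):
    """Keep only .tar.gz files whose name part matches the project page."""
    wanted = normalize(project)
    accepted = []
    for filename in filenames:
        if not filename.endswith(".tar.gz"):
            continue
        parts = filename[: -len(".tar.gz")].split("-")
        # Assumption: accept the first split whose name matches the project.
        for i in range(1, len(parts)):
            if normalize("-".join(parts[:i])) == wanted:
                accepted.append((filename, "-".join(parts[i:])))
                break
    return accepted


files = ["my_package-1.0.tar.gz", "other-1.0.tar.gz", "my-package-2.0-1.tar.gz"]
print(sdists_for_project("My.Package", files))
```

Knowing the project name from the page removes most of the ambiguity, but not all of it: a file for a project called `foo-bar` served from the page for `foo` would still be accepted with a version starting with `bar-`, so a real implementation would likely need a further check on the version part.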

To avoid just getting stalled again, how do we make sure that steps 2 onwards actually happen? Is anyone willing to champion that work?

Without cracking the tarball open, checking for setup.py or pyproject.toml and then just assuming they will work? Not that I’m aware of. But I also don’t think it’s unreasonable to assume that any tarball in a situation that is used to install a package is meant to act as an sdist.

:+1:

I’m personally fine with that constraint, although I would tweak it by saying that assumption is made if the file name can be parsed.

I mean I plan to keep poking and prodding these ideas if people are okay with the outline I provided to keep the work focused on a single sdist topic at a time. I’m currently not in a position to take on another packaging PEP, though.


That’s true, but I think it’s been implied (by me and Brett and probably Paul?) that we think it’s okay to treat a filename that looks like {name}-{version}.{ext} as actually representing that name and version (including if you have to squint to deal with unnormalised names).

FWIW, PyPI already serves up sdists that contain different versions from what the filename says (I’d link to the example I was looking at the other day but it’s malware and should be gone already :wink: ). But in these cases the fix is obvious (rename the file) and clearly belongs to the packager.

The example that originally sparked the suggestion that we need a new extension was the fact that GitHub archive downloads (I think) can take the form {project}-{tag}.tar.gz, which looks very like the sdist format but isn’t (also the content is not a sdist but a zipped source tree, although it’s not obvious if pip makes any use of that distinction). And pip can install GitHub archive downloads direct from the URL…

Maybe pip shouldn’t do that, and maybe everyone would support us if we stopped allowing that case, but we wouldn’t be able (as far as I can see) to produce a friendly message explaining the issue if we did. So the transition wouldn’t be easy.

Sigh. There’s too much speculation here about what might and might not happen. As I’ve said, I’m OK with Brett’s proposal - let’s leave the details to anyone who tries to implement something in pip based on it (that someone will not be me, so I’ll stop worrying).
