Sdist idea: specifying static metadata that can be trusted

Technically build tools don’t need to read pyproject.toml. My understanding is that PEP 621 is not intended to be mandatory. That said, even assuming universal adoption, build tools already need to be able to write METADATA files (to generate the metadata in wheels), but they don’t need to be able to write pyproject.toml files, just read them. Tools processing metadata already need the capability to read METADATA files, but not pyproject.toml, so I think this argument actually cuts against pyproject.toml.

We must add a second file (or, I suppose, mutate the first one?), and it will undoubtedly say something different, because many packages will generate metadata at build time rather than having it statically specified in the input.

Even if we mutate the file, it doesn’t alleviate the question of precedence, it just answers it as “the new file takes precedence”. In any case, it shouldn’t matter much, because we should definitely say that a tool is not compliant with the spec if running the tool again with the same pyproject.toml produces different values in any “reliable” fields.
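
As a rough illustration of that compliance rule, here is a minimal sketch that compares the metadata from two runs and reports any “reliable” field that changed. The function name, the example project and the choice of which fields count as reliable are all assumptions made for the example, not part of any spec.

```python
from email.parser import HeaderParser


def mismatched_fields(first, second, reliable_fields):
    """Return the reliable field names whose values differ between two builds."""
    mismatches = []
    for field in reliable_fields:
        # get_all() returns every occurrence, so multiple-use fields compare too.
        if sorted(first.get_all(field, [])) != sorted(second.get_all(field, [])):
            mismatches.append(field)
    return mismatches


# Hypothetical output of two builds from the same pyproject.toml.
build_one = HeaderParser().parsestr(
    "Metadata-Version: 2.1\nName: example-pkg\nVersion: 1.0.0\n"
    "Requires-Dist: numpy>=1.17\n"
)
build_two = HeaderParser().parsestr(
    "Metadata-Version: 2.1\nName: example-pkg\nVersion: 1.0.1\n"
    "Requires-Dist: numpy>=1.19\n"
)

# Only Name and Version are treated as reliable in this example.
problems = mismatched_fields(build_one, build_two, ["Name", "Version"])
```

Here `Requires-Dist` also differs between the two runs, but it is not reported because it was not listed as reliable.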

Yes, this is the case, but it’s not exactly something I would generally recommend, let alone mandate. To the extent that sdists contain different stuff than a repo checkout, it’s usually that the source distribution leaves out stuff specific to the repo (.gitignore, CI configuration, etc), and sometimes includes generated files (like setuptools_scm generating a _version.py file, or the controversial (anti-?)pattern of including generated C code in an sdist). I think it’s exceedingly rare to mutate existing files. If we were to mutate the pyproject.toml file as part of the inclusion, users looking at the source distribution wouldn’t be able to see the actual source that is used to generate the package!

I think an sdist is both a binary artifact for building from and a source for reading from. Because it is both something humans read and a required configuration file for building the package, the pyproject.toml file will be included regardless. I think we should include it in its unmodified form, which humans should be able to read easily. Humans also don’t necessarily need strict metadata about whether things are reliable or not (if we’ve done our job right designing PEP 621, they should be able to discern reasonably easily which values are provided by the tool).

I also think that the METADATA files are not terribly difficult for humans to read; they are a newline-delimited set of key: value pairs with minimal syntax. If you need to pop into the METADATA file to see what the tool-provided values resolved to, I don’t think you’d have much trouble doing so.
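
The same simplicity holds for tools. As a minimal sketch, the standard library’s email header parser can read the format; the metadata text below is a made-up example, not taken from a real package.

```python
from email.parser import HeaderParser

# Hypothetical METADATA content, for illustration only.
METADATA_TEXT = """\
Metadata-Version: 2.1
Name: example-pkg
Version: 1.0.0
Requires-Dist: requests
Requires-Dist: attrs>=19

A short description body.
"""


def parse_metadata(text):
    """Parse core metadata text into an email.message.Message."""
    return HeaderParser().parsestr(text)


msg = parse_metadata(METADATA_TEXT)
name = msg["Name"]                            # single-use field
requires = msg.get_all("Requires-Dist", [])   # multiple-use field
```

Single-use fields are read by key, and `get_all()` collects every occurrence of a multiple-use field such as `Requires-Dist`.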

This argument is convincing enough for me. I’m quite happy to not add new code to generate another file. I deliberately designed my sdist generation to write and use METADATA as-is, so selfishly I’d be happy to not have to add a new output format :slight_smile:

Yep, you could build two wheels with different versions of numpy, and they would have the same compatibility tags, thus the same filename, but not be compatible with the same versions of numpy.

Numpy is the only case where I can point to specific examples like this, but it doesn’t have any particular special status - it’s just a widely used package with a C API. I wouldn’t be entirely surprised to see something similar happen around PyQt, for instance.

I believe many sdists already contain a file called PKG-INFO, with the same format as the METADATA file in wheels. The latest pip & requests sdists have this, for instance, and Flit writes a very minimal version of it.
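
For anyone who wants to check a given sdist, here is a minimal sketch that pulls the top-level PKG-INFO out of a tarball and parses it. It assumes the usual `<name>-<version>/PKG-INFO` layout; the helper names and the tiny in-memory sdist are made up for the example.

```python
import io
import tarfile
from email.parser import HeaderParser


def read_pkg_info(sdist_file):
    """Return the parsed top-level PKG-INFO from an sdist tarball, or None."""
    with tarfile.open(fileobj=sdist_file, mode="r:*") as tar:
        for member in tar.getmembers():
            parts = member.name.split("/")
            # Assumes the usual layout: <name>-<version>/PKG-INFO
            if len(parts) == 2 and parts[1] == "PKG-INFO":
                data = tar.extractfile(member).read().decode("utf-8")
                return HeaderParser().parsestr(data)
    return None


def make_example_sdist():
    """Build a tiny in-memory sdist (hypothetical project) for demonstration."""
    payload = b"Metadata-Version: 2.1\nName: example-pkg\nVersion: 1.0.0\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo("example-pkg-1.0.0/PKG-INFO")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


pkg_info = read_pkg_info(make_example_sdist())
```

With a real sdist you would pass an open binary file object instead of the in-memory example.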

So maybe the path of least resistance is to formally specify this? Then we have a lot of projects already complying with the ‘new’ standard. :slight_smile:

It already is: PEP 314 (Metadata for Python Software Packages 1.1).

Is anyone actually consuming PKG-INFO? It has also deviated from METADATA in various ways, so directly using it would lead to a potential third file after pyproject.toml and METADATA.

Do we know exactly how PKG-INFO has deviated?

Yes, by comparing PEP 314 to the Core Metadata Specification. The most significant difference off the top of my head is Requires: vs. Requires-Dist: (not just the key, but also the allowed format).

PEP 314 has version 1.1 metadata, but from what I can tell setuptools generates Version 2.1 metadata even for PKG-INFO.

From what I can tell Requires and Requires-Dist are not currently included in PKG-INFO, but the contents seem to be otherwise identical (though for some reason backports.zoneinfo doesn’t have the Description field at the end in the PKG-INFO file, so I’m not actually sure if it would parse properly anyway).

I think that to the extent that PKG-INFO has deviated from METADATA, it is a subset, and possibly more unreliable. I don’t know if anyone is parsing it or relying on its existence, but I suspect that the best way forward is to introduce a new Metadata-Version that includes something like Unreliable-Field: to specify which values are not guaranteed to be the same after a wheel is built. If we use PKG-INFO, we can just specify that in new sdists, PKG-INFO should be in the new metadata format, and then we do not have to worry about any deviation (though I’m still somewhat partial to using .dist-info/METADATA, to give us more room to expand the types of metadata we include in the source distribution; we could presumably make PKG-INFO a link to .dist-info/METADATA, as I think links are allowed in tarballs).

What I’m gathering from this thread is that the metadata that must be static is:

  1. Name
  2. Version

Toss in metadata version and you have METADATA as specified for wheels. That seems like a reasonable argument for going with METADATA.

Do we need an equivalent of WHEEL? If so do we need more than metadata version and potentially the builder? Is this a place to potentially record where the source code came from? I.e. is there a way to track a built wheel back to the sdist or source that it came from already that I’m not aware of (PEP 610 seems to head in that direction, but I don’t know how it handles installing from a cached wheel and I don’t think there’s a way for an sdist currently to say where its source resides)?

Yes.

Regarding the PEP 610 question, I don’t think we need to do something like that? FWIW, PEP 610 is aimed purely at “frontend information on which url this came from”, and I don’t think we need to be propagating information about what sdist a wheel was built from.

Would it make sense to add a new field in PKG-INFO that enumerates which other fields are static and can be trusted not to change when preparing metadata?

Not sure if this is elided or not, but I think the most crucial improvement we could make is annotation of field reliability, which is why I suggested above that we add an Unreliable-Field: field, or something to that effect (whenever Metadata-Version is > 2.1, we’d assume anything not annotated as such is static). This is necessary to distinguish between “did not specify this field” and “this field was specified but in such a way that we cannot guarantee that it will be invariant between builds”.
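
To make the proposed semantics concrete, here is a minimal sketch of how a consumer might interpret such a field. `Unreliable-Field` is only the name floated in this thread, and the version number `2.2` below simply stands in for whichever future Metadata-Version would introduce it; neither is an existing standard as far as this discussion is concerned.

```python
from email.parser import HeaderParser


def parse_version(text):
    """Turn a Metadata-Version string such as '2.1' into a comparable tuple."""
    return tuple(int(part) for part in text.strip().split("."))


def is_reliable(meta, field):
    """Apply the proposed rule: above 2.1, anything not annotated is static."""
    if parse_version(meta["Metadata-Version"]) <= (2, 1):
        return False  # legacy metadata carries no reliability annotation
    unreliable = {
        value.strip().lower() for value in meta.get_all("Unreliable-Field", [])
    }
    return field.lower() not in unreliable


# Hypothetical new-style metadata; "2.2" is a placeholder version.
new_style = HeaderParser().parsestr(
    "Metadata-Version: 2.2\nName: example-pkg\nVersion: 1.0.0\n"
    "Unreliable-Field: Requires-Dist\n"
)
legacy = HeaderParser().parsestr(
    "Metadata-Version: 2.1\nName: example-pkg\nVersion: 1.0.0\n"
)
```

With these inputs, `Version` counts as reliable in `new_style`, `Requires-Dist` does not, and nothing in `legacy` does.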

I would think we would have to go the other way on this, and list the reliable fields.

Which one is the default doesn’t matter to me too much as long as there’s a clear way to discern that we’re making a deliberate choice, and I figured “I’m emitting the metadata version that requires you to annotate all unreliable fields and I didn’t list this field as unreliable” would be sufficient indication.

The main reason I said that we should list the unreliable fields is that I expect the majority of fields are already reliable and as people either migrate to setup.cfg, PEP 621 or backends that use static metadata, we’ll increasingly see that every field is marked as reliable. Similarly, I am hoping we’ll be able to parse setup.py to even consider stuff specified in setup.py as reliable in many cases. I expect in 5-10 years the metadata landscape will be much more reliable, and it will look silly that we’re marking every field as reliable.

Yea, the main reason I like marking fields as reliable is just due to the fact we have a million or more sdists on PyPI that have some level of unreliable metadata in them, and it feels like a bit of a footgun to me to design the API to assume reliable, because it feels very easy for someone not steeped in packaging lore to not realize the importance of checking metadata version prior to assuming reliable, whereas if we assume unreliable then that sort of sidesteps a particular class of bugs from occurring.

My opinion changes though if we end up using some totally new file that doesn’t have a history of existing in sdists already, then it probably makes sense to be able to say assume reliable unless otherwise stated.

But why would you bother specifying unreliable metadata? I didn’t specifically mention something like this because in my mind, if we have new-style sdists write out a METADATA file then what’s in there is guaranteed reliable and static, and something that wouldn’t be is left out. That then means tools would fall back to PEP 517 to query for what they need if it wasn’t provided already, or the build tool does its thing to get the info it needs if it didn’t write it down when it constructed the sdist.

Or put another way, a METADATA file in a new-style sdist would simply record known-knowns and everything left out are known-unknowns for the sdist.
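
A minimal sketch of that lookup, under stated assumptions: `build_metadata` is a hypothetical callable supplied by the caller that stands in for whatever PEP 517 step (for instance running the backend’s `prepare_metadata_for_build_wheel` hook and parsing the result) produces the full metadata. Nothing here invokes a real build backend.

```python
from email.parser import HeaderParser


def lookup(sdist_meta, field, build_metadata):
    """Use the sdist's recorded value if present, else fall back to a build."""
    values = sdist_meta.get_all(field)  # None when the field is absent
    if values is not None:
        return values  # a known-known, recorded in the sdist
    # A known-unknown: ask the (expensive) build step instead.
    return build_metadata().get_all(field, [])


# Hypothetical sdist metadata recording only the static fields.
sdist_meta = HeaderParser().parsestr(
    "Metadata-Version: 2.1\nName: example-pkg\nVersion: 1.0.0\n"
)

calls = []


def fake_build():
    """Stand-in for a PEP 517 metadata build; records that it was invoked."""
    calls.append("built")
    return HeaderParser().parsestr(
        "Metadata-Version: 2.1\nName: example-pkg\nVersion: 1.0.0\n"
        "Requires-Dist: numpy\n"
    )


version = lookup(sdist_meta, "Version", fake_build)
deps = lookup(sdist_meta, "Requires-Dist", fake_build)
```

Note that under this reading an absent multiple-use field such as `Requires-Dist` always triggers the fallback, since absence cannot also mean “statically empty”.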

So what would that file have? Builder and sdist version?

And if we are having an SDIST file, should we go a step further and have an SDIST-METADATA file instead of using the METADATA file name in two different binary artifacts (I’m not suggesting a file format change, just not reusing the file name)?

Another argument for Unreliable-Field (or the equivalent for reliable fields) is to distinguish “specified to be missing” and “unspecified” for multiple-use optional fields, e.g. Requires-Dist. It would be very useful for installers to know whether they need to build a package for dependency information (i.e. Requires-Dist being listed as unreliable), or the package has a static but empty list of dependencies. This is analogous to PEP 621’s dynamic.

We wouldn’t necessarily need to include the unreliable data, we just need a way to distinguish between “reliably unspecified” and “specified but unreliable”. As @uranusjr points out, this is particularly important with repeated-use fields like Requires-Dist where the absence of the field should indicate that no dependencies were specified.
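
Here is a minimal sketch of the installer-side decision this enables. As before, `Unreliable-Field` is the name proposed in this thread and `2.2` is a placeholder for the version that would introduce it; the function name is made up for the example.

```python
from email.parser import HeaderParser


def static_dependencies(meta):
    """Return the dependency list if it can be trusted, else None (must build)."""
    version = tuple(int(p) for p in meta["Metadata-Version"].strip().split("."))
    if version <= (2, 1):
        return None  # legacy metadata: absence of Requires-Dist proves nothing
    unreliable = {v.strip().lower() for v in meta.get_all("Unreliable-Field", [])}
    if "requires-dist" in unreliable:
        return None  # specified but unreliable: a build is needed
    # Reliably specified, possibly as "no dependencies at all".
    return meta.get_all("Requires-Dist", [])


# Hypothetical metadata files; "2.2" is a placeholder version.
no_deps = HeaderParser().parsestr(
    "Metadata-Version: 2.2\nName: example-pkg\nVersion: 1.0.0\n"
)
dynamic_deps = HeaderParser().parsestr(
    "Metadata-Version: 2.2\nName: example-pkg\nVersion: 1.0.0\n"
    "Unreliable-Field: Requires-Dist\n"
)

reliably_empty = static_dependencies(no_deps)
needs_build = static_dependencies(dynamic_deps)
```

An empty list means “no dependencies, no build needed”, while `None` means the installer has to build to find out.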

If your suggestion was that we would only include fields that are required to be reliable in the metadata file, then that changes the discussion drastically, and changes it into something I would consider much less useful. As I mentioned in my response to Donald, I foresee a transition from most metadata being “unreliable” to most metadata being reliable as we improve ways of specifying static metadata and detecting statically-defined metadata in a setup.py. Having a marker for which fields have undergone that transition will make it much easier for pip and other tools to get significant incremental improvements as adoption increases.

Ah, fair enough. That makes sense to me coming from that perspective.

It was, but Tzu-ping’s suggestion clarified the need for it. I guess the word “unreliable” suggests to me something not worth the effort, while saying “it’s purposefully left out as it will never show up” makes more sense to me.