Sdist idea: specifying static metadata that can be trusted

One downside of this is that build tools would either need to change to pick up metadata from the METADATA file when building from sdist to wheel, or someone (build tool or front end) would need to do a sanity check that building the wheel didn’t change the metadata. (I’m assuming that we are all on board with the idea that metadata in the sdist and wheel MUST be the same - otherwise what’s the point of having metadata in the sdist at all?)

This sounds much better, as build tools just have to respect PEP 621, which we’d want them to do anyway. And it somewhat unifies the process of getting metadata from sdists and source checkouts, making it easier for tools to handle both.

Please, no. If we invent the term “source wheel”, we immediately confuse the meaning of the term “wheel” (it no longer equates to “binary”). I can imagine conversations with projects “please provide wheels for platform X” - “we do, we provide source wheels for that platform” - “no, I mean…” I’ve personally always found this to be a frustrating terminology clash with RPMs - RPM vs SRPM.

I’m not averse to changing the format, or the extension, or the name. But please, let’s not call them (a variation of) wheels.

IMO, that’s mostly just an extension of the “stick it in METADATA” approach. Which is possible, but see above regarding whether we’d want tools to use that metadata when building the wheel, or add checks that the two sets of metadata were the same.

(In practice, I’m mostly ambivalent between METADATA and pyproject.toml - for my personal needs, either is fine. I’m just trying to channel backend developers in my comments, and I’ll happily defer to them if they want to add their own views directly).

So what exactly is that metadata? I don’t think I have ever seen an explicit list put forward anywhere of what metadata should be static from a tool perspective. Name and version are seemingly straight-forward since that’s needed for the file name anyway. But does it simply stop there? Or is there more which would be useful and reasonable to have be static?

I purposefully didn’t propose that so as not to be leading or to get into a discussion as to whether reusing .dist-info as a name would cause issues, but I somewhat assumed this would end up being the case. (Same reason I didn’t suggest having a RECORD file, even though I know some security-conscious folks will probably want that.)

I personally assume so. My assumption is source -> cheddar -> wheel leads to more and more static metadata, and once it’s static it’s static/unchanging. And I’m not going past that, so we don’t bikeshed on this thread (someone can start their own topic if they truly want to argue about what something should be called).

Dependencies and Python-Requires. These are crucial for any resolver algorithm. With markers, I’d hope that modern projects could specify these statically (and if there’s a use case where they have to be computed at build time, we should consider whether we can add a new marker to address that).

But my question would be: are we trying to specify just a bare minimum that we have a known use case for? Or are we trying to make it easier for someone with a sdist and an as-yet-unknown use case to introspect the package metadata?

If we want to make access to metadata in general easier, I’d turn the question around and ask what metadata can’t be determined at sdist-production time? The only one I can see (from checking the spec) is Supported-Platform, and I’m going purely off the spec there as I’ve no idea what it’s used for in practice.

Of course, the real difficulty here is setuptools, which is so highly dynamic that anything we try to mandate as static will give them a problem. But I’d hope that a combination of focusing on how people actually specify their metadata, plus normal transition processes, would allow them to move to a situation where projects would have to “opt in” to fully dynamic handling. So I’d prefer not to have standardisation progress blocked because of what people “might do” with setuptools. We still need to look at actual use cases, if anyone has something specific, of course. And if the implementation difficulties for setuptools are too severe, we might need to reconsider. But let’s start by assuming the issue is solvable.


I’d prefer the dist-info directory, and tar or zip doesn’t bother me (let Paul and Donald represent the front ends on that question).

I guess I’m also in the camp that would like a RECORD file (and one day a cryptographic signature), and also explicit metadata that identifies the original source, such as a git URL and commit.

Well, like I said above, dependencies are a bit weird. I’m assuming by static you mean both static AND matching exactly what a resulting wheel would have, in which case I don’t think we can, at least not without implicitly dropping support for certain patterns that are in use today. I mentioned it above, but NumPy has both an API and an ABI that it attempts to express in dependency information, so you can have a project that can be built against one range of versions, but once it’s been built against a specific NumPy version, the dependency specifier must be further constrained to be >= whatever version of NumPy it was built against.

Thanks, I’d forgotten you’d mentioned that above.

Do you have (or does anybody have) a precise explanation of what’s going on here? In terms of an existing PyPI project that distributes wheels with different dependencies, so we can understand the issue in concrete terms?

To be honest, if we can’t read dependency data from the sdist without going through a PEP 517 build step, then I have no real interest (from pip’s point of view) in sdist metadata. As long as we standardise the filename so that getting name and version is reliable, I’m only interested in dependencies.

Without dependency data I’d rather go back to pushing to get PEP 625 approved.
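To illustrate how little is needed for that part: assuming a PEP 625 style `{name}-{version}.tar.gz` layout with a normalized (hyphen-free) project name, getting the name and version out of the filename is a few lines. This is a hypothetical sketch under those assumptions, not pip’s actual code:

```python
import re

def parse_sdist_filename(filename: str) -> tuple[str, str]:
    """Extract (name, version) from a standardized sdist filename.

    Illustrative sketch only: assumes the PEP 625 style
    ``{name}-{version}.tar.gz`` layout, where the name is normalized so
    that it contains no hyphens and the first ``-`` is the separator.
    """
    match = re.fullmatch(r"([a-z0-9_.]+)-([^-]+)\.tar\.gz", filename)
    if match is None:
        raise ValueError(f"not a standardized sdist filename: {filename!r}")
    return match.group(1), match.group(2)
```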

Someone else might be able to chime in with more specifics, but my understanding is that if you use the NumPy C API, you’ll produce a .so that links against NumPy, and the NumPy ABI guarantees that something built against a particular version will continue to link successfully against later versions, but it has no such promises for linking against older versions.

This is a pretty common pattern in C libraries AFAIK? Like not even specific to the Python ecosystem, just C in general, which is a big part of why manylinux builds against old versions of things.

I forgot to say, one of the interesting things to me about using PEP 621 style metadata for sdists is that it can improve the situation for pip without blocking support for things like what NumPy is doing.

PEP 621 has an explicit mechanism for marking a field as dynamic. If we put that into a sdist (but whitelist which fields are allowed to be dynamic inside a sdist), then given a specific sdist we can determine whether we can use that metadata or whether we have to build the wheel first. Thus in the common case, dependency data will be static inside of a sdist, but in the uncommon cases it will not be.
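Concretely, the front-end check could be as small as this sketch; the dict stands in for the parsed `[project]` table of a hypothetical sdist’s pyproject.toml, and the helper name is made up:

```python
# Parsed [project] table of a hypothetical sdist's pyproject.toml, as a
# TOML parser would return it.  "dynamic" is PEP 621's marker for fields
# whose values are only known after a build.
PROJECT = {
    "name": "example",
    "version": "1.0",
    "dynamic": ["dependencies"],
}

def needs_build_for(field: str, project: dict) -> bool:
    """True if *field* can only be obtained via a PEP 517 wheel build."""
    return field in project.get("dynamic", [])

print(needs_build_for("version", PROJECT))       # False: usable as-is
print(needs_build_for("dependencies", PROJECT))  # True: build the wheel first
```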

Even better, it gives us an explicit marker that we can introspect for future refinements. Let’s say we do this and we notice that everyone who depends on tensorflow is not using static dependency information; we look and notice it’s because they want to select dependencies based on CPU capabilities. We can then use that to guide us in creating additional environment markers for CPUs.

Sorry - that one I understand, but I don’t think it relates to Python dependency metadata, unless I’m missing something. Let’s wait for someone who knows the details to clarify.

@brettcannon’s original post suggested "PEP 621 w/ no dynamic". Maybe we need to revisit that and allow for “dynamic”? It sort of re-opens the discussion from the PEP 621 thread of whether we expect metadata consumers to be reading the data from pyproject.toml, but maybe we need to accept that…

I’d still like to hear from setuptools, though. With

description = "Something or other"
setup(name="xxx", version="1.0", description=description)

it’s not entirely obvious how setuptools could even say that description isn’t dynamic. Even asserting that name and version aren’t dynamic would involve checking that a literal was passed, and I’m not sure that’s possible.

So I’d like to know whether we’re even discussing something that can be implemented, before getting too stuck in the details. @pganssle @jaraco any comments?

It’s relevant because you can build an extension module linking against numpy as part of a Python package. This isn’t just a weird corner case - a lot of scientific packages with a compiled component will do this. Numpy has a documented function to support it.

Take h5py as an example (because I’m familiar with it). Its source expresses a dependency on numpy>=1.7, both for build and runtime, and it builds multiple extension modules (using Cython) which link against numpy. But if I build a wheel with numpy 1.19.0 (the current version), that wheel will only work with numpy>=1.19.0 (I believe).

I don’t think we’re representing this properly in metadata for h5py at the moment - like many projects, we don’t tend to change packaging code when it seems to be working, and if the official wheels are built with an old version of numpy, people will rarely see a problem. But I wouldn’t want to close the door on doing that properly, which would require determining Requires-Dist at build time.
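If I follow, the adjustment a backend would need to make at build time amounts to something like this sketch (the function name is made up, and a real backend would query numpy itself for the build-time version):

```python
def runtime_numpy_requirement(built_against: str) -> str:
    """Tighten the runtime NumPy requirement to the build-time version.

    NumPy's ABI promise is forward-only: an extension built against
    version X loads on >= X but not on older releases.  So a wheel built
    against 1.19.0 must declare numpy>=1.19.0 even if the sdist only
    said numpy>=1.7.  Hypothetical sketch, not a real backend API.
    """
    return f"numpy>={built_against}"

# sdist says Requires-Dist: numpy>=1.7; a wheel built against 1.19.0 must say:
print(runtime_numpy_requirement("1.19.0"))  # numpy>=1.19.0
```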


Thanks, that’s a useful example. Am I right in thinking that the resulting wheels would have exactly the same name, but different dependencies and binary compatibility? That’s a scenario that has all sorts of “interesting” implications that I can’t really work through right now…

Note: In this post, I use “dynamic” to mean “not static” and “static” to mean that once the sdist is built the value is fixed; something like setuptools_scm-generated versions would qualify as “static” by this definition, since although the value is not a fixed literal in the source code, it is fixed for any given sdist. It may be worth it to switch from saying “static” and “dynamic” to “reliable” and “unreliable” in this context, since the concept of dynamic fields we care about in this situation is actually a subset of “dynamic” as used by PEP 621.

I think that if we want this to be adopted in a way that will be useful, we’ll need to allow for dynamic fields especially for dependencies. Although it’s not ideal, that’s one of the main places that people do any sort of trickiness with metadata, and like Donald said above we do gain advantages from providing metadata that explicitly says which fields are dynamic, since it would provide a reliable way to determine whether or not you need to do the PEP 517 build step.

I think that as long as we’re allowing anything to be dynamic, we should probably default to allowing everything to be dynamic. I think there are a lot of packages out there that do weird stuff in their setup.py files, including many common anti-patterns that have never caused problems before and so have gotten wide usage. For any field we decide to mark as “must be static”, it will mean that anyone who is dynamically generating that field will be blocked from upgrading at all, whereas if we allow any field to be dynamic, we can upgrade everyone right away and at least we’ll know which fields have been specified statically and which have been specified dynamically (we’re basically no worse off if a given sdist specifies all fields as dynamic — that’s the status quo).

This relates to why I think “dynamic” will be required. My thinking was that we can start by marking anything coming from setup.cfg or pyproject.toml (after support for that is added) as static and everything else as dynamic. Probably we could also get known-reliable static metadata from things specified in code by parsing the AST of the setup.py and looking for a set of known-safe scenarios. In your example, we could see that description = "Something or other" is a literal and there’s no opportunity for description to be modified before setup() is called. I would guess that we’ll capture a good fraction of common use cases with this, though a lot of packages will be too complicated for whatever simple “static value detection” algorithms we intend to use.
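A deliberately conservative sketch of what such an AST-based check could look like; only a bare setup(...) call with literal keyword values is recognised, and everything else (including setuptools.setup(...) called via an attribute) would be classified as dynamic:

```python
import ast

def static_setup_kwargs(setup_py_source: str) -> dict:
    """Collect setup() keyword arguments that are plain literals.

    Conservative by design: only a call written literally as setup(...)
    is inspected, and only keywords whose values are constants count as
    static.  Something like description=some_var is treated as dynamic.
    """
    found = {}
    for node in ast.walk(ast.parse(setup_py_source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "setup"):
            for kw in node.keywords:
                if kw.arg is not None and isinstance(kw.value, ast.Constant):
                    found[kw.arg] = kw.value.value
    return found
```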

Probably the most common issue with any sort of AST parser will be that we’re too conservative and we fail to see that something like this is actually static:

def get_readme():
    x = "literal_string"
    return f"{x}"


Though we also have the added wrinkle that even if we can guarantee the inputs to setuptools.setup() are static, people can define their own command classes, and have many places where they could modify the existing metadata, so no matter how conservative we are with the “static value detection” algorithm, it will always be possible that a value specified as static is actually dynamic.

One thing we could do to make it so that the sdist metadata reliably tells you what a wheel build will do, though, would be to make it so that when building from an sdist, setuptools first reads the standardized metadata file and anything not specified as “dynamic” is unconditionally used for the wheel build. That would be a breaking change in setuptools, but a pretty obscure breakage not likely to affect many packages directly.

I will also say that I don’t like the idea of using PEP 621 (with or without dynamic fields) here, for several reasons:

  1. We already have a standardized metadata format that many tools are compatible with and must continue to be compatible with. It is not worth it to add in a second metadata format when all we really need is an additional field that specifies which fields contain reliable metadata.
  2. I think that sdist metadata should be something primarily edited (and to a lesser extent read) by machines rather than humans, whereas PEP 621 is an input format designed to be used by humans. Making them the same format will likely lead to unnecessary tensions in the designs.
  3. Even allowing dynamic fields, we’d still need to make modifications to the PEP 621-related fields in the pyproject.toml to fix the values for things that are tool-provided but fixed at build time — “dynamic” fields that are “reliable”. This will likely lead to more confusion and complication, since either we’re including a different pyproject.toml in your source distributions than the one in the repo (don’t like that at all), or now we’re adding a metadata.toml file that contains some but not all of the same information. This also has the potential to further confuse the messaging around PEP 621, which right now is confusing enough just as a simple input file.

I also think that if we use the existing, widely-used metadata format, it really helps reduce the scope of this project: the minimum viable set of decisions to make is something like where the file goes, how to mark fields as unreliable and which fields (if any) are forbidden from being unreliable. If we use PEP 621, we have to re-examine all of its design decisions in light of the fact that it’s now being used as a static metadata source rather than a hand-edited input file — and this could either end up needlessly entangling the two projects in case modifications to PEP 621 are required to make it suitable, or it could lead to a divergence between PEP 621 the input file and our PEP 621-like standardized sdist metadata file. By using the METADATA format, we avoid this morass and let the two projects proceed in parallel.


One thing that makes me lean towards using PEP 621 is that inside a sdist a build tool is still going to be reading pyproject.toml as an input for producing the wheel, so we can’t eliminate that file; it’s still going to exist no matter what we do.

Which then means our choice largely becomes: do we add a second file to sdists that specifies metadata for non-build tools to read? If we do, what happens if these two files say something different?

My suspicion is whatever file the spec says is to be used will automatically take precedence and the other file would be ignored.

But aren’t we already talking about adding new files into a new-sdist? And if PEP 517 has hooks to generate the new-sdist then doesn’t that mean there’s already a chance there will be changed files from the build tool compared to the source the new-sdist was generated from?

To me the choice seems to come down to whether you view new-sdists as another binary artifact that is an input to a tool that leads to a package being installed on a system, or not. If you take the binary artifact view, then that speaks strongly to .dist-info/METADATA, as a new-sdist is just the first/next step towards a wheel. But if you view new-sdists as a standard way to package up source code that human beings are reasonably expected to look at and play with, that makes me think that PEP 621 is a better fit for its readability.

So would that mean if you’re building a wheel from a sdist, and it has version 1.0 in the pyproject.toml and version 2.0 in the METADATA, that we would expect the build tool to honor what’s in METADATA?

Technically, build tools don’t need to read pyproject.toml. My understanding is that PEP 621 is not intended to be mandatory. That said, even assuming universal adoption, build tools already need to be able to write METADATA files (to generate the metadata in wheels), but they don’t need to be able to write pyproject.toml files, just read them. Tools processing metadata already need the capability to read METADATA files, but not pyproject.toml, so I think this argument actually cuts against pyproject.toml.

We must add a second file (or, I suppose, mutate the first one?), and it will undoubtedly say something different, because many packages will generate metadata at build time rather than having it statically specified in the input.

Even if we mutate the file, it doesn’t eliminate the question of precedence, it just answers it as “the new file takes precedence”. In any case, it shouldn’t matter much, because we should definitely say that a tool is not compliant with the spec if running it again with the same pyproject.toml produces different values in any “reliable” fields.
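Spelled out, that compliance check is cheap. A sketch with made-up names, treating every field not listed as dynamic as “reliable” and therefore required to survive the build unchanged:

```python
def reliability_violations(sdist_meta: dict, wheel_meta: dict) -> list:
    """Fields the sdist declared reliable but the wheel build changed.

    Hypothetical helper: *sdist_meta* is PEP 621 style metadata carrying
    a "dynamic" list; every other field must come out of the wheel build
    with exactly the value the sdist recorded.
    """
    dynamic = set(sdist_meta.get("dynamic", []))
    return [
        field
        for field, value in sdist_meta.items()
        if field != "dynamic"
        and field not in dynamic
        and wheel_meta.get(field) != value
    ]
```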

Yes, this is the case, but it’s not exactly something I would generally recommend, let alone mandate. To the extent that sdists contain different stuff than a repo checkout, it’s usually that the source distribution leaves out stuff specific to the repo (.gitignore, CI configuration, etc), and sometimes includes generated files (like setuptools_scm generating a file, or the controversial (anti-?)pattern of including generated C code in an sdist). I think it’s exceedingly rare to mutate existing files. If we were to mutate the pyproject.toml file as part of the inclusion, users looking at the source distribution wouldn’t be able to see the actual source that is used to generate the package!

I think an sdist is both a binary artifact for building from and a source for reading from. As a source for humans to read, and because it’s a required configuration file for building the package, the pyproject.toml file will be included anyway — I think we should include it in its unmodified form, which humans should be able to read easily. And humans don’t necessarily need strict metadata about whether things are reliable or not (if we’ve done our job right designing PEP 621, humans should be able to reasonably easily discern which values are provided by the tool).

I also think that the METADATA files are not terribly difficult for humans to read; they’re a newline-delimited set of key: value pairs with minimal syntax. If you need to pop into the METADATA file to see what the tool-provided values resolved to, I don’t think you’d have much trouble doing so.
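For instance, since core metadata uses RFC 822 style headers, the stdlib email parser reads it directly; the METADATA fragment below is invented for illustration:

```python
from email.parser import HeaderParser

# An invented METADATA fragment in the standard core-metadata format.
METADATA_TEXT = """\
Metadata-Version: 2.1
Name: example
Version: 1.0
Requires-Dist: numpy>=1.19.0
Requires-Dist: h5py
"""

msg = HeaderParser().parsestr(METADATA_TEXT)
print(msg["Name"], msg["Version"])   # example 1.0
print(msg.get_all("Requires-Dist"))  # ['numpy>=1.19.0', 'h5py']
```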

This argument is convincing enough for me. I’m quite happy to not add new code to generate another file. I deliberately designed my sdist generation to write and use METADATA as-is, so selfishly I’d be happy to not have to add a new output format :slight_smile:

Yep, you could build two wheels with different versions of numpy, and they would have the same compatibility tags, thus the same filename, but not be compatible with the same versions of numpy.

Numpy is the only case where I can point to specific examples like this, but it doesn’t have any particular special status - it’s just a widely used package with a C API. I wouldn’t be entirely surprised to see something similar happen around PyQt, for instance.