Sdist idea: specifying static metadata that can be trusted

brettcannon · July 14, 2020, 1:19am

For me, sdists are a way to provide what is necessary to build a wheel (see Purpose of an sdist). That means, to me, that an sdist is a build artifact, albeit one that is part way between source code and a wheel. As such, an sdist is more than just an archive of source. That suggests to me that we could have static metadata up to what isn’t specified as part of a wheel build.

Historically, though, there hasn’t been any metadata that could be trusted in an sdist as there hasn’t been a spec on how to handle it. As such that seems to be one of the stumbling blocks of standardizing sdists (Purpose of an sdist being the other; one could argue a manifest of included files, but I will leave that to others who care more about that sort of thing). So this topic is about two potential solutions I see to the “static metadata problem” of sdists.

Stick in `METADATA`

If you look at what goes into https://packaging.python.org/specifications/core-metadata/ you will see that almost everything there could be specified statically in an sdist. Probably the only thing you can’t specify is Supported-Platform. Otherwise we already have tools that know how to process a METADATA file. It also means that wheel building has a chunk of work done for it already, and the file can just be appended to for the final .dist-info/METADATA.
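As a rough illustration of how little tooling this needs: the core metadata format is a set of email-style headers, so the standard library can already read it. The sketch below is only an assumption about how a consumer might do that; the sample contents and the helper name `read_core_metadata` are made up for the example.

```python
from email.parser import Parser

# Hypothetical METADATA contents for illustration; not taken from a real project.
SAMPLE_METADATA = """\
Metadata-Version: 2.1
Name: example-project
Version: 1.0
Requires-Python: >=3.6
Requires-Dist: requests>=2.0
Requires-Dist: numpy>=1.7; python_version >= "3.6"
"""


def read_core_metadata(text):
    """Parse core metadata text into a dict of field name -> list of values."""
    message = Parser().parsestr(text)
    fields = {}
    for key, value in message.items():
        # Fields such as Requires-Dist may appear more than once.
        fields.setdefault(key, []).append(value)
    return fields


metadata = read_core_metadata(SAMPLE_METADATA)
print(metadata["Name"], metadata["Version"], metadata["Requires-Dist"])
```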

PEP 621 w/ no `dynamic`

Nick proposed in PEP 621: Storing project metadata in pyproject.toml the idea of having an sdist have pyproject.toml written out with the static info the sdist build tool knows. To me that means writing out what PEP 621 specifies but with all the details listed in dynamic written back out to pyproject.toml.

I personally like this as it makes sdists more static than a source checkout, but it doesn’t overly inhibit what an “sdist” is to people (e.g. include Cython-generated code?). It also keeps the metadata easy to read if one digs into the sdist to look at things.
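To make the "write the dynamic details back out" step concrete, here is a minimal sketch that works on an already-parsed `[project]` table. The function name `freeze_project_table` and the sample values are hypothetical, and reading or writing the TOML itself is left out.

```python
def freeze_project_table(project, resolved):
    """Return a copy of a PEP 621 [project] table with no `dynamic` entries.

    `project` is the parsed [project] table as a dict; `resolved` maps each
    dynamic field name to the value the build tool computed for it.
    """
    frozen = {key: value for key, value in project.items() if key != "dynamic"}
    for field in project.get("dynamic", []):
        if field not in resolved:
            raise ValueError(f"no value computed for dynamic field {field!r}")
        frozen[field] = resolved[field]
    return frozen


# Hypothetical source-checkout metadata: the version is computed by the tool.
source_project = {
    "name": "example-project",
    "dynamic": ["version"],
    "dependencies": ["requests>=2.0"],
}
sdist_project = freeze_project_table(source_project, {"version": "1.0"})
print(sdist_project)
```

The result is what the sdist build tool would serialize into the sdist’s pyproject.toml under this option.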

dstufft · July 14, 2020, 4:50am

So I don’t think that sdists can have purely static metadata without it being a regression in terms of features that projects are actually able to use. That’s not a wrong thing exactly, it’s a trade off and one we should make explicitly.

The main thing that people typically modify in a sdist is dynamically selecting dependencies. A lot of that has been replaced with environment markers, but I don’t think all of it has. One example that came out of previous discussions is that with NumPy you can depend on one set of versions at an API level, which is what you’d want to do in an sdist… but then once you build the sdist the wheel ends up being further constrained by the version you happened to build against. Something like that means you can’t really trust the dependency metadata inside of a wheel, even if it can be declared statically in terms of input to the build system, it does not mean that you can make assumptions that those same dependencies will line up 1:1 in terms of what gets emitted to the wheel.

It’s also possible that other metadata might need to be adjusted? There’s some that certainly should not be allowed to be dynamic, such as name and version number and such. So one thing we could certainly do is further constrain the fields that are allowed to be dynamic without eliminating them completely (perhaps we could even whitelist the fields that are allowed to be dynamic inside of a sdist, since it’s easier to start allowing a field than it is to remove it later, and a good goal is to make as much of it static as possible).

As to the two options, given that I don’t think all of the metadata can always be static, and that sdists are effectively used as inputs into a build system to produce binary artifacts (whether those are wheels or something else), it makes sense to me to have sdists use PEP 621, with a limited whitelist of what attributes can be dynamic (I would also include some additional, generated metadata, like sdist version, etc similar to what we do in wheels so that tools can actually check if they support the features this particular sdist requires of the packaging toolchain).

As a side note, I’ve always thought that if we ever standardized sdists, it would be a good idea to just leave the .tar.gz extension in the past, and start calling them “source wheels”, and give them an extension like .src.whl or .swhl or something. This allows us to align our terminology with wheels better, and it also gives us an opportunity to be more strict about the actual internal shape of these archives, instead of the YOLO that currently exists. It also acts as a hard barrier for the old, unversioned style versus the new versioned style which makes it easier to evolve in the future.

ncoghlan · July 14, 2020, 9:00am

There’s a 3rd option to consider, which is to allow sdists to contain a wheel-style dist-info directory.

The advantage of this is that if a project can generate completely static metadata for inclusion in the sdist, then https://www.python.org/dev/peps/pep-0517/#prepare-metadata-for-build-wheel processing could be amended to allow taking it from the sdist without needing to set up a build environment first.

pf_moore · July 14, 2020, 9:12am

One downside of this is that build tools would either need to change to pick up metadata from the METADATA file when building from sdist to wheel, or someone (build tool or front end) would need to do a sanity check that building the wheel didn’t change the metadata. (I’m assuming that we are all on board with the idea that metadata in the sdist and wheel MUST be the same - otherwise what’s the point of having metadata in the sdist at all?)
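A sanity check of that kind could be fairly small. The following is only a sketch under the assumption that both files use the existing core metadata format; the helper names and the sample contents are invented for illustration, and field names are compared case-sensitively for simplicity.

```python
from email.parser import Parser


def metadata_fields(text):
    """Parse core metadata text into {field name: sorted list of values}."""
    message = Parser().parsestr(text)
    fields = {}
    for key, value in message.items():
        fields.setdefault(key, []).append(value)
    return {key: sorted(values) for key, values in fields.items()}


def changed_fields(sdist_text, wheel_text, ignore=()):
    """Return the names of fields whose values differ between the two files."""
    sdist = metadata_fields(sdist_text)
    wheel = metadata_fields(wheel_text)
    names = (set(sdist) | set(wheel)) - set(ignore)
    return sorted(name for name in names if sdist.get(name) != wheel.get(name))


# Hypothetical contents for illustration.
sdist_metadata = "Metadata-Version: 2.1\nName: example\nVersion: 1.0\nRequires-Dist: numpy>=1.7\n"
wheel_metadata = "Metadata-Version: 2.1\nName: example\nVersion: 1.0\nRequires-Dist: numpy>=1.19.0\n"
print(changed_fields(sdist_metadata, wheel_metadata))
```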

This sounds much better, as build tools just have to respect PEP 621, which we’d want them to do anyway. And it somewhat unifies the process of getting metadata from sdists and source checkouts, making it easier for tools to handle both.

Please, no. If we invent the term “source wheel”, we immediately confuse the meaning of the term “wheel” (it no longer equates to “binary”). I can imagine conversations with projects “please provide wheels for platform X” - “we do, we provide source wheels for that platform” - “no, I mean…” I’ve personally always found this to be a frustrating terminology clash with RPMs - RPM vs SRPM.

I’m not averse to changing the format, or the extension, or the name. But please, let’s not call them (a variation of) wheels.

IMO, that’s mostly just an extension of the “stick it in METADATA” approach. Which is possible, but see above regarding whether we’d want tools to use that metadata when building the wheel, or add checks that the two sets of metadata were the same.

(In practice, I’m mostly ambivalent between METADATA and pyproject.toml - for my personal needs, either is fine. I’m just trying to channel backend developers in my comments, and I’ll happily defer to them if they want to add their own views directly).

ambv · July 16, 2020, 5:41pm

4 posts were split to a new topic: Sdist metadata: Store in special fields?

brettcannon · July 14, 2020, 9:28pm

So what exactly is that metadata? I don’t think I have ever seen an explicit list put forward anywhere of what metadata should be static from a tool perspective. Name and version are seemingly straight-forward since that’s needed for the file name anyway. But does it simply stop there? Or is there more which would be useful and reasonable to have be static?

I purposefully didn’t propose that so as to not be leading or get into a discussion as to whether reusing .dist-info as a name would cause issues, but I somewhat assumed this would end up being the case. (Same reason I didn’t suggest having a RECORD file even though I know some security-conscious folks will probably want that.)

I personally assume so. My assumption is source → cheddar → wheel leads to more and more static metadata, and once it’s static it’s static/unchanging.

Cheese Shop sketch - Wikipedia, and I’m not going past that to make sure we don’t bikeshed on this thread (someone can start their own topic if they truly want to argue about what something should be called).

pf_moore · July 14, 2020, 10:39pm

Dependencies and Python-Requires. These are crucial for any resolver algorithm. With markers, I’d hope that modern projects could specify these statically (and if there’s a use case where they have to be computed at build time, we should consider whether we can add a new marker to address that).

But my question would be, are we trying to specify just a bare minimum that we have a known use case for? Or are we trying to make it easier for someone with a sdist and an as yet unknown use case, to introspect the package metadata?

If we want to make access to metadata in general easier, I’d turn the question around and ask what metadata can’t be determined at sdist-production time? The only one I can see (from checking the spec) is Supported-Platform, and I’m going purely off the spec there as I’ve no idea what it’s used for in practice.

Of course, the real difficulty here is setuptools, which is so highly dynamic that anything we try to mandate as static will give them a problem. But I’d hope that a combination of focusing on how people actually specify their metadata, plus normal transition processes, would allow them to move to a situation where projects would have to “opt in” to fully dynamic handling. So I’d prefer not to have standardisation progress blocked because of what people “might do” with setuptools. We still need to look at actual use cases, if anyone has something specific, of course. And if the implementation difficulties for setuptools are too severe, we might need to reconsider. But let’s start by assuming the issue is solveable.

steve.dower · July 15, 2020, 7:54am

I’d prefer the dist-info directory, and tar or zip doesn’t bother me (let Paul and Donald represent the front ends on that question).

I guess I’m also in the camp that would like a RECORD file (and one day a cryptographic signature), and also explicit metadata that identifies the original source, such as a git URL and commit.

dstufft · July 15, 2020, 1:30pm

Well like I said above, dependency is a bit weird. I’m assuming by static you mean both static AND it will match exactly what a resulting wheel would have, which in that case I don’t think we can, at least not without implicitly dropping support for certain patterns that are in use today. I mentioned it above, but NumPy has both an API and an ABI that is attempting to be expressed in dependency information, so you can have a project that can be built against one range of versions, but once it’s been built against a specific NumPy then this dependency specifier must be further constrained to be >= whatever version of NumPy it was built against.

pf_moore · July 15, 2020, 1:57pm

Thanks, I’d forgotten you’d mentioned that above.

Do you have (or does anybody have) a precise explanation of what’s going on here? In terms of an existing PyPI project that distributes wheels with different dependencies, so we can understand the issue in concrete terms?

To be honest, if we can’t read dependency data from the sdist without going through a PEP 517 build step, then I have no real interest (from pip’s point of view) in sdist metadata. As long as we standardise the filename so that getting name and version is reliable, I’m only interested in dependencies.

Without dependency data I’d rather go back to pushing to get PEP 625 approved.

dstufft · July 15, 2020, 2:11pm

Someone else might be able to chime in with more specifics, but my understanding is that if you use the NumPy C API, you’ll produce a .so that links against NumPy, and the NumPy ABI guarantees that something built against a particular version will continue to link successfully against later versions, but it has no such promises for linking against older versions.

This is a pretty common pattern in C libraries AFAIK? Like not even specific to the Python ecosystem, just C in general, which is a big part of why manylinux builds against old versions of things.

dstufft · July 15, 2020, 2:18pm

I forgot to say, one of the interesting things to me about using PEP 621 style metadata for sdists, is it can improve the situation for pip, without blocking support for things like what NumPy is doing.

PEP 621 has an explicit mechanism for marking a field as dynamic. If we put that into a sdist (but whitelist which fields are allowed to be dynamic inside a sdist) then given a specific sdist, we can determine whether we can use that metadata or if we have to build the wheel first. Thus for the common case, dependency data will be static inside of a sdist, but in the uncommon cases it will not be.
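Roughly, the check a tool like pip would make could look like the sketch below. The whitelist contents and both function names are assumptions for illustration only, since which fields should be allowed is exactly what is still open here; `project` stands for the parsed `[project]` table from the sdist.

```python
# Hypothetical whitelist; the thread does not settle which fields belong in it.
ALLOWED_DYNAMIC = {"dependencies", "optional-dependencies"}


def check_sdist_dynamic(project, allowed=ALLOWED_DYNAMIC):
    """Raise if the sdist marks a field as dynamic that is not whitelisted."""
    forbidden = sorted(set(project.get("dynamic", [])) - set(allowed))
    if forbidden:
        raise ValueError(f"fields may not be dynamic in an sdist: {forbidden}")


def needs_build(project, wanted):
    """Return True if any wanted field is dynamic, so a wheel build is required."""
    return bool(set(wanted) & set(project.get("dynamic", [])))


static_sdist = {"name": "a", "version": "1.0", "dependencies": ["requests"]}
dynamic_sdist = {"name": "b", "version": "1.0", "dynamic": ["dependencies"]}

for project in (static_sdist, dynamic_sdist):
    check_sdist_dynamic(project)
    print(project["name"], needs_build(project, ["dependencies"]))
```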

Even better, it gives us an explicit marker that we can introspect for future refinements. Let’s say we do this and we notice that everyone who depends on tensorflow is not using static dependency information; we look and notice it’s because they want to select dependencies based on CPU capabilities. We can then use that to guide us to create additional environment markers for CPUs.

pf_moore · July 15, 2020, 3:11pm

Sorry - that one I understand, but I don’t think it relates to Python dependency metadata, unless I’m missing something. Let’s wait for someone who knows the details to clarify.

@brettcannon’s original post suggested “PEP 621 w/ no dynamic”. Maybe we need to revisit that and allow for “dynamic”? It sort of re-opens the discussion from the PEP 621 thread of whether we expect metadata consumers to be reading the data from pyproject.toml, but maybe we need to accept that…

I’d still like to hear from setuptools, though. With

from setuptools import setup
description = "Something or other"
setup(name="xxx", version="1.0", description=description)

it’s not entirely obvious how setuptools could even say that description isn’t dynamic. Even asserting that name and version aren’t dynamic would involve checking that a literal was passed, and I’m not sure that’s possible.

So I’d like to know whether we’re even discussing something that can be implemented, before getting too stuck in the details. @pganssle @jaraco any comments?

takluyver · July 15, 2020, 4:01pm

It’s relevant because you can build an extension module linking against numpy as part of a Python package. This isn’t just a weird corner case - a lot of scientific packages with a compiled component will do this. Numpy has a documented function to support it.

Take h5py as an example (because I’m familiar with it). Its source expresses a dependency on numpy>=1.7, both for build and runtime, and it builds multiple extension modules (using Cython) which link against numpy. But if I build a wheel with numpy 1.19.0 (the current version), that wheel will only work with numpy>=1.19.0 (I believe).

I don’t think we’re representing this properly in metadata for h5py at the moment - like many projects, we don’t tend to change packaging code when it seems to be working, and if the official wheels are built with an old version of numpy, people will rarely see a problem. But I wouldn’t want to close the door on doing that properly, which would require determining Requires-Dist at build time.
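For what “determining Requires-Dist at build time” might amount to, here is a minimal sketch. The function name `narrow_requirement` is hypothetical, and it simply appends a lower bound to the specifier declared in the source rather than doing any real version parsing.

```python
def narrow_requirement(name, source_specifier, built_against):
    """Combine the sdist's specifier with a lower bound on the build-time version."""
    return f"{name}{source_specifier},>={built_against}"


# Versions taken from the h5py example above.
print(narrow_requirement("numpy", ">=1.7", "1.19.0"))
```

The sdist would keep numpy>=1.7, while each wheel would carry the narrowed requirement for the numpy it was actually built with.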

pf_moore · July 15, 2020, 6:05pm

Thanks, that’s a useful example. Am I right in thinking that the resulting wheels would have exactly the same name, but different dependencies and binary compatibility? That’s a scenario that has all sorts of “interesting” implications that I can’t really work through right now…

pganssle · July 15, 2020, 7:08pm

Note: In this post, I use “dynamic” to mean “not static” and “static” to mean that once the sdist is built the value is fixed; something like setuptools_scm-generated versions would qualify as “static” by this definition, since although the value is not a fixed literal in the source code, it is fixed for any given sdist. It may be worth it to switch from saying “static” and “dynamic” to “reliable” and “unreliable” in this context, since the concept of dynamic fields we care about in this situation is actually a subset of “dynamic” as used by PEP 621.

I think that if we want this to be adopted in a way that will be useful, we’ll need to allow for dynamic fields especially for dependencies. Although it’s not ideal, that’s one of the main places that people do any sort of trickiness with metadata, and like Donald said above we do gain advantages from providing metadata that explicitly says which fields are dynamic, since it would provide a reliable way to determine whether or not you need to do the PEP 517 build step.

I think that as long as we’re allowing anything to be dynamic, we should probably default to allowing everything to be dynamic. I think there are a lot of packages out there that do weird stuff in their setup.py, including many common anti-patterns that have never caused problems before and so they’ve gotten wide usage. For any field we decide to mark as “must be static” it will mean that anyone who is dynamically generating that field will be blocked from upgrading at all, whereas if we allow any field to be dynamic, we can upgrade everyone right away and at least we’ll know which fields have been specified statically and which have been specified dynamically (we’re basically no worse off if a given sdist specifies all fields as dynamic — that’s the status quo).

This relates to why I think “dynamic” will be required. My thinking was that we can start by marking anything coming from setup.cfg or pyproject.toml (after support for that is added) as static and everything else as dynamic. Probably we could also get known-reliable static metadata from things specified in code by parsing the AST of the setup.py and looking for a set of known-safe scenarios. In your example, we could see that description = "Something or other" is a literal and there’s no opportunity for description to be modified before setup() is called. I would guess that we’ll capture a good fraction of common use cases with this, though a lot of packages will be too complicated for whatever simple “static value detection” algorithms we intend to use.

Probably the most common issue with any sort of AST parser will be that we’re too conservative and we fail to see that something like this is actually static:

def get_readme():
    x = "literal_string"
    return f"{x}"

setup(description=get_readme())

Though we also have the added wrinkle that even if we can guarantee the inputs to setuptools.setup() are static, people can define their own command classes, and have many places where they could modify the existing metadata, so no matter how conservative we are with the “static value detection” algorithm, it will always be possible that a value specified as static is actually dynamic.
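To give a feel for what such a “static value detection” pass could look like, here is a deliberately conservative sketch using the standard `ast` module. It is not how setuptools works today; the function name `literal_setup_kwargs` is made up, and it only accepts keyword arguments that are literals written directly in the setup() call.

```python
import ast


def literal_setup_kwargs(source):
    """Return setup() keyword arguments that are plain literals in the call.

    Deliberately conservative: names, function calls, and f-strings are skipped,
    even when a human reader could tell that the value never changes.
    """
    found = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        called = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if called != "setup":
            continue
        for keyword in node.keywords:
            if keyword.arg is None:  # **kwargs: nothing can be said statically
                continue
            try:
                found[keyword.arg] = ast.literal_eval(keyword.value)
            except (ValueError, TypeError):
                pass  # not a literal, so treat the field as dynamic
    return found


SETUP_PY = '''
def get_readme():
    x = "literal_string"
    return f"{x}"

setup(name="xxx", version="1.0", description=get_readme())
'''
print(literal_setup_kwargs(SETUP_PY))
```

Here name and version are picked up, while description is left out even though it is effectively constant, which is the “too conservative” failure described above. Nothing in this check can see what a custom command class does later.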

One thing we could do to make it so that the sdist metadata reliably tells you what a wheel build will do, though, would be to make it so that when building from an sdist, setuptools first reads the standardized metadata file and anything not specified as “dynamic” is unconditionally used for the wheel build. That would be a breaking change in setuptools, but a pretty obscure breakage not likely to affect many packages directly.

pganssle · July 15, 2020, 7:38pm

I will also say that I don’t like the idea of using PEP 621 (with or without dynamic fields) for several reasons:

We already have a standardized metadata format that many tools are compatible with and must continue to be compatible with. It is not worth it to add in a second metadata format when all we really need is an additional field that specifies which fields contain reliable metadata.
I think that sdist metadata should be something primarily edited (and to a lesser extent read) by machines rather than humans, whereas PEP 621 is an input format designed to be used by humans. Making them the same format will likely lead to unnecessary tensions in the designs.
Even allowing dynamic fields, we’d still need to make modifications to the PEP 621-related fields in the pyproject.toml to fix the values for things that are tool-provided but fixed at build time — “dynamic” fields that are “reliable”. This will likely lead to more confusion and complication, since either we’re including a different pyproject.toml in your source distributions than the one in the repo (don’t like that at all), or now we’re adding a metadata.toml file that contains some but not all of the same information. This also has the potential to further confuse the messaging around PEP 621, which right now is confusing enough just as a simple input file.

I also think that if we use the existing, widely-used metadata format, it really helps reduce the scope of this project: the minimum viable set of decisions to make is something like where the file goes, how to mark fields as unreliable and which fields (if any) are forbidden from being unreliable. If we use PEP 621, we have to re-examine all of its design decisions in light of the fact that it’s now being used as a static metadata source rather than a hand-edited input file — and this could either end up needlessly entangling the two projects in case modifications to PEP 621 are required to make it suitable, or it could lead to a divergence between PEP 621 the input file and our PEP 621-like standardized sdist metadata file. By using the METADATA format, we avoid this morass and let the two projects proceed in parallel.
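As a sketch of the “additional field” idea: the field name `Dynamic` below is purely an assumption for illustration (no name has been proposed in this thread), as are the sample contents and the helper `reliable_value`. Everything else is the existing METADATA format read with the standard library.

```python
from email.parser import Parser

# Hypothetical sdist METADATA; the "Dynamic" field name is an assumption.
SDIST_METADATA = """\
Metadata-Version: 2.1
Name: example
Version: 1.0
Requires-Python: >=3.6
Dynamic: Requires-Dist
"""


def reliable_value(text, field):
    """Return the values of `field`, or None if the sdist marks it unreliable."""
    message = Parser().parsestr(text)
    unreliable = {name.lower() for name in message.get_all("Dynamic", [])}
    if field.lower() in unreliable:
        return None
    return message.get_all(field, [])


print(reliable_value(SDIST_METADATA, "Version"))
print(reliable_value(SDIST_METADATA, "Requires-Dist"))
```

A consumer that gets None back knows it has to fall back to the PEP 517 build step for that field.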

dstufft · July 15, 2020, 8:26pm

One thing that makes me lean towards using PEP 621, is that inside a sdist a build tool is still going to be reading pyproject.toml as an input for producing the wheel, so we can’t eliminate that file, it’s still going to exist no matter what we do.

Which then means our choice largely becomes, do we add a second file to sdists that specify metadata for non-build tools to read? If we do, what happens if these two files say something different?

brettcannon · July 15, 2020, 9:05pm

My suspicion is whatever file the spec says is to be used will automatically take precedence and the other file would be ignored.

But aren’t we already talking about adding new files into a new-sdist? And if PEP 517 has hooks to generate the new-sdist then doesn’t that mean there’s already a chance there will be changed files from the build tool compared to the source the new-sdist was generated from?

To me the choice seems to come down to whether you view new-sdists as another binary artifact that is an input to a tool that leads to a package being installed on a system or not. If you take the binary artifact view then that speaks strongly to .dist-info/METADATA as a new-sdist is just the first/next step towards a wheel. But if you view new-sdists as a standard way to package up source code that human beings are reasonably expected to look at and play with, that makes me think that PEP 621 is a better fit for its readability.

dstufft · July 15, 2020, 9:17pm

So would that mean if you’re building a wheel from an sdist, and it has version 1.0 in the pyproject.toml and version 2.0 in the METADATA that we would expect the build tool to honor what’s in METADATA?
