Sdists (again): Metadata standardisation incremental update

steve.dower · October 22, 2020, 12:55pm

Guess I’m the only one who gets to ignore it then.

Though more likely I’ll have to add some kind of wheel-only metadata callback so that users can modify it at that stage. But I don’t think I have any need for that, so it won’t be a priority (and I’m pretty sure I’m the only user of PyMSBuild right now).

pf_moore · October 22, 2020, 1:12pm

That’s not necessarily “dynamic” in the strictest sense.

If a tool like flit, PyMSBuild or hatch reads a configuration file of constant values (which could be pyproject.toml or something else), and writes both that configuration file and the PKG-INFO file to the sdist, then when building from the sdist it will read the guaranteed same configuration, and so will certainly generate the same metadata in the wheel. So those backends can accurately say that all of the metadata is static. The key here is that the backend knows its build process, so it can prove that nothing in the sdist can result in changes in the wheel metadata.

Setuptools is fundamentally different, in that it calculates the metadata by running Python code. It therefore has a much harder time proving that “nothing can change between the time I generate this sdist and when it’s used to build a wheel”. But the fact that setuptools is so dynamic is something that they have to deal with in any approach that is looking for meaningful sdist metadata.

Basically, Dynamic is an escape hatch for backends that have a mechanism where the data stored in the sdist isn’t sufficient in isolation to calculate the wheel metadata (so “the environment at wheel build time” is relevant). In practice, setuptools is the only backend that does that, as far as I know. But I don’t consider this as “designing specifically for setuptools” so much as “designing for a requirement that only setuptools supports”.

Put it another way: Static metadata is a limited form of “reproducible build” mechanism. If the sdist contains sufficient information that every wheel generated from that sdist will contain the same value for a given metadata item, that item can (and should) be marked as static. Then tools wanting project metadata relevant to a particular install target can consult the sdist and know that if the sdist says the value is X and it’s static, they can skip doing a build step just to check.
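As a rough illustration of the consumer side, here is a minimal sketch of a tool deciding whether it can trust a value straight from PKG-INFO. It assumes the "mark dynamic" spelling discussed in this thread (a repeatable Dynamic field) and that the file already uses the new metadata version; the function name and the sample data are made up for the example.

```python
from email.parser import HeaderParser
from typing import List, Optional


def static_values(pkg_info_text: str, field: str) -> Optional[List[str]]:
    """Return the values of `field` if the sdist vouches for them, else None.

    Assumes the PKG-INFO uses a metadata version that has the Dynamic field.
    """
    msg = HeaderParser().parsestr(pkg_info_text)
    dynamic = {name.lower() for name in msg.get_all("Dynamic", [])}
    if field.lower() in dynamic:
        # The value may change at wheel build time, so a build is needed.
        return None
    return msg.get_all(field, [])


# Hypothetical PKG-INFO content, for illustration only.
PKG_INFO = """\
Metadata-Version: 2.2
Name: example
Version: 1.0
Requires-Python: >=3.6
Dynamic: Requires-Dist
"""

print(static_values(PKG_INFO, "Requires-Python"))  # trusted, no build needed
print(static_values(PKG_INFO, "Requires-Dist"))    # None: must build to find out
```

An empty list here means "static and not present", which is different from None ("can't tell without building").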

pf_moore · October 22, 2020, 1:19pm

One of the reasons for the spec saying that fields like Author must not be marked as Dynamic is that there’s no valid reason for a project to want to have the author depend on information only available at wheel build time.

I see it as entirely reasonable for backends (setuptools) to mark fields as static as a way of saying “I take this value you gave me for the sdist as something that’s never going to change” even if they can’t absolutely prove that. So setuptools might read metadata from setup.cfg and mark it as static in the sdist even though the project might use setup() to override that dynamically. The “this data is static but it changed” check is then a way for the backend to deprecate such shenanigans without having to identify such code by source analysis (or by pure optimism).
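The “this data is static but it changed” check could look something like the sketch below. This is not how setuptools or any other backend does it; it only assumes that the sdist PKG-INFO lists its non-static fields in a repeatable Dynamic field, and the function name is invented for the example.

```python
from email.parser import HeaderParser
from typing import List


def changed_static_fields(sdist_pkg_info: str, wheel_metadata: str) -> List[str]:
    """List fields the sdist treats as static whose values differ in the wheel."""
    sdist = HeaderParser().parsestr(sdist_pkg_info)
    wheel = HeaderParser().parsestr(wheel_metadata)
    dynamic = {name.lower() for name in sdist.get_all("Dynamic", [])}
    # Metadata-Version and Dynamic describe the file itself, so skip them.
    skip = dynamic | {"metadata-version", "dynamic"}
    names = {name.lower() for name in list(sdist.keys()) + list(wheel.keys())}
    return sorted(
        name
        for name in names - skip
        if sorted(sdist.get_all(name, [])) != sorted(wheel.get_all(name, []))
    )


# Hypothetical metadata, for illustration only.
SDIST = "Metadata-Version: 2.2\nName: example\nVersion: 1.0\nRequires-Dist: six\n"
WHEEL = "Metadata-Version: 2.2\nName: example\nVersion: 1.0\nRequires-Dist: requests\n"

for field in changed_static_fields(SDIST, WHEEL):
    print(f"warning: {field} was marked static in the sdist but changed in the wheel")
```

A backend could start by emitting a deprecation warning for each reported field and turn it into an error later.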

steve.dower · October 22, 2020, 1:28pm

This kind of statement feels presumptuous to me. The main reason I wrote my own backend was because all the rest said “there’s no valid reason for a project to want X” to the point where it was impossible for me to do some of my projects with them.

I’m fine with saying “there’s no reason for an introspection tool to assume this field will differ between sdist and wheel, and projects should be prepared to deal with the consequences themselves if theirs does differ”.

All the rest applies. And FWIW I’d prefer to mark dynamic fields rather than static fields.

pf_moore · October 22, 2020, 1:45pm

The list is lifted from the latest PEP 621 cycle, and it’s not something anyone objected to there (maybe just because they were too busy objecting to backdoor standardisation of sdists?)

I feel like we need to make dynamic fields very much the exception rather than the rule if we’re going to get out of the whole “sdist metadata is unreliable” trap. And honestly, I feel like allowing the given fields to be dynamic on nothing more than a general principle is taking flexibility a bit too far. We’ve suffered for long enough in packaging with too much flexibility, I’m trying to incrementally learn some lessons from that. Come up with a valid use case for making any of those fields dynamic, and I’m fine with that, though.

I don’t actually have strong feelings on this, though. For my needs, I only care about Name, Version, Requires-Python and Requires-Dist. And while I’d like Requires-Dist to be static, and in many cases it can be, there’s no hope of mandating that for everything. Argue with me over Author and I really don’t care, though.

pf_moore · October 22, 2020, 1:49pm

By the way, my feeling here is that we have broad agreement that the proposal is a good idea, with a certain amount of reasonable and constructive debate over the details. So I’m going to try to pull this together into an actual PEP over the weekend (or if I miss getting to it this weekend, it’ll be a week or two, because I’m snowed under for the next couple of weeks).

pganssle · October 22, 2020, 2:20pm

No, because I think we all agree that in all metadata versions < 2.{2,3}, all fields are automatically dynamic. If we start marking fields as Static in 2.2 or 2.3, then you don’t need to know what metadata version you are working with in order to tell whether the field is static or dynamic (you just need to know whether or not the field is in the list of things marked static). If we start marking fields as Dynamic, then you need to know the version of the metadata spec to know the default.

Here’s both versions when we’re looking at get_static_fields instead:

Marking dynamic:

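from typing import List, Sequence, Tuple

from packaging.version import Version

def get_static_fields(core_pairs: Sequence[Tuple[str, str]]
  ) -> List[str]:
    # Before 2.2 there is no Dynamic field, so no field can be trusted as static.
    for key, value in core_pairs:
        if key == "Metadata-Version":
            if Version(value) < Version("2.2"):
                return []
            break

    # Start from every field present, then drop the ones listed as Dynamic.
    out = {key for key, _ in core_pairs if key != "Dynamic"}
    for key, value in core_pairs:
        if key == "Dynamic":
            # discard, not remove: a Dynamic field need not be present at all
            out.discard(value)
    return sorted(out)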

Marking static:

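from typing import List, Sequence, Tuple

def get_static_fields(core_pairs: Sequence[Tuple[str, str]]
  ) -> List[str]:
    # Each Static entry names one field whose value is reliable.
    return [field
            for key, field in core_pairs
            if key == "Static"]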
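Both versions take core_pairs as input. One way to get those pairs out of a PKG-INFO file is the standard library email parser, as in this sketch. The helper name and the sample file are made up, and the Static field is just the hypothetical spelling being compared here.

```python
from email.parser import HeaderParser
from typing import List, Tuple


def read_core_pairs(pkg_info_text: str) -> List[Tuple[str, str]]:
    """Parse PKG-INFO text into (field, value) pairs, keeping repeated fields."""
    msg = HeaderParser().parsestr(pkg_info_text)
    return [(key, str(value)) for key, value in msg.items()]


# Hypothetical PKG-INFO using the "mark static" spelling.
SAMPLE = """\
Metadata-Version: 2.2
Name: example
Version: 1.0
Static: Name
Static: Version
"""

core_pairs = read_core_pairs(SAMPLE)
# Same filter as the "marking static" version: no version check is needed.
static = [value for key, value in core_pairs if key == "Static"]
print(static)
```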

ncoghlan · October 26, 2020, 12:13am

FWIW, I think “dynamic as default, explicitly mark static” makes sense for PKG-INFO because it eases the transition for tools that produce metadata: when they migrate to the new metadata version, they can flag the fields they know are static, and leave the rest as “don’t know”.

If they have to declare dynamic fields, then that list also has to include all the “not yet determined to be static” fields.

A “Static” field would also translate more cleanly to wheels, as it would explicitly indicate the fields that are expected to be the same across every wheel for that project version.

pganssle · October 26, 2020, 3:22am

I don’t think we’re interested in breaking backwards compatibility. Nothing suggested here breaks backwards compatibility. If you want to propose a breaking change that switches to JSON-LD, then probably it’s worth doing it in a separate thread.

The difference is down to a quirk in how Python packaging works. Essentially, the way all packaging was done for a long time was that you would execute setup.py <command> to build / install / test / whatever a package. Since setup.py was a file that the user writes, it can have code like setup(install_requires=random.choice(["six", "requests"])) in it, meaning that if you execute setup.py twice (or, more commonly, in different build environments), it might give you different results. See this blog post for more background information.

In that situation, the build system had no way of identifying which fields are “static” (or “reliable”, i.e. specified as literals) and which are “dynamic” (or “unreliable”). What has changed is that we have more build backends and most backends have a way to specify metadata in a static format, e.g. pyproject.toml or setup.cfg. Since there are no user-supplied code paths, the backend knows that the values supplied via these channels are always reliable. There is also the possibility of getting clever and parsing the AST of the setup.py to determine which things are specified as literals, to increase the “static” coverage and allow tools to get generic metadata without executing a build.
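To make the AST idea concrete, here is a minimal sketch using the standard library ast module. It is not something any backend is known to do; it only looks at keyword arguments of a call named setup and ignores everything else a setup.py can get up to (setup.cfg overrides, monkeypatching, and so on). The function name is invented.

```python
import ast
from typing import Dict, Set, Tuple


def classify_setup_kwargs(source: str) -> Tuple[Dict[str, object], Set[str]]:
    """Split setup() keyword arguments into literal values and non-literal names."""
    static: Dict[str, object] = {}
    dynamic: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Accept both setup(...) and setuptools.setup(...).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "setup":
            continue
        for kw in node.keywords:
            if kw.arg is None:
                dynamic.add("**")  # **kwargs expansion: contents unknown
                continue
            try:
                static[kw.arg] = ast.literal_eval(kw.value)
            except (ValueError, TypeError):
                dynamic.add(kw.arg)  # computed at run time, so unreliable
    return static, dynamic


SETUP_PY = '''
import random
from setuptools import setup

setup(
    name="example",
    version="1.0",
    install_requires=random.choice(["six", "requests"]),
)
'''

static, dynamic = classify_setup_kwargs(SETUP_PY)
print(static, dynamic)
```

Nothing from setup.py is executed here, which is the whole point: the literal arguments can be read safely, and the rest are flagged as needing a build.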

This thread is about adding a field in METADATA that annotates which fields are reliable/unreliable, so that we can start building tools that would quickly do useful stuff like constructing dependency graphs or resolving an environment without needing to run dozens of builds or anything like that.

Which field do you think applies here? It seems very unlikely that this sort of thing is a common feature, since it’s mostly relevant here for legacy / backwards compatibility reasons.

pganssle · October 26, 2020, 3:37am

For producers, I don’t think it can logically make a difference which side they mark, because the labels are logical complements to one another. The list of dynamic fields is always just the list of fields minus the list of static fields, and vice versa. Build backends should be expected to make a decision about whether or not a field is static at sdist build time, so there’s no possibility that any fields are in a state where they are not static but not yet determined to be dynamic.

If we were allowing for a third “I don’t know” state, then we should explicitly mark both “known static” and “known dynamic”, but that seems over-complicated, in which case the two options are largely symmetrical (except that if we explicitly mark static, the default value for unmarked fields can be the same in all metadata versions).
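A small sketch of that symmetry, and of the one place where it breaks. ALL_FIELDS is an illustrative subset rather than the real list of core metadata fields, and the function names are made up.

```python
from typing import FrozenSet, Iterable

# Illustrative subset only, not the full set of core metadata fields.
ALL_FIELDS: FrozenSet[str] = frozenset(
    {"Name", "Version", "Requires-Python", "Requires-Dist", "Author"}
)


def dynamic_from_static(static: Iterable[str]) -> FrozenSet[str]:
    """Under "mark static", everything not listed is dynamic."""
    return ALL_FIELDS - frozenset(static)


def static_from_dynamic(dynamic: Iterable[str]) -> FrozenSet[str]:
    """Under "mark dynamic", everything not listed is static."""
    return ALL_FIELDS - frozenset(dynamic)


marked_static = {"Name", "Version", "Requires-Python", "Author"}
marked_dynamic = dynamic_from_static(marked_static)

# The two spellings carry the same information for new metadata...
print(static_from_dynamic(marked_dynamic) == marked_static)

# ...but older metadata has neither field, i.e. an empty list either way.
print(dynamic_from_static([]) == ALL_FIELDS)  # everything dynamic: correct
print(static_from_dynamic([]) == ALL_FIELDS)  # everything static: wrong for old files
```

The last line is why the "mark dynamic" reader needs a Metadata-Version check first, while the "mark static" reader gets the right default for free.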

pf_moore · October 26, 2020, 8:06am

A quick update, by the way. I have prepared the draft PEP for this, and it’s just waiting to be assigned a number by the PEP editors. Once I have a number, I’ll post it for discussion. If you want a preview, it’s here

ncoghlan · October 26, 2020, 10:01am

Programmatically I agree with you, but intuitively “These are the fields you can trust to be static across platforms” feels like a more logical thing for a tool to be emitting than “These are the fields you can’t trust to be static across platforms” (at least to me).

The version independent parsing argument should probably carry more weight, though, as that’s a concrete practical benefit that exists without having to consider what’s more or less intuitive.

I think a relevant difference between this and PEP 621 is that in PEP 621 we were choosing a new default field status, and hence were free to choose “static by default, opt in to declare things dynamic”. With the sdist metadata, “dynamic” is already the default for everything that isn’t encoded into the sdist filename, so it makes sense to have the opt-in be the other way around.

pf_moore · October 26, 2020, 10:12am

I view it as “with this new standard, you can trust the fields in a sdist to be static across platforms (or to tell you that they aren’t)”. So if you start from a position that you can now trust everything, Dynamic makes more sense (to me, at least).

This is the one point that for me weighs in favour of Static. I still prefer the semantics and conceptual implications of Dynamic over the “easier to implement” argument, but if no-one else gets the logic of Dynamic (and I can’t explain it well enough in the PEP to convince people) then I’ll switch to save people a few lines of code…

ncoghlan · October 26, 2020, 10:18am

Switching topics from the opt in spelling, there’s one “must be static” field that I think requires further clarification (both here and in PEP 621): Requires-Python

I’m assuming the intent is that platform specific Python version dependencies should be expressed through environment markers, but neither PEP spells that out.

pf_moore · October 26, 2020, 10:32am

Correct. I have no problem adding a note to that effect to my PEP (when I get the number sorted out). The equivalent is likely to be removed from PEP 621, as the “backends must update pyproject.toml in sdists” requirement wasn’t accepted.

Rather than single out Requires-Python, maybe we could word the clarification more generally, as something like:

Build backends SHOULD encourage users to specify fields statically using requirement markers rather than build-time logic. For fields that must be static in the sdist, backends MUST require environment dependencies to be handled that way.

ncoghlan · October 26, 2020, 1:57pm

Requires-Python is the only field that falls into the second category, though. While License could technically vary by platform, in practice, if you’re going to do that, you’d be more likely to put the platform-specific bits in a platform-specific dependency than you would be to have optional inclusions in the built wheels (and we don’t allow environment markers on the License field anyway).

I do agree it would be a good idea to word any note in a way that reminds readers that it applies to more than just Requires-Python though. Perhaps something like:

The Requires-Python field for a project may vary by target platform, but is required to be static in the sdist metadata. To handle this situation, build backends MUST use environment markers on the Requires-Python field to allow that metadata to remain common across the sdist and all wheel archives, rather than generating platform dependent Requires-Python metadata as part of the wheel build process. Build backends SHOULD also use this approach for other metadata fields that may vary by target platform (e.g. dependency declarations).

xafer · October 26, 2020, 6:14pm

Remember that Requires-Python currently does not support environment markers (cf. pypa/packaging.python.org pull request #771, “specifications: Requires-Python doesn't accept an environment marker”, for context).

pf_moore · October 26, 2020, 7:10pm

I’ve just created PEP 643: Metadata for Package Source Distributions with the actual PEP. Let’s move further discussions over to that topic.

pf_moore · October 26, 2020, 7:19pm

Ouch.

Do we have any actual examples where Requires-Python is calculated at build time? I’d like to try to keep Requires-Python in the “cannot be Dynamic” list, but I fear it may not be possible. But equally, I’d rather not drop it purely on the basis of “just in case”…

ofek · October 26, 2020, 8:53pm

What happens to older clients using new metadata versions e.g. would pip fail to install an sdist with Metadata-Version: 2.2 currently?
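For illustration only, here is one policy a metadata consumer could adopt for versions it doesn't know: warn when only the minor version is newer, and refuse when the major version is newer. This is a sketch of a possible rule, not a description of what pip does; SUPPORTED and the function name are hypothetical.

```python
import warnings
from typing import Tuple

# Hypothetical: the newest metadata version this imaginary tool understands.
SUPPORTED: Tuple[int, int] = (2, 1)


def check_metadata_version(value: str) -> Tuple[int, int]:
    """Parse a Metadata-Version value and apply a warn/fail policy."""
    major_text, _, minor_text = value.strip().partition(".")
    version = (int(major_text), int(minor_text or 0))
    if version[0] > SUPPORTED[0]:
        # A new major version may change the meaning of existing fields.
        raise ValueError(f"unsupported Metadata-Version: {value}")
    if version > SUPPORTED:
        # A new minor version should only add things, so carry on but say so.
        warnings.warn(f"Metadata-Version {value} is newer than supported")
    return version
```

Under this policy an sdist declaring Metadata-Version: 2.2 would still be usable by a 2.1-only tool, which would simply not understand the new field.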