Sdists (again): Metadata standardisation incremental update

The PEP 621 discussion is getting derailed by questions of sdist standardisation. Rather than cause PEP 621 to fail because of unrelated issues, I’d like to propose a very focused change to the sdist standard(s) to attempt to make sdist metadata usable.

We already have a standard that mandates that a sdist has a PKG-INFO file containing metadata. I propose the following changes:

  1. Update the spec to say that PKG-INFO MUST conform to [the current metadata standard] (https://packaging.python.org/specifications/core-metadata/). PEP 517 mandates metadata version 1.2, so this is a long overdue change anyway.
  2. Add a metadata field Dynamic modeled on the PEP 621 dynamic field.
  3. The Dynamic field is multiple use and contains any medatdata field name EXCEPT ones on the following list: Name, Version, Summary, Description, Requires-Python, License, Author, Author-email, Maintainer, Maintainer-email, Keywords, Classifier, Project-URL.

The Dynamic field could be added to the core metadata spec, or it could be a sdist-specific extension. I don’t really mind, but I think the core metadata spec technically disallows extra fields, so adding it to the spec would be easier.

The semantics of the Dynamic field would be that it is only allowed in a sdist, and that it is multiple-use and contains a metadata field name. Wheels built from a sdist MUST NOT add or change the value of metadata fields that are not listed in Dynamic.

I think we’ll need a way of flagging sdists that conform to the above spec, as it’s not backward compatible (current sdists don’t effectively allow every field to be dynamic, but don’t say so). Assuming we add Dynamic to the core metadata as metadata version 2.2, this can just be “sdists that write metadata version 2.2 and later”.

If this idea is broadly acceptable, in particular to the build backend maintainers (cc @pganssle, @takluyver, @sdispater, @steve.dower) then I’ll write it up as a formal update to the core metadata spec and a PEP. It would need a PEP-delegate - @pganssle would you be willing to take on that role?

Sure. Though that may be putting your finger on the scales a little bit, since I was pushing for something essentially equivalent in the last go-round :wink:.

I was in fact just thinking last night of suggesting that your points in the earlier thread make sense as additions to the core metadata spec:

Which also dovetails nicely with PEP 621, which is intended to be an input file that maps cleanly to core metadata!

So +1 to the general idea, and I’m happy to be PEP delegate to work out the details.

lol, I was actually thinking that as the person most uncomfortable with doing this in PEP 612, I could trust you to keep me honest! I’d completely forgotten (and missed on my re-skim of the threads) that you were advocating this last time.

Thanks, let’s see what others think now.

This sounds good to me. Means no changes to pymsbuild at all (until PEP 621 changes reading metadata) :slight_smile:

As a new backend maintainer, this seems totally fine with me! Just FYI there’s already an open 2.2 update for SPDX licenses: https://www.python.org/dev/peps/pep-0639/

A few questions:

  1. What happens to older clients using new metadata versions e.g. would pip fail to install an sdist with Metadata-Version: 2.2 currently?
  2. For wheel builders executing in the context of an sdist, if PKG-INFO has no dynamic fields then no core metadata from pyproject.toml would be used right?

I think we should invert the “dynamic” field, and say “Static” instead, and list the fields that are known to be static. That would also solve the issue of backwards compatibility. We assume all fields are dynamic unless they’ve been flagged static, which means that all existing sdists are still valid, they’re just “fully dynamic” sdists, and tooling that knows specific fields are static can emit it.

6 Likes

Does that solve anything other than backwards compatibility? I’m not sure giving up the 1-1 mapping of fields is worth it.

Well I think it matches reality better too? Like if this were a brand new thing, then sure it would make more sense to map the fields 1-1.

However, if we assume all fields are dynamic unless otherwise specified, then we eliminate a class of bugs. If a field is static, and you mistakingly treat it as dynamic, there’s no real harm in terms of correctness (just performance), since a truly static field will have the same value even if treated as dynamic.

On the other hand, if you mistakenly treat a dynamic field as static, then you’ve potentially just broke the build, and for weird non obvious reasons that end users are not likely ever going to notice.

I also don’t think the 1-1 mapping is that personally compelling. The METADATA format is a format designed primarily for tools to produce, and other tools to read, it’s not something that your typical end user is ever going to have to deal with or even really be aware of it’s existence, so for end users I don’t think that there is a super compelling need to worry about adding a slightly different concept. For the subset of the ecosystem that works on these tools, it will be slightly more work/mental overheard, but I think the amount of that is fairly minimal, and I think it’s worth it to eliminate (or at least make it much harder) for certain types of bugs to occur.

2 Likes

Just to add to this, if you look at the METADATA that is currently produced, it’s not exactly a 1-1 mapping. The way extras are specified, for example, would be bonkers as an input format, and is a lot more complicated to produce than listing all static fields.

Also, I wasn’t even thinking that you’d do a 1:1 mapping from PEP 621 at all. The PEP 621 pyproject.toml file might mark a field as dynamic, but the backend can fix it to a static value at sdist build time if it knows the value is invariant. This is what I’d expect setuptools_scm + setuptools to do, for example: when doing an sdist build you must defer to the backend, but if the built artifact has a constant version number, the field should be marked as static.

That said, I don’t know that marking static or marking dynamic matters much. I would think that we would need to bump the metadata version anyway to add the new “is it dynamic” / “is it static” field, since we’re introducing this concept de novo in the new metadata version; consumers of METADATA files should assume dynamic (or at least ambiguous) for Metadata Version <= 2.1. Any consumer that doesn’t know about Version 2.2 won’t care about static or dynamic, and any backend that knows to emit Version 2.2 should also know whether static or dynamic fields need to be marked.

1 Like

Rather than dynamic/static, aren’t we only missing a “deliberately blank” value?

If the key is present, it’s static, and if it’s absent, it could be statically empty or waiting for an actual build. Having a value (__empty__?) that marks it as known blank should cover the actual problems we’re worried about.

I can see two reasons why we might want an annotated concrete value rather than just “missing” and “not missing”:

  1. Ideally we’d use the exact same spec for wheels and source distributions. Wheels need concrete values in them, but you may want to know if the data in the wheel is reliable (i.e. “if I built this wheel again, would I always get the same answer?”).
  2. Having a concrete value marked as dynamic would allow the concrete value to serve as a hint as to what the dynamic value will return, similar to __length_hint__. You could imagine something like pip reading the concrete dynamic values for dependencies, spawning a new process to do a build to get the correct values, and then in parallel downloading the resources it needs assuming that the dynamic value returns the same as the included concrete values. In most cases, this would probably end up being right anyway (± a package or two).

All existing sdists are valid anyway, because they don’t use metadata version 2.3 (or whatever version adds “Dynamic”/“Static”).

I’m in favour of “Dynamic” precisely because it’s opt-in. It makes a strong statement that we prefer everything to be static, and metadata that differs between sdist and wheel is the exception rather than the norm.

Who’s “you” here? The backend? All you’re saying then is “if the backend has a bug, it returns the wrong results”.

This is the one way in which this proposal is strictly worse than the idea of having backends update pyproject.toml, though - because there’s nothing to prevent a backend writing X in the sdist metadata and Y in the wheel metadata. If the data were in pyproject.toml, the build of the sdist would have to use that data, as pyproject.toml is also the input format. So in this proposal, backends need to be sure to check that they do respect the data that they wrote to the sdist as static - they don’t get that for free. But that’s true whether the proposal is for Static or Dynamic.

In practice, I don’t really care much between Static vs Dynamic. I still intend to propose Dynamic in the PEP, but if there’s a very clear consensus that everyone else wants Static, I can change.

Oh, that’s what they were talking about with “1:1 mapping”. Thanks :slightly_smiling_face: Personally, I wasn’t bothered about being the same as PEP 621, except in the sense that I feel that the reasons PEP 621 chose dynamic rather than static are just as valid here, and so having a Dynamic metadata element is the right choice here as well.

That’s logically equivalent to the other two options, yes. But I personally don’t like it as much.

As proposed, Dynamic isn’t allowed in wheel metadata, only in sdist metadata. So this comment isn’t accurate. I don’t personally think it’s worth the added complexity of allowing that (nor do I want the additional controversy/debate that could come from it!) but I guess someone could argue extending wheel metadata to include Dynamic to mean “this wasn’t static in the sdist”. I’d prefer that to be a separate follow-up proposal, though.

FWIW, as mentioned here, this would almost certainly cause a bug in Warehouse right now, and I suspect it’s not the only place that implements parsing metadata in the same way. We’ve never made this kind of change before, and I think all, or almost of of the metadata parsing and emitting code I’ve seen in the wild largely assumes that the fields are the same between versions.

In the end, we could say all of those tools are buggy for making that assumption, and that’s not a wrong thing to do. I just think that overall the transition would go smoother if we kept the defaults as they were, and layered this on top of those defaults.

That being said, I wouldn’t block it over that, I just don’t think it actually sends any kind of signal (because I don’t think hardly anyone but the handful of tool authors will ever really look at it, and they already know the direction we want to go in) and I think it makes the transition smoother.

You is anyone who is producing or reading the METADATA files. It is entirely true that the statement is somewhat like “if X has a bug, it produces the wrong result”… but if we can reduce the ways bugs can occur, that seems like a net positive to me? Like what’s the downside?

This same bug exists for the other proposal right? Because there’s nothing stopping a project for putting X in the pyproject.toml in the repo, Y in the pyproject.toml in the repo, and Z in the METADATA in the wheel (and IIRC, that proposal wanted projects to put different things in X and Y above, because it wanted to “staticfy” some data if possible… like version numbers).

I don’t understand the bug you are describing here? My understanding of the idea would be something like this (not looking up the actual :

Before:

Metadata-Version: 2.1
Name: foo
Version: 1.0.0
Author: Paul Ganssle

After:

Metadata-Version: 2.3
Name: foo
Version: 1.0.0
Author: Paul Ganssle
Dynamic: Author

The fields are all interpreted as before. This adds a new field that annotates the other fields. The field that’s getting a default value in versions < 2.3 would be one that doesn’t exist today, so I wouldn’t expect it to cause any backwards compatibility bugs in existing software.

You are right, though, that it would simplify the parsing a bit, since (pseudocode) the version where you annotate the dynamic fields would look something like this:

def get_dynamic_metadata(core_pairs: Sequence[Tuple[str, str]]
  ) -> List[str]:
    for key, value in core_pairs:
        if key == "Metadata-Version":
            if Version(value) < Version("2.2"):
                return [key for key in core_pairs]
             break

    out = []
    for key, value in core_pairs:
        if key == "Dynamic":
            out.append(value)
    return out

If you mark static fields instead, the parsing code isn’t specific to the version:

def get_dynamic_fields(core_pairs: Sequence[Tuple[str, str]]
  ) -> List[str]:
    out = {key for key in core_pairs}
    for key, value in core_pairs:
        if key == "Static":
            out.remove(value)
    return list(out)

Wouldn’t get_static_fields exhibit exactly the same difference, but in reverse, though? (Harder if the chosen form were Static)

Offtopic, but to give a specific example, say version is dynamic in the source tree. You build a sdist, and in doing so, determine a version. The backend writes the version into pyproject.toml as a static field. If you now build a wheel from the sdist, the backend doesn’t have to do anything special, it reads pyproject.toml, sees a static version and writes it into the wheel.

With metadata, though, you build a sdist, and in doing so, determine a version. The backend writes the version into the sdist metadata as a static field. If you now build a wheel from the sdist, the backend just sees the same pyproject.toml as before, with a dynamic version, so it calculates the version. It now has an extra step, to compare the calculated version with the one in the sdist metadata, to ensure that the “this field is static” guarantee is not violated. That is what I meant by “backends need to check”.

But the consensus on PEP 621 seems to be that backends (setuptools, specifically) don’t want to rewrite pyproject.toml, so I’m assuming they are OK with adding the extra check instead.

Not checking at all is equivalent to saying that we don’t care if metadata matches between the wheel and the sdist. Any backend wanting to take that view must explicitly say that all the metadata fields are dynamic. (That’s one of the reasons I prefer having Dynamic be opt-in, so that backends can’t “accidentally” or “casually” make that choice).

1 Like

Or just not checking :slight_smile:

Surely this only applies to setuptools (or backends designed like it). PyMSBuild generates a PKG-INFO for the sdist, and wheel build just renames it to METADATA. So you could argue that it cares so much that it’s not going to check all the fields :wink: (the real reason is that the user spent all that time crafting their metadata and I’m not about to go second guessing that).

I’m pretty sure Flit also has no way to modify metadata during sdist->wheel conversion, so you’re designing specifically for setuptools here. Edit: Proven incorrect below

This is not true. Setuptools and Flit use the same strategy here. Metadata from sdist are not used at all when building a wheel. Wheel metadata are generated from the original user declaration in either setup.py, setup.cfg or pyproject.toml. IOW they are both treating all current sdist metadata dynamic.

1 Like

Given that the proposal only allows Dynamic in sdists, and not in wheels, this would need to change. Although it sounds like you wouldn’t have any use for dynamic metadata anyway, as there’s nowhere that could generate it in your backend…

1 Like

Yeah I’ve implemented Hatch the same way, it’s far simpler.