PEP 643: Metadata for Package Source Distributions

I’ve gone ahead and updated the PEP with that wording. The published copy will be updated shortly.

Starting my analysis / review of the PEP with nitpicks before I get into any substantive discussion of the content:

Should this be “an sdist”, or possibly “a source distribution”? I pronounce “sdist” in my head as “ess-dist”, which takes “an” as its indefinite article. Spelling it out would remove the problem / ambiguity.

Another inconsequential note: there was a decent amount of controversy about use of “dynamic”, since that’s roughly ill-defined. I do not want to get into a bikeshed about it, so I’m fine with us calling it “dynamic”, but we should probably define what we mean by “dynamic” explicitly and early on, maybe in the “Rationale” section. I think we should also probably reword the “Rationale” section so this is not tied to PEP 621. This problem pre-dated PEP 621 and PEP 621 was at least partially designed in anticipation of this, rather than the other way around.

If “Rationale” is mostly about why we chose the name “Dynamic”, then it makes some sense to just start with a definition like, “We choose the word ‘dynamic’ to refer to any field which is not fixed at the point that a source distribution is built, i.e. those fields which will be supplied by a backend when wheel metadata is prepared.” Then you can say that this is by analogy to a similar field in PEP 621, maybe?

This probably can be re-worded, because taken literally it would mean that you’d be allowed to write invalid values into a field if it was marked as Dynamic in the sdist, which isn’t true. Maybe we can say, “If a field is marked as Dynamic, no restrictions are placed on its value — so long as the value is otherwise valid — in a wheel built from the sdist.”

A few substantive issues:

  1. I thought that we had settled on marking fields as Static rather than Dynamic, since the two are otherwise pretty much equivalent, but it does make it slightly easier to implement a parser that supports both 2.2 and earlier metadata versions.

    The “nudge” aspect of the choice of marking Dynamic vs Static I think is handled by the fact that there’s at least currently a whitelist of what can be considered Dynamic.

  2. That said, I’m not crazy about the fact that there’s a whitelist of what fields are allowed to be dynamic. I can see forcing Name and Version to be static because anyone not fixing those is doing it as a party trick anyway. The other stuff, though, is in a weird limbo where technically what we’d be doing is backwards-incompatible, but it’s probably justified because probably no one is relying on the dynamic nature of those other fields anyway.

    Right now setuptools allows you to set arbitrary fields differently between sdist and wheel time. If we enforce that all but a few existing fields must be static, then the only choices setuptools has are to throw an exception or to give priority to the static metadata present in the sdist. If we throw an exception whenever we can’t determine if the data is static or not, we’d cause projects like dateutil to simply pin setuptools rather than revamp the build system to help setuptools realize that some arbitrary generated field is actually static (though maybe setuptools could offer a “static wrapper” class that lets you mark certain fields as fixed to mitigate this). If we start pulling from the fixed sdist metadata, we risk breaking something silently, which might be worse.

    My thinking is that this is the kind of trade-off that probably should be left to back-ends rather than enforcing it in the spec. Practically speaking, setuptools will hopefully start parsing the AST of setup.py and we’ll find that in most cases we can determine that most fields are static. If we enforce it in the spec, there’s a surprisingly good chance that random projects with some use case we’ve never thought of before will show up a year or two into the implementation and say, “We were never consulted on this and it breaks our entire workflow!”

  3. I am mostly on board with the arguments for rejecting ideas 4 and 5 (no Dynamic in wheel and no value for dynamic fields), but as long as the semantics are clear, I don’t think it hurts to allow either of these, and I could see a few reasons why it would be useful to design it that way.

    If we were to say that the sdist metadata fields are populated whether or not they are considered dynamic but the fields are to be taken as a “hint” rather than a fixed value, it would allow for heuristics where, for example, pip or another resolver could read a list of dynamic dependencies from a source distribution, then start working on downloading resources in parallel while it executes the local metadata build to get reliable results — in most cases the set of dependencies in the sdist will significantly overlap with the set of dependencies in the final wheel.

    A potential use case for dynamic fields in wheels would be that it could make it easier to introspect about the reliability of your environment. If I look at my installed environment and none of my dependencies are dynamic, then I can wrap up the dependency graph and its environment markers and call it a cross-platform lock file. Without that, I need to download all their source distributions (assuming they are published) in order to determine which fields may vary on different platforms, and if I’m building a lock file, I also want to download the wheels, to get the hashes.

    At the end of the day, these use cases aren’t amazingly compelling, but both of these points are cases where we are forcing the metadata to be lossy (because at build time we usually have a value for the sdist, we just don’t know if it’s reliable, and at wheel build time we know whether the input was dynamic, we just don’t think it’s useful to include that information in the wheel). I’m inclined to go the lossless route here if it’s not otherwise expensive, since, after all, this whole PEP was necessitated by the lossy nature of the existing metadata spec.

  4. It’s not clear from this PEP where you are supposed to put the metadata file in the source distribution. Is this PKG-INFO? Are we adding a .dist-info/METADATA file? Is this buried somewhere in the Core metadata spec, which seems somewhat out of date since it says under distribution formats that PEP 517 and PEP 518 have not been implemented yet?

Edit: Noticed one more nitpick: in point 3 of “Rejected ideas”, it says “If a genuine use case is identified later, the specification can be changed to allow Rquires-Python to be dynamic at that time.”, which should have s/Rquires-Python/Requires-Python.

One comment, and I’ll review the rest tomorrow.

It’s in PKG-INFO which is defined in PEP 517 under the build_sdist hook. That’s already standardised, and I don’t intend to change it (I have no stomach for a debate over file names - someone else can do that later if they care enough). The PR for the packaging user guide moves this into the PyPA specifications (and PEP 643 states that the PyPA spec is the new canonical location for the information), as PEP 517 is a very non-obvious place for the sdist standard to be located.
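As a rough illustration of where that file lives, here is a minimal sketch that reads PKG-INFO out of an sdist. It assumes the PEP 517 layout of a .tar.gz with a single top-level {name}-{version} directory; the function names read_sdist_metadata and build_demo_sdist are made up for this example, and the demo archive is generated in memory rather than taken from a real project.

```python
import io
import tarfile
from email.parser import HeaderParser


def read_sdist_metadata(fileobj):
    """Parse the top-level PKG-INFO of a .tar.gz sdist into a Message."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive.getmembers():
            parts = member.name.split("/")
            # Only accept {name}-{version}/PKG-INFO, not nested copies.
            if len(parts) == 2 and parts[1] == "PKG-INFO":
                data = archive.extractfile(member).read().decode("utf-8")
                return HeaderParser().parsestr(data)
    raise ValueError("no top-level PKG-INFO found")


def build_demo_sdist():
    """Create a tiny in-memory sdist so the example is self-contained."""
    payload = b"Metadata-Version: 2.2\nName: demo\nVersion: 1.0\n"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo("demo-1.0/PKG-INFO")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


metadata = read_sdist_metadata(build_demo_sdist())
print(metadata["Name"], metadata["Version"])
```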

I don’t think there was a particular consensus on the matter - there was some initial support for @dstufft’s suggestion, but the discussion felt inconclusive, and no-one has argued against the fact that I went for Dynamic in the PEP (until now).

One more significant argument in favour of Dynamic is that if we look forward to a situation when static data is the norm, and essentially most package data is static, do we really want to see long lists of Static declarations? Like the following:

Metadata-Version: 2.2
Name: pip
Version: 19.3.1
Summary: The PyPA recommended tool for installing Python packages.
Home-page: https://pip.pypa.io/
Author: The pip developers
Author-email: pypa-dev@groups.google.com
License: MIT
Keywords: distutils easy_install egg setuptools wheel virtualenv
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Requires-Python: >=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*
Static: Name
Static: Version
Static: Summary
Static: Home-page
Static: Author
Static: Author-email
Static: License
Static: Keywords
Static: Platform
Static: Classifier
Static: Requires-Python

Essentially the benefit of Dynamic is that it should become less common as time goes on. We’re marking the exception, not the rule.
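For comparison, a small sketch of what the Dynamic approach looks like when a mostly-static project is written out and read back. The helper render_metadata is hypothetical, not part of any backend, and the choice of Requires-Dist as the one dynamic field is only an assumption for the demo. Reading uses the standard library's email parser, since core metadata is laid out as RFC 822-style headers.

```python
from email.parser import HeaderParser


def render_metadata(fields, dynamic=()):
    """Render (key, value) pairs, then one Dynamic line per exception."""
    lines = [f"{key}: {value}" for key, value in fields]
    lines += [f"Dynamic: {name}" for name in dynamic]
    return "\n".join(lines) + "\n"


text = render_metadata(
    [
        ("Metadata-Version", "2.2"),
        ("Name", "pip"),
        ("Version", "19.3.1"),
        ("License", "MIT"),
    ],
    dynamic=["Requires-Dist"],  # assumed for illustration only
)
parsed = HeaderParser().parsestr(text)
print(text)
print(parsed.get_all("Dynamic"))
```

Only the single exceptional field needs a marker line; everything else is static by default.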

I think the only backend (at least, that I’m aware of) that has a serious issue here is setuptools, which is something of a special case precisely because of its extreme dynamicity. In a sense, setuptools (more accurately distutils) is the problem we’re trying to address here.

I’m completely in favour of anything that makes it easier for setuptools to push for static-by-default behaviour as fast as possible. My expectation was that setuptools wouldn’t do anything, but bdist_wheel would verify that the metadata which got generated for the wheel matched the sdist metadata, and report an error if it didn’t. The effect of this would be that the user is responsible for not violating the static metadata requirement, but that’s fine, because in practice nobody does make these fields vary. If it turns out that there’s a significant base of people who alter (say) License at build time, we can relax the restriction at that point.

For fields which can be dynamic, I’d assume that setuptools (at least initially) defaults to Dynamic unless the data comes from a clearly-static source like setup.cfg or pyproject.toml. And bdist_wheel has nothing to do in that case.

But that’s the point - setuptools doesn’t need to make a choice, it can leave the detection to bdist_wheel (which has the information from both the wheel and the sdist). Think of static sdist metadata as a promise rather than a guarantee and you don’t have to decide when the sdist is generated.
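A minimal sketch of the kind of check described here, assuming the wheel builder has both metadata files available as text. The function name static_mismatches is hypothetical and this is not how bdist_wheel is actually implemented; it only compares every field the sdist did not mark as Dynamic.

```python
from email.parser import HeaderParser


def static_mismatches(sdist_text, wheel_text):
    """Return the static fields whose values differ between sdist and wheel."""
    sdist = HeaderParser().parsestr(sdist_text)
    wheel = HeaderParser().parsestr(wheel_text)
    dynamic = {name.lower() for name in sdist.get_all("Dynamic") or []}
    # The markers themselves and the version header are not compared.
    skipped = dynamic | {"metadata-version", "dynamic"}
    fields = {key.lower() for key in sdist.keys()}
    fields |= {key.lower() for key in wheel.keys()}
    mismatches = []
    for field in sorted(fields - skipped):
        # Multi-use fields are compared as unordered collections.
        if sorted(sdist.get_all(field) or []) != sorted(wheel.get_all(field) or []):
            mismatches.append(field)
    return mismatches


SDIST = (
    "Metadata-Version: 2.2\nName: demo\nVersion: 1.0\n"
    "License: MIT\nDynamic: Requires-Dist\n"
)
WHEEL = (
    "Metadata-Version: 2.2\nName: demo\nVersion: 1.0\n"
    "License: BSD\nRequires-Dist: attrs\n"
)
print(static_mismatches(SDIST, WHEEL))
```

In this demo the License field was promised as static but changed in the wheel, so it is the one reported; Requires-Dist is allowed to differ because the sdist marked it Dynamic.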

I honestly don’t care much. @brettcannon asked for it to be an error and I didn’t see any reason against that choice. I am concerned that under your proposal, we’d effectively be saying that the metadata value in a sdist is tool-defined if it’s marked as Dynamic, but if it matters enough to you to want me to change it (and no-one else objects) then I’ll do so.

I don’t really make a distinction between “setuptools” and “bdist_wheel”, but either way what you are describing is worse than any of the alternatives I suggested, because it will only detect an issue if the sdist build environment doesn’t match the wheel build environment. That’s way too late in the process. It means end users (who can’t do anything about it) will start having pip install <x> fail on them because the people who can make the change before a release don’t build an sdist then test building wheels from it on every other platform — they build and install all tests from source, and build wheels and sdists directly from source independent of one another (and on the same platform anyway), before uploading.

The only things we can do are:

  1. Have bdist_wheel pull anything static from PKG-INFO first, ignoring any values that are generated by setup.py that might be in conflict with those in PKG-INFO.
  2. Detect whether a value is static or dynamic at sdist build time and raise an exception if a field that is required to be static cannot be shown to be static. We can know that anything in setup.cfg is static, but without parsing the AST we won’t know if anything passed to setup() is. We’ll have to assume anything passed is dynamic, which will likely break a lot of builds.

Neither of these is especially desirable from a user experience or adoption point of view. If we go with the exception mode, we’d need to delay the roll-out of support for this until we have an AST-parser/reader that covers a sufficiently significant quantity of the setup.py files out there that we won’t be breaking everyone’s builds on day one. And AST parsing is going to need upkeep, because AFAIK the AST is not stable.

On the other hand, if we allow basically all fields to be dynamic, we can roll out today, and everyone with setup.cfg files will opt in immediately. We can gradually add some AST parsing heuristics without worrying about being perfect, because the metadata upgrades will happen silently as our heuristic processing gets better. We also don’t have to worry quite so much about the AST parsing getting out-of-date, because we can fall back on “if we don’t understand the AST, warn and emit a bunch of dynamic fields”.
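To make the “best effort” idea concrete, here is a minimal sketch of such a heuristic, assuming a setup() keyword argument counts as static only when ast.literal_eval can evaluate it. The name classify_setup_kwargs and the "**" placeholder are invented for this example and are not anything setuptools provides; a real implementation would need to handle far more patterns.

```python
import ast


def classify_setup_kwargs(source):
    """Split setup() keyword arguments into literal values and unknowns."""
    static, dynamic = {}, set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        # Match both `setup(...)` and `setuptools.setup(...)`.
        func = node.func
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "setup":
            continue
        for keyword in node.keywords:
            if keyword.arg is None:
                # `**kwargs` could supply anything, so give up on it.
                dynamic.add("**")
                continue
            try:
                static[keyword.arg] = ast.literal_eval(keyword.value)
            except (ValueError, TypeError):
                # Not a plain literal: fall back to treating it as dynamic.
                dynamic.add(keyword.arg)
    return static, dynamic


SETUP_PY = '''
from setuptools import setup
VERSION = open("VERSION").read().strip()
setup(name="demo", version=VERSION, install_requires=["a", "b"])
'''

static, dynamic = classify_setup_kwargs(SETUP_PY)
print(static)
print(dynamic)
```

Anything the walker does not recognise simply ends up in the dynamic set, which is the fallback behaviour argued for above.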

I don’t really understand what you mean by “the metadata value in a sdist is tool-defined if it’s marked as Dynamic”. My proposal is that:

  1. In an sdist, a field with a value and Dynamic set should treat the value as a hint as to what the true value will be, but the canonical value (as it is in the current version of the PEP) requires a build to determine, because it is supplied by the backend dynamically.
  2. In a wheel, Dynamic refers to the provenance of the metadata. If a field is marked Dynamic in a wheel, the associated value is canonical for the platform / context you are using (as it is today), but you know that it was the result of a calculation at build-time, and may not apply to other wheels built from the same source distribution.

No one is clamoring for this information, but we definitely have it at build time and as I mentioned in my comments I can think of some situations where it would be useful.
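As a sketch of the environment-introspection use case, assuming wheels were allowed to carry Dynamic markers through to installed metadata (which is exactly the open question here): the function dynamic_report is hypothetical, while importlib.metadata.distributions() is the standard-library way to enumerate installed distributions. The demo runs on hand-written metadata so it does not depend on what is installed.

```python
from email.parser import HeaderParser
from importlib import metadata


def dynamic_report(metadatas):
    """Map project name to the fields its metadata marks as Dynamic."""
    report = {}
    for md in metadatas:
        fields = md.get_all("Dynamic") or []
        if fields:
            report[md["Name"]] = sorted(fields)
    return report


def installed_metadata():
    """Yield the core metadata of every distribution in this environment."""
    for dist in metadata.distributions():
        yield dist.metadata


# Two hand-written METADATA files standing in for installed wheels.
samples = [
    HeaderParser().parsestr("Metadata-Version: 2.2\nName: alpha\nVersion: 1.0\n"),
    HeaderParser().parsestr(
        "Metadata-Version: 2.2\nName: beta\nVersion: 2.0\n"
        "Dynamic: Requires-Dist\nRequires-Dist: alpha\n"
    ),
]
print(dynamic_report(samples))
```

Calling dynamic_report(installed_metadata()) would run the same scan over the current environment; an empty result would be the signal that the dependency graph could be treated as platform-independent.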

My original objection to this was that this is not really presented for human consumption anyway, so it doesn’t matter that much what it looks like, but I’ve changed my mind and actually find this compelling. Even if it’s not a human-centric format, it will almost certainly be read by humans more often than people will write new libraries to parse it. So, +1 for Dynamic rather than Static.

Why would builds break? IIUC current setuptools/wheels discard all sdist metadata when building a wheel anyway, so all fields would be dynamic no matter if we go with Dynamic or Static—all fields in sdist metadata are dynamic. And after setuptools/wheel implements this PEP, they would be able to honour the dynamic/static definition correctly (only) for the new metadata version. It seems to me that everything would work as expected.
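One way a consumer might honour that rule, sketched under the assumption that sdist metadata older than version 2.2 makes no promise at all, so every field in it has to be treated as potentially dynamic. The function name is_static is illustrative only.

```python
from email.parser import HeaderParser


def is_static(metadata, field):
    """True only if the metadata version supports Dynamic and omits the field."""
    version = tuple(int(part) for part in metadata["Metadata-Version"].split("."))
    if version < (2, 2):
        # Older metadata carries no static/dynamic information.
        return False
    dynamic = {name.lower() for name in metadata.get_all("Dynamic") or []}
    return field.lower() not in dynamic


old = HeaderParser().parsestr("Metadata-Version: 2.1\nName: demo\nVersion: 1.0\n")
new = HeaderParser().parsestr(
    "Metadata-Version: 2.2\nName: demo\nVersion: 1.0\nDynamic: Requires-Dist\n"
)
print(is_static(old, "Name"), is_static(new, "Name"), is_static(new, "Requires-Dist"))
```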

+1

The internal AST objects in CPython are not stable, but the ast module in the standard library is. Those who work on the former very rarely specify which they’re talking about, but you shouldn’t have to worry much about the AST walker changing (except for when you discover new patterns or syntax you have to deal with).

I just realised - you’re assuming that the “hint” should be “the value I would calculate for this field if I were being asked to build a wheel right now rather than a sdist”. That’s probably the obvious choice for setuptools, but might not be for other backends. And technically, the way you’ve worded it, there’s nothing to stop a backend writing a snippet of Python code as the “hint”, for the wheel builder to exec at build time. This is why I’m uncomfortable about not being more precise about what we mean by “hint” - while I don’t expect backends to do stupid things like this, I’d prefer not to have to think about the security implications :man_shrugging:

If we can word things so that assumption is explicit (and mandated) I’d find this proposal a lot easier to accept.

I see the point here, but the wording (in the PEP and in the core metadata spec) is already a bit convoluted because I need to explicitly make the link between “the metadata in wheels built from this sdist” and the sdist metadata. I fear that trying to express the meaning of Dynamic in terms of “the metadata of a different wheel built from the same sdist as this one” will further confuse the explanation.

It’s possible that I’m trying too hard to make the PEP too “legally airtight” here. I’ve been frustrated in the past by loose wording in standard-defining PEPs, and I may be over-reacting to that. If people are comfortable with the wording in the standard having a certain level of “being open to interpretation”, I’ll try to curb my tendency to mathematical precision :slightly_smiling_face:

The proposal in the PEP as currently written has many fields that are required to be static, which is something setuptools cannot guarantee in its current state. If we both require the sdist metadata to be canonical and require a bunch of inputs that can currently be set dynamically to be static, then we cannot implement this in a backwards-compatible way. We’ll either need to violate some expectations by making the sdist build the canonical source of metadata for those fields or we’ll have to throw an exception at some point in the process if we get a value that we can’t know is static. Without using AST parsing to detect situations where the metadata is static, that is every input specified in setup.py instead of setup.cfg. Ergo, in the implementation proposal you’re responding to (where we throw exceptions), nearly all builds would break.

The quote from the docs is this:

I don’t know exactly what this means in terms of stability, and I’d be happy to find that what we’re trying to do is not complicated, but I do remember that a lot of stuff that uses the AST under the hood tends to be the most fragile when updating Python, and it’s not great when setuptools is fragile.

What I’m worried about is more that we add in a bunch of heuristics to detect that something is unambiguously constant by parsing the AST, then a new Python version comes out where the AST parser returns a different (possibly equivalent) set of nodes for the same code in a way that breaks our heuristics. If we’re doing the heuristics on a “best effort” basis because the fields are not required to be static, then this may trigger some additional builds, but it won’t cause builds to fail like it would if we are throwing an exception whenever someone passes something that we don’t know to be static. Is that the sort of thing that can happen with the current stability guarantees?

Yeah, we can make the intention explicit. I am not really sure why a backend would want to put some non-useful value there, since the idea is that it should be a hint as to what the value is expected to be for consumers of the metadata (not for the backend itself). We can either explicitly say that it should be the value that the backend would generate if a wheel were built or some looser wording that would allow for some backend to play with the hint to try to give more useful information — e.g. if you know the dependencies will either be {a, b, c} or {a, e, f}, you could specify either {a}, since that’s the value guaranteed to be present or {a, b, c, e, f} since you know that the true value would be a subset of the specified value.
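The two candidate hints in that example are just the intersection and the union of the possible dependency sets. A tiny sketch, with dependency_hints as a made-up helper name:

```python
def dependency_hints(possible_sets):
    """Return (guaranteed, superset) across all possible dependency sets.

    Expects at least one set; an empty input raises TypeError.
    """
    sets = [frozenset(deps) for deps in possible_sets]
    guaranteed = frozenset.intersection(*sets)  # present in every outcome
    superset = frozenset.union(*sets)  # contains every possible outcome
    return guaranteed, superset


guaranteed, superset = dependency_hints([{"a", "b", "c"}, {"a", "e", "f"}])
print(sorted(guaranteed))
print(sorted(superset))
```

Which of the two a backend records is the policy question; the looser wording would permit either.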

I don’t really care either way. In practice I think most popular backends won’t need dynamic fields at all, and setuptools would just build the wheel metadata and include that in the sdist (or some equivalent).

I don’t think it’s that confusing. The definition is clear: a field is marked Dynamic if it would be marked Dynamic when building an sdist. The only difference is that for sdists, Dynamic means “when you build a wheel this value might change (either from ‘no-value-specified’ or from the hint value)” and for wheels it means “the value was supplied by the backend at wheel build time.”

That metadata for dynamic fields in two wheels built from the sdist might be different is an inherent property of dynamic fields, not part of the definition. By marking the fields as Dynamic in the wheel, you are preserving the ability for tools to determine whether that property — or any other property of Dynamic fields — applies.

One thing that we haven’t figured out is whether it would be optional to mark these fields as Dynamic or not in wheels. I’m inclined to make it mandatory just for the sake of consistency, but I can see why it might be reasonable to allow leaving it out in wheels (since it has no bearing on the packaging system). If we make it optional, it would mean that you can only reliably know whether a given field was dynamically generated if it’s marked, and you cannot reliably know whether it was a static value.

I don’t think it’s particularly helpful to suggest that non-backend consumers can use dynamic metadata like that. The whole point here is to make the data that consumers read from the sdist reliable. So (in standards-legalese) I’d be expecting “consumers MUST NOT assume that data in a sdist that is marked as Dynamic will be the same in a wheel”.

But we’re picking over details. I’ll stick with my comment that if people want the spec to be less legalistic and more informational, I’m fine with that (but I’d need help getting the wording right).

If you can give me the actual wording that you’d like to see, that would help. I’m certainly confused over what you’re trying to say, and if I try to come up with words, I’ll probably reflect that confusion in what I propose :slightly_frowning_face:

In general, I’m not arguing particularly strongly against your suggestions (I’m mostly indifferent or only mildly against) but I’m unable to understand what you’re proposing well enough to modify the PEP to reflect your suggestions. Maybe you could make a draft PR against the PEP, so we had something precise to work from?

The point is that it acts like __length_hint__. It’s for situations where having a good guess now is better than having the exact right answer later. The canonical use case I mentioned would be if you are using a distributed workflow to do some sort of dependency resolution. For example, some pseudocode:

def dependency_graph(pkg, _node_cache={}) -> Node:
    # The mutable default argument is deliberate: it acts as a cache
    # shared across calls.
    if pkg in _node_cache:
        return _node_cache[pkg]

    # To make it more compact I'm assuming we need to use sdists,
    # though in practice we'd only hit this branch if no wheel is
    # available.
    sdist = get_sdist(pkg)  # Blocking, expensive
    metadata = get_sdist_metadata(sdist)

    # BLOCK 1
    if metadata.needs_build("Requires-Dist"):
        for dep in metadata.requires_dist:
            # Assume this is non-blocking; for brevity's sake
            # I have skipped any cancellation or de-duplication logic
            thread_pool.add(dependency_graph, dep)

        metadata = get_wheel_metadata(sdist)  # Blocking, expensive

    node = Node(metadata.name, metadata.version)
    # BLOCK 2
    for dep in metadata.requires_dist:
        node.add_dependency(dependency_graph(dep))

    return _node_cache.setdefault(pkg, node)

Obviously I’ve left out a lot of details there, but you can see the idea: in Block 1, you use the “hint” to warm the cache in another thread or process while you are waiting for the build backend to get you the canonical answer to “what dependencies are available”. In most cases, the hint will be very close to accurate, so when you get to Block 2, the recursive dependency_graph calls would hit the cache.

You could also imagine this sort of thing being used in a situation where you care more about not doing builds than you care about getting the right answer. “I don’t want to execute arbitrary Python code, but I want an estimate of what this dependency graph looks like”. The choices there are to completely ignore any nodes in the dependency graph that don’t supply wheels or to use the hint and get something that is probably pretty close to what you’d expect.

“A field is marked Dynamic in a wheel if it would be marked Dynamic when building an sdist.”

TBH I’m not entirely sure what’s confusing about it. What situations can you imagine where this would lead to ambiguity? What actions would you take or not take based on it being marked Dynamic in a wheel?

I see, thanks for explaining. My expectation was that setuptools can choose not to declare that the PKG-INFO it emits conforms to PEP 643 if it cannot determine that all of the required-to-be-static fields are static (i.e. they are not defined in setup.cfg, or are defined/overridden by a setup() argument). No builds would break because only the configs that setuptools can be absolutely sure conform to this PEP would use the new metadata version; others will continue to be built like they are right now, as if this new sdist metadata format does not exist.

I can see this may not be desirable, however, since the current PKG-INFO is not standardised (and useless), and a vast number of users won’t be able to take advantage of this sdist metadata because package maintainers haven’t seemed keen to switch away from setup() arguments.

My understanding from the PEP is that this is not an option:

The current wording indicates that we can’t fall back to earlier versions and we must report an error.

It means we do our best not to change things, but if legit reasons come up we will change it. Typically it means a new node type or an important simplification.

There are packages which help smooth the differences out like astroid and typed_ast.

Or in more fluffy language: “When read from an sdist/PKG-INFO, a field marked as Dynamic can have its value provided later in the METADATA file in a wheel. When Dynamic is in a METADATA file for a wheel, it is a marker for the provenance of the value as being generated by the wheel-building process and not directly from the PKG-INFO file from an sdist.”

I think naming the field something like Dynamic-In-Sdist would make it more true to its meaning

I think we can put off the question of naming until after we nail down the general details. I personally don’t think it matters very much. There’s not much else that Dynamic could mean in the context of built wheel metadata. I’m not sure what we’d be guarding against by trying to telegraph that in the name.

The discussion appears to have died down, so I’d like to summarise where we are on these points.

  1. On Static vs Dynamic, I’m going to stick with Dynamic. You gave that a +1 here, so I assume you’re OK with that.
  2. You seem to have moved from just being “not crazy about the idea” of a whitelist to being fairly strongly against it. I still want to prohibit gratuitous use of Dynamic, so how about this as a compromise?
    • The fields Name and Version MUST NOT be marked as Dynamic.
    • Backends MUST NOT mark a field as Dynamic if they can determine that it was generated from data that will not change at build time. (This is intentionally a bit vague, to allow backends flexibility to decide how hard they try to determine if the data is static - I expect setuptools to initially just consider setup.cfg and pyproject.toml to be static, but maybe to add checks for setup() later, if they feel it’s useful - and I want the spec to allow that).
    • (the existing point) Backends SHOULD encourage projects to specify metadata statically, preferring to use environment markers on static values to adapt to details of the install location.
  3. I’m not going to fight for disallowing values against Dynamic or Dynamic in non-sdists, so how about:
    • Backends MAY record the value they calculated for a field they mark as Dynamic in a sdist. Consumers, however, MUST NOT treat this value as canonical, but MAY use it as a hint about what the final value in a wheel could be.
    • In any context other than a sdist, if a field is marked as Dynamic, that indicates that the value was generated at wheel build time and may not match the value in the sdist (or in other builds of this project). Backends are not required to record this information, though, and consumers MUST NOT assume that the lack of a Dynamic marking has any significance, except in a sdist.
  4. This has already been covered, but the location of the metadata remains as specified in PEP 517, and this is in the packaging.python.org spec update.
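The rules proposed in points 2 and 3 above could be checked and consumed roughly as follows. This is only a sketch of the compromise wording as stated in this post, not of any final specification; forbidden_dynamic_markers and read_field are hypothetical helper names.

```python
from email.parser import HeaderParser

# Fields that MUST NOT be marked as Dynamic under the proposed compromise.
NEVER_DYNAMIC = {"name", "version"}


def forbidden_dynamic_markers(metadata):
    """List Dynamic markers that name a field which must stay static."""
    marked = metadata.get_all("Dynamic") or []
    return [name for name in marked if name.lower() in NEVER_DYNAMIC]


def read_field(metadata, field):
    """Return (values, is_hint) for a field read from sdist metadata.

    is_hint is True when the field is marked Dynamic, in which case the
    values must not be treated as canonical.
    """
    marked = {name.lower() for name in metadata.get_all("Dynamic") or []}
    return metadata.get_all(field) or [], field.lower() in marked


PKG_INFO = (
    "Metadata-Version: 2.2\nName: demo\nVersion: 1.0\n"
    "Dynamic: Version\nDynamic: Requires-Dist\nRequires-Dist: attrs\n"
)
metadata = HeaderParser().parsestr(PKG_INFO)
print(forbidden_dynamic_markers(metadata))
print(read_field(metadata, "Requires-Dist"))
print(read_field(metadata, "Name"))
```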

I’ll rewrite the rationale section of the PEP to take into account your previous comments as well - I agree the current rationale is weak, and unnecessarily tied to PEP 621. And I’ll sort out the other points you mentioned at the same time.

I think that’s all of the outstanding points on the PEP. Did I miss anything?

I’ve weakened that to “source distributions SHOULD use the latest version of the core metadata specification that was available when they were created”. As it stands, if we create a new version of the metadata spec, we instantly invalidate all existing sdists, which is silly and wasn’t the intention.

Does the name include information about the Provenance of the data item? Which agent generated the value? When? (At sdist build time.) Did they sign it?

Any such metadata can be more efficiently modeled with a schema that describes each data item.

FWIW, in terms of normative language in regards to schema,

RDFS+SHACL and/or JSONschema are two ways to model (meta)data schema which contains enough information to choose a widget and also do client-side validation.

W3C PROV in Python:
a1 = document.activity('a1', datetime.datetime.now(), None, {prov.PROV_TYPE: "edit"})
# References can be qnames or ProvRecord objects themselves
document.wasGeneratedBy(e2, a1, None, {'ex:fct': "save"})
document.wasAssociatedWith('a1', 'ag2', None, None, {prov.PROV_ROLE: "author"})
document.agent('ag2', {prov.PROV_TYPE: 'prov:Person', 'ex:name': "Bob"})

^^ That generates triples and/or JSON-LD.

More complete examples https://github.com/trungdong/prov/blob/master/src/prov/tests/examples.py :

The spec for describing how Agents’ Activities generated/derived/used which Entity


It’s probably pretty easy to generate PROV JSON-LD without the (convenient) python prov library, or indeed any understanding beyond that the attribute names start with prov: and the schema is in a separate file.

  • “Dynamic” means “Computed at sdist build time”

  • Which Agent ran and signed that (downstream re-) build/compile Activity which involved the package Entity?

Presumably, the Dynamic value of the Entity metadata attribute is set by the Agent doing an Activity.

Presumably, the Dynamic value of the package metadata attribute is set by the Agent (sometime?)

Is this correct?:

[During/Before/After?] the build/compile Activity, a Dynamic attribute is set to its currently static value.