Brainstorming: Eliminating Dynamic Metadata

To ask a more concrete question, what existing solutions are out there for replacing the run-of-the-mill use of dynamic versions from setuptools-scm with static metadata? I guess one essential feature is to have version bumps triggered by, e.g., GitHub releases. And I guess a desirable feature is to have local development versions.

I disagree. That dynamic metadata is used, and that it has uses, is obvious; discussing it does not yield new information. It is the known status quo. What needs to be figured out is not a complete enumeration of all the things that will break, but how the common things that would break can be solved instead.

Okay, let’s try from there. Projects I work in have reviewers who approve changes from code review that then get merged to official version control. These projects also have release managers who decide which specific already-merged commits within the repository will be release points, and assign versions to them by pushing version control tags. The tag events are picked up by a CI system, and the state corresponding to the tagged commits is then packaged, with a Python packaging build backend determining the package version metadata from the tag name. The build process stores the literal version string it computed into sdist and wheel metadata files.

First, do you consider this dynamic metadata in the sense that you’re looking to eliminate? If so, how would you solve it? Would the solution necessarily require altering access and permissions, assigned roles, and workflows?

Given the existence of dynamic metadata today, I’m not sure there has been much of an appetite for this sort of thing. A common alternative is to provide commands to bump versions in pyproject.toml:
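For instance (tool names given purely as illustrations, not an exhaustive list), poetry version patch and hatch version minor bump the statically declared version in pyproject.toml in place, so what gets released is ordinary static metadata.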

I’m not sure what other things exist in the ecosystem today.

It’s quite clear that in Python today, local development builds sometimes get a distinct version. But do we have to do that? That behavior is not something other ecosystems necessarily attribute a lot of value to. Whatever version ends up in package.json is the version that shows up through discovery locally, and the same is true for Cargo.toml in Rust. That is typically a manually set X.Y.Z.dev0 version, or even just the most recent or next release. That’s not perfect by any means, but it seemingly is acceptable in other ecosystems. The SNAPSHOT approach in Java was also already mentioned as an alternative.
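A minimal sketch of the equivalent in Python, assuming a project simply commits its in-development version statically (project name and version invented for illustration):

[project]
name = "example"
# Committed as-is between releases; only edited when cutting a release.
version = "1.3.0.dev0"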

My baseline take is that any metadata that is not serialized into pyproject.toml I would consider dynamic metadata.

Yes, I would assume that this solution would require a change to the CI setup. You would already have to do that when performing a release for JavaScript, Rust, Go, Java or probably any other package ecosystem. As for what permissions and the like are needed, I cannot answer that, because that really seems to be a CI question and not a metadata question.

For an example of an alternative flow: any developer at Sentry can trigger a release, but they cannot publish it. There is a workflow action called release, which asks for a tag. That will perform a commit to a branch with the necessary changes, and create a pull request via a bot to a release publish repository where a release manager can sign it off. An example of how that works: publish: getsentry/sentry-python@2.19.0 · Issue #4631 · getsentry/publish · GitHub (triggering commit: release: 2.19.0 · getsentry/sentry-python@c83e742 · GitHub)

2 Likes

I think these two statements point out the crux of why this isn’t going to be productive. You’re framing dynamic metadata as something we shouldn’t still have, and it comes off like you’d rather force everyone else to do more work for a use case they may not even care about. I’ve been using Python for almost two decades now, and I’ve run into an issue with dynamic metadata exactly once in all of that time.

I don’t think there’s a future where we eliminate dynamic metadata entirely. Not within the existing packaging ecosystem, and not without breaking untold numbers of use cases that may not even be visible to us. Forcing that on users isn’t going to make people happier about packaging; it’s going to result in them installing something that no longer marks itself as a development version, and then getting less useful information, not more.

What do you see as the actual reason why dynamic metadata is a problem?

All I’ve seen here is that editable installs can sometimes be problematic, but that actually has nothing to do with dynamic metadata. The same thing can break with an editable install and static metadata: if the source of the static metadata is changed, the change isn’t picked up without reinstallation. Dynamic metadata is orthogonal to that. If it’s dynamic from the perspective of the build system, but not from the perspective of the environments installed into, I don’t see any issue.

3 Likes

That is correct, as outlined above I strongly believe that dynamic metadata should not exist.

Which might very well be the outcome of this entire thing. We might realize that dynamic metadata is so entrenched that there is not enough appetite to resolve this. The likely consequence of that, though, will be that the problems dynamic metadata causes will continue to result in sub-par experiences with installers that want to find ways to cache metadata or to rely on the metadata provided by PyPI.

I understand that this is probably the predominant view, but I think it mostly comes from familiarity with the current situation, the feeling that removing dynamic metadata would remove flexibility, and the sense that the shift away from it might be painful for the ecosystem. I tend to think that such things can be solved if there is an appetite to change them. For instance, installers could continue to support dynamic metadata but warn users who keep relying on packages that use it, over a long deprecation period. That might be enough to slowly make the feature less attractive. Installers could then start to cache metadata more aggressively, etc.

I think editable installs are a bit of a distraction in this discussion because editable installs are in many ways a problem of their own. Installers today do not use the metadata contained in an sdist; they generate their own upon installation, and there is no requirement that it match what is in the archive.

The actual issue is that dynamic metadata does not come with enough meta-information for an installer or packager to do much with it. We do not know what invalidates it; we do not know if it’s universally true or only true for the current platform. It’s unclear how to cache it because it’s unclear how to re-validate it.

1 Like

Here’s an example of what I mean by that, from real-world code:

from setuptools import setup
import re

def derive_version() -> str:
    version = ''
    with open('discord/__init__.py') as f:
        version = re.search(r'^__version__\s*=\s*[\'"]([^\'"]*)[\'"]', f.read(), re.MULTILINE).group(1)

    if not version:
        raise RuntimeError('version is not set')

    if version.endswith(('a', 'b', 'rc')):
        # append version identifier based on commit count
        try:
            import subprocess

            p = subprocess.Popen(['git', 'rev-list', '--count', 'HEAD'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            out, err = p.communicate()
            if out:
                version += out.decode('utf-8').strip()
            p = subprocess.Popen(['git', 'rev-parse', '--short', 'HEAD'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            out, err = p.communicate()
            if out:
                version += '+g' + out.decode('utf-8').strip()
        except Exception:
            pass

    return version


setup(version=derive_version())

When installed from published releases on PyPI, the version on PyPI just is the version.

When installed from Git (typically when users want to use new features, either to test them or because Discord made changes but there hasn’t been a stable tested release yet), they get a version that carries useful information (specifically in the correct part, the local version label!), but this isn’t dynamic once installed (the intent here is not for editable-install use).

Then don’t cache it! importlib.metadata can still record the installed version, but resolvers shouldn’t consider it portable information to use as a cache. It won’t change out from under you unless it’s an editable install.

1 Like

To be clear: I know. I have written code like this before. It’s quite widespread in Python and I believe that this is a problem.

But if I don’t cache it, then I am forced to re-install the package all the time. I cannot even keep a wheel around locally, because how do I know that the metadata was not supposed to change? If we play this to its logical conclusion, you could not rely on the PyPI-published metadata for an sdist at all, and you would have to build every single one of them to see what actually comes out.

Now I realize that version is a special case here, because on PyPI at least it is supposed to be static even for sdists, but the problem really is exactly the same for the dependency array.

Saying “don’t do that” is obviously a thing you can do, but the reality is that resolvers today very much rely on being able to trust the metadata available on PyPI, and on that metadata not randomly changing. At which point we’re mostly down to sdists. An sdist will have published metadata, but it’s entirely discarded and replaced with locally created metadata instead.


A rather modest proposal would be to keep the entire system in place, but force setuptools and other systems to discard all dynamic sections in favor of what is already in the PKG-INFO of the archive.

1 Like

I’m going to heavily disagree with this.

Okay, so this is where I think we can at least find common ground to work to alleviate some of your concerns. If the goal is determining when to invalidate or not, I think we can come up with some rules and a way to mark them (a hypothetical sketch follows the list below).

Version: we should be able to say

  • this isn’t allowed to change for an sdist
  • this may be possible to change for a PEP 508 dependency link to a VCS

Dependencies (the other field I’ve seen a reason for being dynamic, especially with GPU-accelerated code):

  • We may be able to make this static instead with more capable environment markers
  • Otherwise, we need a way to mark a dependency as depending on a hardware capability, and possibly on specific things being present.
  • It might be possible to allow marking “this was calculated based on hardware capabilities”, which would mean it should only end up invalid if someone makes it cease working by removing/changing a relied-upon hardware component.

Readme and description: I think leaving these dynamic without any invalidation mechanism is fine and shouldn’t change anything important for caching by installers/resolvers.
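To make that concrete, here is a purely hypothetical sketch of such markers (none of this syntax exists today; the dynamic-policy table and its keys are invented):

[project]
name = "example"
dynamic = ["version", "dependencies"]

# Invented syntax: declare when each dynamic field may legitimately
# change, so installers know what can invalidate a cached build.
[dynamic-policy]
version = { sdist = "frozen", vcs-build = "may-change" }
dependencies = { invalidate-on = ["hardware"] }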

And… you’ve lost me again. I don’t think discarding it is a good idea. Marking it as dynamic, and when possible, marking how to invalidate it, absolutely, but the information is useful in the current state to everyone not trying to use it as a cache.

For an example of an alternative flow: any developer at Sentry can trigger a release, but they cannot publish it. There is a workflow action called release, which asks for a tag. That will perform a commit to a branch with the necessary changes, and create a pull request via a bot to a release publish repository where a release manager can sign it off. An example of how that works: publish: getsentry/sentry-python@2.19.0 · Issue #4631 · getsentry/publish · GitHub (triggering commit: release: 2.19.0 · getsentry/sentry-python@c83e742 · GitHub)

Pushing a new commit means the point-in-time commit which was tested is no longer the one being released. Probably the only acceptable compromise, for the projects I work on, would be to ship a different pyproject.toml file which has been edited to hard-code a version string, and not check that file into version control (or check it in with a placeholder that gets replaced at build time with the actual calculated version string). Is not storing pyproject.toml in version control, or storing a different copy of it than is included in released packages, still too dynamic?

Thinking back, this sounds a lot like the Maven “snapshot” version placeholder workflow described earlier. Since you assert only Python has dynamic metadata, is that Java ecosystem example sufficiently non-dynamic?

2 Likes

A rather modest proposal would be to keep the entire system in place, but force setuptools and other systems to discard all dynamic sections in favor of what is already in the PKG-INFO of the archive.

The setuptools plugin build backends I’m aware of that enable dynamic version strings do precisely this already; certainly the one I’m a maintainer of does. They first look for PKG-INFO and use the version string from it if it exists; otherwise they inspect the state of the Git repository to compute a suitable version string for inclusion in the PKG-INFO they are going to create. (They also check for environment variable overrides before looking at either of those, of course, to give distro package maintainers a convenient escape hatch for setting a specific version of their own at build time.)
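A rough sketch of that precedence, assuming a pure-Python layout (the environment variable name is made up, and real backends differ in the details):

import os
import subprocess
from email.parser import HeaderParser

def compute_version() -> str:
    # 1. Environment override: the escape hatch for distro maintainers.
    override = os.environ.get("EXAMPLE_VERSION_OVERRIDE")
    if override:
        return override
    # 2. Building from an sdist: PKG-INFO exists, trust its baked-in version.
    if os.path.exists("PKG-INFO"):
        with open("PKG-INFO") as f:
            return HeaderParser().parse(f)["Version"]
    # 3. Building from a Git checkout: derive a version from the latest tag.
    return subprocess.check_output(
        ["git", "describe", "--tags"], text=True
    ).strip()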

2 Likes

Would installing things by commit hash rather than version work for you? i.e. The pip install package @ git+ssh://git@xxx.com/yyy/package.git@adc83b19e793491b1c6ea0fd8b46cd9f32e592fc syntax?

It seems weird that you’d go to such lengths to attach a version number to an arbitrary commit when that commit’s hash already uniquely identifies the revision.

1 Like

Would installing things by commit hash rather than version work for you? i.e. The pip install package @ git+ssh://git@xxx.com/yyy/package.git@adc83b19e793491b1c6ea0fd8b46cd9f32e592fc syntax?

It seems weird that you’d go to such lengths to attach a version number to an arbitrary commit when that commit’s hash already uniquely identifies the revision.

We don’t recommend that our users install these packages from Git to begin with, but regardless, a raw commit ID doesn’t convey semantic version nuances, which is part of the reason version numbers are picked by a different set of people than the ones approving arbitrary commits in code review (changes are reviewed and approved in parallel; versions are chosen after the fact in order not to impede developer/reviewer velocity by forcing them to quiesce and set versions during the development process).

I don’t understand how you can come to that conclusion. The commit is literally the one that is being tested. In fact, that is also to some degree necessary. Imagine someone were to funnel something like this into the code base:

if VERSION == "2.0.0":
    # run some code that only runs for release versions
    ...

If you did not have the right version information available (e.g. tests ran against the not-yet-tagged version), it would behave differently.

In an ideal world, what comes out of the build step is a PKG-INFO file that 100% matches what comes bundled in the sdist already. You have already alluded to that above by saying that version and dependencies should not change, no? So what happens if they do change? Today, the output of the build step overrides what was in PKG-INFO, and what happens then is largely undefined (up to the installer).

Let’s say step 1 would be to compare what’s in PKG-INFO with the output of the build step, and warn if they don’t match? I think the net benefit to the ecosystem would be significant if tools could start to rely on the baked-in metadata more.
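As a rough sketch of what such a check could look like (a hypothetical helper an installer might run after a PEP 517 build, not an existing API):

from email.parser import HeaderParser

def warn_on_metadata_drift(sdist_pkg_info: str, built_metadata: str) -> None:
    # PKG-INFO and wheel METADATA share the same RFC 822-style format.
    parse = HeaderParser().parsestr
    shipped, built = parse(sdist_pkg_info), parse(built_metadata)
    for field in ("Version", "Requires-Dist"):
        if shipped.get_all(field) != built.get_all(field):
            print(
                f"warning: {field} differs between the sdist's PKG-INFO "
                f"({shipped.get_all(field)}) and the build output "
                f"({built.get_all(field)})"
            )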

1 Like

Not true, unless the PKG-INFO explicitly allows that, by marking fields as dynamic. And version, in particular, is not allowed to be dynamic in sdists. So if that’s your definition of “dynamic”, I fail to see why we are discussing versions at all. Yes, if you build from a source tree rather than a sdist, version is dynamic (as in, it can change after each build). But so is everything - the user could edit pyproject.toml after all.

Even in the Rust world, version is dynamic if you allow the user to change Cargo.toml or build.rs.

What’s most frustrating about this discussion is that you refuse to clearly define your terms, and you keep changing your arguments. What seems to everyone like trying to understand what you mean by “dynamic”, you attack as “arguing that dynamic is essential”. Unless you can describe what you want to achieve more clearly, in terms that others can understand, I don’t think this discussion is going to achieve anything :slightly_frowning_face:

2 Likes

I think it will be better to close this discussion then. Maybe someone else can make this case clearer, clearly I failed to express myself in a way that’s constructive.

//EDIT: I tried to summarize my thoughts here if there are curious readers. I think @konstin made a good case for the uv angle here.

1 Like

It might help to explain more concretely what it is you are doing when these things become an issue. I don’t understand what you are doing with sdists/wheels/git etc but I assume that it is quite different from my workflows in which I don’t generally have issues with dynamic metadata.

5 Likes

In my case I wouldn’t call it a “need”, but a “makes my life so much better” as it prevents me from having to do any special commit to make a release to PyPI. I can automate calculating the version from the Git repo, create the sdist and wheels, do the release to PyPI, and create a GitHub release with the release notes which also creates the Git tag. And all of that is done via a GitHub Action w/o having to make any commits to the repo before or after a release. Now I could automate the action to do some commit to record the version number, but that’s more work and I personally don’t like having code do automated commits to the repo. And obviously having to do the commit manually is an extra step.

For me personally, what I’m looking for from you is a suggestion that’s going to make my life no worse, and potentially easier, in exchange for giving up dynamic version numbers (and that doesn’t have to be a 1:1 swap involving version numbers). I totally understand the motivation for what you’re after (dynamic exists partially to make it very obvious to people what’s not there, and I helped create the [project] table to get this stuff out of code as much as possible). But people just aren’t feeling the pain enough day-to-day from dynamic metadata, thanks to the tool authors, to want to give it up. Maybe if uv, Hatch, PDM, and Poetry came forward and said, “your lives would be better in these ways if you just gave up dynamic metadata” then that would sway people.

1 Like

Here are four unsolved problems with dynamic metadata.

  1. If the version of a package can change, how can a resolver tell if it will satisfy the constraints?
  2. If the dependency ranges can change on build, how can we serialize a resolution?
  3. How can a package manager tell whether a package needs to be reinstalled or can be reused? More succinctly: What are the exact rules for cache invalidation for non-wheel installs?
  4. How can we perform a fast and deterministic resolution if we have to execute arbitrary code to determine dependency-graph metadata?

These are not academic concerns; these problems are blockers for modern package managers in Python. From my own experience in user support, the “Python packaging is broken” that so many users experience is mostly problems with metadata in some form. From my uv perspective specifically, dynamic metadata and cache invalidation is the only widely requested uv feature that we can’t solve with an engineering effort, because we’re clashing with the (lack of) standards.

If you need one specific question to solve, let it be this: What are the exact rules for cache invalidation for wheel builds with dynamic metadata? When do I need to reinstall an editable install (pure Python, for simplicity)? Under which conditions can I reuse a wheel built from a source dist, a path, or a Git repository?

Even in the Rust world, version is dynamic if you allow the user to change Cargo.toml or build.rs.

In Cargo, the metadata needed by resolvers is static; it cannot change on build. Since Rust is compiled and can’t do symlink-like editable installs the way interpreted Python can, build.rs has a dedicated syntax for cache invalidation of the builds themselves.

A more specific example built on point 1 (dynamic package version), and a lead-up to a feature we could add for dynamic metadata specifically: Say you’re depending on foo and bar. foo depends on bar >=1.0.0,<1.1.0. bar gets a bugfix that we need, so we replace bar with a Git dependency (pinned to a specific commit). bar determines its version from the latest tag in the repository. When we resolve, we get bar 1.0.1, and write this to a PEP 751 lockfile. Between resolution and installation, a new tag is added to the repository, and the build gets tagged as bar 1.1.1. Our resolution/installation is broken, because bar >=1.0.0,<1.1.0 is not fulfilled anymore. That means it’s impossible to write sound PEP 751 lockfiles if we can’t guarantee that the version doesn’t change.

The above scenario is clearly a bad build system. The more common scenario is that projects build every commit and need a unique version for each artifact. Notably, the release part of the version does not change. This is solvable by adding features to PEP 621! We could for example say: bar declares it is at version 1.0.1, plus a local version appended after build. A resolver sees that 1.0.1+<…> always matches >=1.0.0,<1.1.0, so the tree is valid. In the lockfile, we’d equally record this as “1.0.1, plus dynamic local version suffix”. To migrate, the user only needs to declare something like (fictional) project.dynamic-version = ["local"]. This is the kind of outcome I’m interested in from this thread, and what I believe we need for PEP 751 to work.
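Spelled out as a sketch, using the fictional key from above (nothing here is standardized):

[project]
name = "bar"
version = "1.0.1"
# Fictional: the backend may append only a local version label (e.g. +g<sha>)
# at build time; the release segment 1.0.1 is guaranteed not to change.
dynamic-version = ["local"]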

As an already successful example: say a user wants to migrate from keeping the version in __init__.py and reading it with Python code on build, to PEP 621 with a static version. They can use __version__ = importlib.metadata.version("foo"): a one-liner “backport” that avoids a breaking change for downstream users who read foo.__version__ (and potentially have been doing so since before importlib.metadata existed).
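That is, assuming the distribution is named foo and the static version now lives in pyproject.toml:

# foo/__init__.py: keeps foo.__version__ working for downstream users,
# now sourced from the installed distribution's metadata.
import importlib.metadata

__version__ = importlib.metadata.version("foo")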

For point 4, I’m not talking about a 10% performance optimization in uv; I’m talking about the difference between an HTTP request of a few kB at most versus a Git clone or source dist download, resolving a PEP 517 build environment, installing that environment, and running the PEP 517 hook, which is an orders-of-magnitude difference.

Maybe if uv, Hatch, PDM, and Poetry came forward and said, “your lives would be better in these ways if you just gave up dynamic metadata” then that would sway people.

My main motivation for pushing for static metadata isn’t that it makes your life better for publishing, but that dynamic metadata makes everything else worse, indirectly, in tiny pieces. But I’m also willing to make the stronger claim: while it’s more effort to set up initially, especially coming from a very dynamic, scripted setuptools ecosystem, new static-metadata workflows can eventually be much better workflows. release-plz (GitHub - release-plz/release-plz: Publish Rust crates from CI with a Release PR) is a great example, which handles the complete release workflow, including even things such as checking API compatibility.

The tension that I see, and that is, I think, a point that @mitsuhiko tried to make too, is that users have invested in their CI and release workflows in a pre-PEP 621, pre-lockfile world, and these now clash with the needs of modern packaging (long form: https://discuss.python.org/t/pep-751-now-with-graphs/69721/87). To me, the question is: what is the minimal amount of change that enables modern packaging? What features do we need to add, and how can we make the necessary migrations as smooth as possible?

17 Likes

Can we please not conflate high-levelness with modernism. Being able to automagically manage lockfiles and virtual environments (a thing which some people like but others actively avoid) is a requirement of a workflow manager only – not of a package manager being modern.

6 Likes

tl;dr: Thanks for the detailed explanation. Having read it, I’m confident that what you really have a problem with is building from source trees - which are not standardised, as you correctly point out. But that’s not from lack of desire or any sort of blindness to the issue; it’s simply because no-one has yet come up with any sort of workable proposal for a standard. Feel free to be the person to change that, if you want :slightly_smiling_face:

The rest of this post is what I wrote point by point going through your comments. It may be more detail than you’re interested in.

“Can change” under what circumstances, though? This is the problem we keep hitting: no-one explains what the constraints are. If you’re using a sdist or a wheel, the version can’t change. It’s stated very clearly in the standards. If you’re building from a source tree, anything can happen - the standards currently quite deliberately don’t say anything about source trees, because it’s simply too hard a problem to tackle without breaking a lot of currently valid code.

Here’s my canonical example of a pathological source tree.

pyproject.toml:

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

setup.py:

from setuptools import setup
from random import randint

major = randint(1, 10)
minor = randint(1, 10)

setup(name="pathological", version=f"{major}.{minor}")

Even though the example is pathological, this is also not an academic concern. What should a standard say about source trees that would disallow this example without breaking legacy setuptools-based codebases (of which there are almost certainly millions in existence)?

I’m not sure what “serialize a resolution” means. I’ll assume it means “lock the project”. But the answer is basically: fail if you encounter this (or make a simplifying assumption and clearly warn the user, if you prefer). The metadata necessary to tell when it’s going to occur exists[1]. I don’t see why writing a standard that prohibits anyone from doing this is better than the existing standard, which allows tools to say they don’t support it, while still allowing people to do it if they avoid such tools.

For sdists: if the sdist changes, or every time if there’s any dynamic metadata. For source trees: every time. You may not like this answer, but it is accurate (possibly conservative, but I don’t know how to determine statically what arbitrary code can do, so that’s the best I can do…).

You can’t. This is one of the reasons pip is slower than uv - we follow the standards, and they have performance implications.

That’s overstating it. Pip handles all of these situations. It’s slow, I won’t try to claim otherwise, but in my view, performance is secondary to correctness.

I absolutely agree with this. Usually the problem is incorrect metadata, though, which is basically another way of saying “user error”. I don’t think packaging tools can (or should) fix it if the package developer provides incorrect data.

OK, I’ll agree with this. But people have been working on Python packaging for many years now. Are you suggesting that we don’t know that executing arbitrary code at build time is bad for determinism and performance? Without too much exaggeration, I could reasonably claim that every standard created in the last 10 years or more has been aimed at the specific goal of limiting or removing as many ways to use arbitrary code to calculate package metadata as we possibly can.

You can’t wish away the existence of setuptools (and distutils before it). And while you may only see the disadvantages, I lived through the period in Python packaging before distutils, and for all the problems we now see in it, distutils was a huge step forward, and arguably made Python the language it is today. Modern tools are better, but every single one of them has been developed to handle a limited subset of what setuptools can do. We have no idea[2] what other setuptools features are used, in projects we have no visibility of.

If the wheel changes, invalidate the cache. It really is that simple. There is no dynamic metadata in a wheel.

That’s a different question - editable installs are not standard wheel installs, and as they are tied inextricably to source trees, there are no limitations on what might trigger the need to reinstall. See my pathological case above - install that as editable, and you immediately need to reinstall. If I’d made the version be based on the date, you could need to reinstall at midnight every day. And so on. There’s no programmatic answer to this question - the developer needs to understand the code they wrote and, based on that knowledge, reinstall when they need to.
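For concreteness, that date-based variant of the pathological setup.py would be something like:

from datetime import date
from setuptools import setup

# A new version at every midnight: an editable install of this is
# stale within at most 24 hours.
setup(name="pathological", version=date.today().strftime("%Y.%m.%d"))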

No. What makes everything worse is supporting source trees. I’m getting tired of repeating this, but there are no guarantees as soon as you use a source tree. What you are saying here is that you wish there were standards around source trees. Fine - feel free to propose some.

Yes, dynamic metadata[3] is occasionally problematic as well. But prohibiting that will make approximately zero progress on solving any of the problems you state here, all of which are about source trees, not dynamic metadata.


  1. once again, excluding the case of source trees ↩︎

  2. and no way of finding out, without resources that simply aren’t available on a volunteer basis ↩︎

  3. as in, metadata in a sdist that is marked as dynamic ↩︎

16 Likes