Brainstorming: Eliminating Dynamic Metadata

pradyunsg · November 17, 2024, 1:42pm

Does anyone have suggestions for “Improving support for static metadata”?

If someone does, let’s collect those in a separate topic. Mixing multiple topics in a brainstorming thread will only lead to chaos.

mitsuhiko · November 17, 2024, 3:29pm

I would draw the opposite conclusion: this is a good example of why dynamic metadata should not exist. That is also the status quo in other ecosystems like for instance node. Metadata is readable just fine in node, but you cannot have it dynamic. Node also does not draw a line between an “editable install” or anything else really.

I am firmly of the opinion that failing to remove dynamic metadata, or at least greatly curtail it, imposes a tremendous tax on Python packaging with very little benefit to the user. In fact, the downsides of dynamic metadata might be significant enough that they also result in a worse user experience due to all the bugs and incompatibilities it can cause.

mikeshardmind · November 17, 2024, 4:06pm

Great, I don’t even disagree, but short of a time machine to stop decades of people from coming to rely on it (and we aren’t going to break them), there’s no viable path to this. So that leaves what I said: ensuring that there’s enough information for things that expect static metadata to know when they don’t have it.

sirosen · November 17, 2024, 4:09pm

They aren’t left with much choice in this matter though, are they? In order for a package to be installed, it needs to have metadata populated. There is no current system for post-hoc “fixing” that installed package metadata when the source changes.

Suppose I have a package with a requirements.txt file which is the source for dependencies. I install it as editable into 4 distinct venvs. I then edit the requirements file to add a dependency. What happens?
Well, “obviously” nothing happens. But the metadata is now desynchronized and needs to be fixed in each environment.

Version is somewhat special, given that it needs to appear in filenames. Dependencies are also a little special, in that fixing them may require attempting package installs.

What this highlights is that editable installs only work in a basic way. As soon as you expect editables to result in self-consistent environments, they break down.

IMO the notion that an editable install should work 100% as well as a non-editable install puts us in a position of making the perfect the enemy of the good. Editables are useful but also problematic and flawed.
I think that getting rid of the notion that version is dynamic – by providing good ways of making it static, not by some pointless diktat – would actually make things better for users of editable installs, since it would make it more obvious that there is some desynced metadata. People treat this as “obvious” with dependencies, which reveals that it’s not the failure to automatically fix environments that’s at issue.

pf_moore · November 17, 2024, 4:39pm

This was very much the position we ended up in with PEP 660. Nobody has ever come up with a viable way of producing a 100% editable install (where any change you make to the source is immediately reflected in all editably-installed copies). On the other hand, no-one has ever written a complete specification of what is and isn’t allowed to require a reinstall even in an editable install. But equally, there’s no realistic way of removing editable installs - they are simply too useful to many people’s workflows.

I think dynamic metadata is similar. Yes, of course it would be nice to not allow it. And maybe, if we had a clean slate, we’d design a packaging system that didn’t allow it. But that’s not the reality we live in.

Could someone write a tool that prohibited dynamic metadata? Yes, of course they could - the mechanisms exist for build frontends and installers to report an error if they ever see metadata marked as dynamic in any built artifact. Would such a tool have sufficient advantages to gain a user base? I have no idea. But if anyone wants to try eliminating dynamic metadata, that’s probably the best way of doing so. Create a tool that demonstrates the benefits of static-only metadata, and then see if those benefits are sufficient to persuade users to adopt the tool. With sufficient popularity, there would be user pressure on projects to be compatible with static-only-tool.
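As a rough illustration of the mechanism mentioned above, the check such a tool would perform can be sketched in a few lines. This assumes the artifact's core metadata (PKG-INFO in an sdist, METADATA in a wheel) uses metadata version 2.2 or later, where `Dynamic` is a multiple-use field. The function names and the sample metadata are hypothetical and exist only for this sketch.

```python
from email.parser import HeaderParser


def dynamic_fields(metadata_text: str) -> list[str]:
    """Return the field names marked `Dynamic` in a core-metadata document."""
    headers = HeaderParser().parsestr(metadata_text)
    # `Dynamic` may appear several times, once per field name.
    return [value.strip().lower() for value in headers.get_all("Dynamic", [])]


def check_static_only(metadata_text: str) -> None:
    """Raise ValueError if any field is marked as dynamic."""
    found = dynamic_fields(metadata_text)
    if found:
        raise ValueError(f"dynamic metadata is not allowed: {', '.join(found)}")


# Hypothetical PKG-INFO content, for illustration only.
EXAMPLE_PKG_INFO = """\
Metadata-Version: 2.2
Name: example-project
Version: 1.0.0
Dynamic: Requires-Dist
Dynamic: License
"""

if __name__ == "__main__":
    print(dynamic_fields(EXAMPLE_PKG_INFO))
```

A real tool would also have to decide what to do with artifacts that predate metadata version 2.2, where the absence of `Dynamic` says nothing either way.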

A gradual, user and benefits driven approach like this is almost certainly far more useful and likely to succeed^[1] than any amount of debate over how to remove dynamic data from the standards in an abstract sense.

[1] To be clear, I still don’t think it will succeed, but it’s more likely than any alternative.

oscarbenjamin · November 17, 2024, 5:30pm

This is really because all of the standards have been designed with the hope that dynamic metadata could be eliminated rather than attempting to accommodate the situations where projects currently make use of it. An editable install is one such situation: the real version of the actual package code is not static so the metadata can only be static by being wrong. The whole notion of an editable install is in direct contradiction with the idea of static metadata.

It would not be hard to standardise something about editable installs or source trees so that the version could be dynamic but still more efficiently accessible. No one has standardised such a thing because the people pushing to standardise these sorts of things have always wanted to insist that all metadata must be fully static. The dynamic = ['version'] escape hatch was conceded but the standardisers did not want to improve on that as a mechanism because they hoped the whole dynamic business would eventually just go away.

In other situations the version could really be static but the standards/tooling don’t allow projects to do whatever it is they want to do without using dynamic = ['version']. An example is the hatch version command: you can’t use hatch to update the version if the version is written statically in pyproject.toml. There is no reason why having a version statically in pyproject.toml needs to be mutually exclusive with being able to update that version with a command but that is the current situation.
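To make the last point concrete, here is a minimal sketch of a command-style helper that rewrites a static version in pyproject.toml in place. It is not how hatch or any other existing tool works. The function name is hypothetical, and the line-based matching is a simplification: it assumes the version sits on its own line in the `[project]` table, and it is used instead of a TOML library so that comments and formatting survive.

```python
import re

# Matches a table header such as [project] or [[tool.x]].
_TABLE = re.compile(r"^\s*\[{1,2}\s*([^\]\[]+?)\s*\]{1,2}\s*(#.*)?$")
# Matches `version = "..."` and keeps any trailing comment in group 4.
_VERSION = re.compile(r"""^(\s*version\s*=\s*)(["'])(.*?)\2(.*)$""")


def set_static_version(pyproject_text: str, new_version: str) -> str:
    """Return the text with the static [project] version replaced."""
    lines = pyproject_text.splitlines(keepends=True)
    table = None
    for index, line in enumerate(lines):
        header = _TABLE.match(line)
        if header:
            table = header.group(1)
            continue
        if table != "project":
            continue
        body = line.rstrip("\r\n")
        match = _VERSION.match(body)
        if match:
            prefix, quote, _old, rest = match.groups()
            line_ending = line[len(body):]
            lines[index] = f"{prefix}{quote}{new_version}{quote}{rest}{line_ending}"
            return "".join(lines)
    raise ValueError("no static version found in the [project] table")
```

Calling `set_static_version(text, "1.1.0")` leaves every other line untouched, including a `version` key that lives in some unrelated table.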

sirosen · November 17, 2024, 7:18pm

I don’t agree. Or at least, I think this is more fraught than your characterization makes it sound.

Sticking with setuptools_scm, getting the version involves inspection of a deep clone of a repo. It’s slow, relative to a simple file read, and the results can change while a process is running. Both of those sound to me like problems – for example if I grab the version of my package via importlib in a test fixture, I could repeatedly be pulling version info while tests run.

You could say I’m overthinking it, maybe I am, but my instinct for design tells me that this “feels wrong”.

I’d be 100% onboard with improvements for the behavior of editable installs. I don’t know that it’s necessarily in conflict or even in tension with “improving editable installs by making it clearer that versions are static and that changing them currently requires a reinstall”.

pf_moore · November 17, 2024, 7:39pm

I think you’re being unreasonably critical of the standards process here. If you have a concrete proposal for how we could have standardised editable installs (or a proposal for standardising installing from source trees) I’d be more than happy to hear it.

Existing standards haven’t so much been “hoping that dynamic metadata could be eliminated” as trying to minimise the problems that dynamic metadata already caused. You must remember that historically, all metadata was dynamic (in distutils and setuptools). We didn’t deliberately create standards that ignored all of that history - quite the opposite, we were trying to find ways of fixing the issues that history caused, while still leaving the flexibility there for cases that (as far as anyone could tell) genuinely needed it.

If anything, it’s this discussion that’s misguided because it starts from the extreme assumption that there are no use cases which can’t be handled without dynamic metadata - an assumption that doesn’t seem to have any basis in the history of Python packaging so far.

I frankly resent that characterisation of the standards process. No one even offered an improved approach, so what could we have based such an “improved standard” on? Nothing was done in secret, and there was no “clique” of privileged “standardisers”. If we don’t have anything better, that’s on the whole community for not coming up with something.

Agreed. Again, though, it’s down to someone from the community having the interest and commitment to create a PEP, manage the discussion, and produce a proposal that gets consensus from the community. And not many people seem willing to put that much effort into addressing these sorts of issue.

oscarbenjamin · November 17, 2024, 8:49pm

I don’t mean to be critical of the standards process. I think it has played out reasonably over time and improved many things. It has usually been driven forwards though by people who would have preferred to eliminate dynamic metadata but obviously needed to compromise with the existing situation as it was. The end result is that it is possible to have static metadata now but the dynamic escape hatch still exists and is widely used despite some people wishing that it wasn’t.

Sorry, I put that badly. The fact that no one offered an improved approach was precisely my point. It is not that the standards process rejected them but rather that the people who were interested in pursuing standards were not typically interested in improving support for dynamic metadata besides ensuring that the escape hatch itself was standardised so that the line of what needed to be supported by frontend tools was clearly marked. (Which is perfectly reasonable: I don’t mean this as a criticism of those people or those standards.)

In the end though we have different people with different objectives. Most projects using dynamic metadata probably don’t see it as a problem and don’t even realise that there are other people who wish they weren’t doing it. As far as they are concerned the dynamic metadata does what they want so they don’t see a need to standardise anything.

I don’t think you can do anything about the results changing mid-process. In this sort of realm you could imagine someone installing and uninstalling things mid process and then nothing is static. Unless you have some sort of lock on the environment this is unavoidable.

For the slowness part the obvious solution is to cache the value somewhere. What could be standardised is a mechanism to know if the cached value was invalidated without invoking the build backend.
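A minimal sketch of that idea follows, assuming the set of files that influence the value is known up front. Knowing that set is exactly the part a standard would have to provide. The names are hypothetical, and the fingerprint (modification time plus size) is one cheap choice among several.

```python
import os


def fingerprint(paths):
    """Return a cheap fingerprint (mtime and size) for each input file."""
    result = {}
    for path in paths:
        try:
            stat = os.stat(path)
            result[str(path)] = (stat.st_mtime_ns, stat.st_size)
        except FileNotFoundError:
            # A missing file is recorded too, so deleting it invalidates the cache.
            result[str(path)] = None
    return result


class CachedValue:
    """Cache an expensive computation until one of its input files changes."""

    def __init__(self, compute, inputs):
        self._compute = compute      # e.g. a call into the build backend
        self._inputs = list(inputs)  # files the value is assumed to depend on
        self._fingerprint = None
        self._value = None

    def get(self):
        current = fingerprint(self._inputs)
        if current != self._fingerprint:
            # Inputs changed (or first call): recompute and remember the state.
            self._value = self._compute()
            self._fingerprint = current
        return self._value
```

The expensive `compute` callable only runs when the recorded fingerprint no longer matches, so repeated lookups cost a few `stat` calls.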

charliermarsh · November 17, 2024, 9:36pm

Yeah, the SCM-based workflows make this especially hard, since suddenly the version is dependent on mutable state that is independent from the source. In uv, we allow you to declare that a package’s metadata is mutable on Git. Like, you can declare:

cache-keys = [{ git = true }]

Which would invalidate package metadata when the commit changes. But it turned out this wasn’t sufficient for some use-cases… Since in some of these SCM tools, if you tag the current commit, then the dynamic version will change. So even given a fixed commit, you can’t assume a fixed version.

pf_moore · November 17, 2024, 10:08pm

Maybe, but I’ve no idea how we could have avoided that given that people who wanted to lean into the idea of dynamic metadata were just as able to contribute, but didn’t. It’s not as if we could force people to participate in the process…

I don’t even think I’d characterise myself as “wishing that dynamic metadata didn’t exist”. I wish the problems caused by dynamic metadata didn’t exist, certainly. But if those problems can be solved without limiting the use of dynamic metadata, I’d be perfectly happy. And in actual fact, I’ve never personally been in a situation where dynamic metadata has caused me a problem - so from a purely personal perspective I’m fine with the current situation^[1].

I’d put that the other way round - the people interested in improving support for dynamic metadata^[2] weren’t interested in contributing to the standards process.

This may well be true. It’s hard to know what to do about it, though. In many ways, framed like this it simply becomes another example of how it’s almost impossible to make progress on packaging standards because we always have to assume the worst when it comes to backward compatibility - that any existing behaviour, no matter how “obviously wrong-headed” it might seem to people, will be an absolutely critical part of somebody’s (and probably significantly more than one person’s) workflow.

[1] Apart from the fact that it makes my life as PEP delegate harder, because covering all possible edge cases in a standard is more difficult.
[2] Assuming they exist. I’ve read a lot of posts about packaging, and from what I can recall, I’ve seen basically no-one say “we should make using dynamic metadata easier”.

bwoodsend · November 17, 2024, 10:17pm

Specifically for dynamic versions (and feel free to punt this off into a separate thread if it gains any legs), I am thinking of a middle ground where we extend the PEP 621 standard to allow specifying a plain filename and regex pattern to find the canonical version definition. For example, the most common use case would look like:

# pyproject.toml
[project]
version = { file = "package/__init__.py", regex = "__version__ = ['\"](.+)['\"]" }

This would be usable in place of setuptools’s version.attr or version.file and would likely also work for cases like the above-mentioned version in Cargo.toml. It would be trivial to evaluate, wouldn’t require invoking a backend, and tells you explicitly which file’s mtime you need to query to verify cache validity.
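To show how little a frontend would need to do, here is a sketch of evaluating such a declaration. The `file` and `regex` keys are the proposal above, not an existing standard, and `read_version` is a hypothetical helper name. The pattern in the usage note is the one from the post, loosened to allow any whitespace around `=`.

```python
import re
from pathlib import Path


def read_version(project_root, file, regex):
    """Return the first capture group of `regex` found in `file`."""
    text = (Path(project_root) / file).read_text(encoding="utf-8")
    match = re.search(regex, text, flags=re.MULTILINE)
    if match is None:
        raise ValueError(f"no version matching {regex!r} in {file}")
    return match.group(1)
```

For example, `read_version(".", "package/__init__.py", r"""__version__\s*=\s*['"](.+)['"]""")` reads one file and runs one regex, with no backend involved. The same path is the only one whose modification time matters for cache validity.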

I know it doesn’t solve the setuptools_scm case (which I consider unsolvable given that any non-gitignored file can toggle the dirty state and invalidate the cache… and that’s just git) but I’d think it would still cover a lot of others?

(This is all assuming that merely reducing the needs for dynamic backend provided metadata without eliminating it is still useful?)

I guess the other thing I would say is: is it really the end of the world if a workflow manager errs on the side of not invalidating a cache and requires a manual tool sync command whenever a user touches their metadata and needs it to propagate?

oscarbenjamin · November 17, 2024, 10:39pm

It has never caused me a problem either and I’m unclear how this comes up. As you pointed out earlier the version must be static in an sdist and likewise in a wheel. If the dynamic version is a problem for you then it means that you are not installing from sdist or wheel. It doesn’t happen with an editable install though because editable installs record a (false) static version. If you install from source tree or vcs non-editably then you get a static version after install…

Maybe this is a uv thing that it actually does a dynamic version check in an editable install when running uv sync?

Or this is something to do with monorepos?

mitsuhiko · November 17, 2024, 11:32pm

I don’t believe that this is correct, for two reasons:

In the future the metadata system could be changed so that for editable installs pyproject.toml is used directly, which is exactly how the JS ecosystem consults package.json
Even if we would always want to have the metadata replicated, for as long as outdated metadata can be detected, systems like uv can automatically re-install the package.

The challenge with dynamic metadata today comes in part from it not being clear when to invalidate it.

I’m not sure why you conclude this. I started that discussion because I want to see what a world without dynamic metadata looks like. That world is not impossible to imagine given that it’s the status quo in other ecosystems. However it’s not true that there is no value in it. In fact to quote myself:

konstin · November 18, 2024, 1:34pm

From my uv perspective, there are different kinds of metadata: “load bearing” metadata such as name, version, dependencies (incl. optional deps and dep groups) and possibly license, and informational metadata, such as the final layout of the readme.

For caching and dependency resolution, it’s fine if there’s a readme transform that needs to be run for the publish-ready package; It would even be fine if the readme was declared as static but with a note to run a certain transform when generating a wheel: The untransformed readme is still good to show to the user as a tool or IDE (it works in the repo), we only need this transform since pypi has different constraints than github.

To share three use cases for static metadata:

The user depends on a package that is only available as a source dist. If the source dist has static name, version and dependencies available, a resolver can write a lockfile without building or without even downloading the file.
The user depends on a git repository. With a static pyproject.toml, we can fetch a single file; without it, we need to check the whole repo out, install the build deps in a temporary environment, invoke the build hooks and only then do we get the metadata.
All major tools I’ve checked are caching source dist builds. That means if a source dist is currently emitting different metadata depending on the build context, it can already break even pip users. That is to say that there are some constraints that are already implicitly encoded in the ecosystem.

Personally, I’ve definitely seen a lot of users being confused by caching behaviors and not understanding why things don’t get updated.

As a note for the editables discussion, we can have editables depending on editables, either for workspaces or for patching dependencies, i.e. we need to accommodate cases where the editable isn’t only a leaf in the dep tree.

barry · November 18, 2024, 7:14pm

That’s “just” a missing feature for which there’s even a PR.

groodt · November 18, 2024, 9:59pm

Wheels via PyPI, standards-compliant indices, and direct_urls already have static metadata.

The ecosystem really is at the edge of what’s possible in a compatible way. Dynamic behaviour with Turing complete code execution just seems like something that won’t disappear without upsetting or breaking something for some users.

It’s yet-another-fight where consensus is never reached, but maybe an option is to group all the non-deterministic issues together. e.g. What if we consider editables, sdist and source-tree / VCS dependencies as non-deterministic somehow? And recommend for them to be opt-in for lockers / installers. Maybe that’s a line in the sand?

I’m only vaguely aware of other language ecosystems such as Java (Maven Central) and JavaScript (npm), but as far as I’m aware, their published artifact distributions might host source, but don’t handle them via resolvers by default and the majority of projects won’t be taking source distributions. Users who do take on source distributions are considered to be experts and then have the guardrails taken off without impacting the overall community.

Any standard lockfile only needs to standardize on locking wheels. Any other locks would be tool-specific (and only with an opt-in). It would require work, documentation, and community effort to make it easier to publish and host wheels, on PyPI or in another public place such as GitHub releases. There is a long history of reluctance among maintainers to publish wheels (nuisance PRs), and for a pure Python sdist I can almost understand the reluctance. But at the same time, it really does create a “tragedy of the commons” scenario. It’s nobody’s fault that things aren’t working as well as they could, but nobody is empowered or willing to make the tough decisions to improve the situation.

dimaqq · November 19, 2024, 3:18am

A comment on the problem as stated at the top of the thread.

Arguably, the most common readme approach, referencing a readme.md file, is dynamic too. Invalidation is easier (a single file), and yet it can change on every commit, or even via pre-commit (end of file fixer or when bullet points get reformatted).

I think that hatch-like readme assembler step stems from folks wanting to have a neat “homepage” for their package on pypi, and apparently that’s only set by pushing a new latest version? If so, an alternative approach would be to let maintainers edit or update their package “front page” on pypi outside of publishing.

The version, in my limited experience, stems from being able to access own version at runtime: to display in some UI, to report in telemetry, or user-agent, under the presumption that some server may track that or provide a work-around for older clients.

I’ve seen a bunch of solutions to that, from manually maintaining two sources of truth, to having setup.py read a Python file, to the opposite of using packaging at runtime to detect own version (not great if it’s a library with a wide range of Pythons supported). Some use CI to automate the two sources of truth problem, or provide the “real” version out of band, e.g. a separate file or env var in a Docker image.
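One of the options listed, asking the installed metadata at runtime, looks roughly like this with the standard library's `importlib.metadata` (Python 3.8 and later). The helper name and the fallback string are assumptions made for the sketch.

```python
from importlib.metadata import PackageNotFoundError, version


def runtime_version(distribution_name, fallback="0+unknown"):
    """Return the installed version of a distribution, or a fallback."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        # Not installed at all, e.g. running straight from a source checkout.
        return fallback
```

Note that this reports whatever was recorded at install time, so for an editable install it will not follow later changes in the source tree.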

Personally I’m a fan of pre-commit style automation and keeping pyproject.toml simple.

It’s something I’ve experimented with for requirements.txt vs pyproject.toml vs lock file. It can be done and it’s pretty clean.

If static values for these specific fields were mandated in pyproject.toml, the world would adjust.

Likewise if all wheels for a given version were required to have metadata in sync, I believe the world would adjust.

ferdnyc · November 19, 2024, 3:52am

Since people have been making comparisons to npm and the like, let me tell you a little story about Maven — Apache Maven, to give it its full title.

Maven is a Java build/packaging system, the “modern” one that anyone who’s not a graybeard still clinging to gradle or ant uses.

Despite being relatively modern, it suffers from a rigidity problem^[1] in the way that you might expect from projects with Apache in front of their name. All project data is specified in a pom.xml file with a typically verbose structure, and dynamic data was basically completely unsupported — elements could make reference to data from other parts of the config, but it all had to resolve to a static value item at some point.

So, as a result, lots of projects contain pom.xml files with hardcoded values like

<project>
  <groupId>org.example</groupId>
  <name>my-thing</name>
  <version>1.0.0-SNAPSHOT</version>
</project>

And if you have a more complex, hierarchical project structure with multiple interdependent pom.xml files, the <parent> reference in each child pom.xml would use the same hardcoded <version> specified in the parent pom.xml. Redundantly also hardcoding the child version was a nuisance, so a Maven “flatten” plugin was introduced to expand any references into static metadata for deployment purposes.

Is version 1.0.0-SNAPSHOT representative of the current checked-in version of the code? Probably not. Release-automation tools will generate a POM for packaging with a version from SCM without having to update the checked-in pom.xml contents, so the version there is basically meaningless — there just has to be SOME version. Who’s going to bother also going in and updating that value in multiple places, then committing those changes into version control?

True dynamic metadata for package version (just the package version) wasn’t supported at all until Maven 3.5.0 (2017), when CI friendly versions were added as a new feature. That feature, in typically rigid style, added three (only three) dynamic properties, representing composable parts of a project <version> value. One or more of those properties could be ${} referenced in the <version> child element of the <project> or <project><parent> element (only the <version> element, and only inside a <project> element or <parent> element inside a <project> element), with values settable on the command line when running mvn to process the pom.xml.

To prevent having to require command-line values, a hardcoded initial value for each property could also be stored in the file, while still allowing command-line override. So now, the meaningless hardcoded version value is relocated to a property in the parent POM, and referenced in any child-project POMs. We’ve successfully single-sourced our meaningless placeholder value. Woo!

Oh, and since there’s dynamic info in the child POMs, use of the flatten plugin is now required to deploy or package anything built from a child POM definition. But the flatten plugin interacts badly with other plugins commonly used for packaging and deployment, so even the CI friendly features aren’t usable in some cases.

Point of all this is:

Dynamic metadata is a thing whether the project wants to support it or not
Even projects as rigid as Apache eventually succumb to the need to provide support for it
That goes a LOT worse if it’s done after-the-fact and half-assedly

[1] Then again, doesn’t the Java ecosystem as a whole?

groodt · November 19, 2024, 7:36am

I think there are key differences:

From resolver / installer / package index POV, nothing is dynamic. It’s all baked at publish time.
There is arcane complexity as you describe it, but that’s handled by the publishers, who are in the minority compared to the downstream users in the ecosystem.

I forgot about that SNAPSHOT sentinel value convention. Thanks for the memories. I think I’m correct that Maven Central lets you repeatedly publish over a SNAPSHOT version and they’re also not cached. But my memory gets a little hazy here.