PEP 621: Storing project metadata in pyproject.toml

pf_moore · June 26, 2020, 6:57am

(Somewhat off-topic, so if this needs further response I suggest we open a new thread) Pip already tries very hard to avoid a build step when getting metadata. I’d see pyproject.toml as being most useful for dependency data (which isn’t available elsewhere until you build) and for “source trees” (which have no filename to parse).

Dynamic versions annoy me as the maintainer of a metadata consumer, but the benefits of getting rid of them are likely far more to do with code complexity than performance (sadly).

brettcannon · June 26, 2020, 6:13pm

I have merged the simplified Motivation section thanks to Dustin and Pradyun’s reviews!

pganssle · June 26, 2020, 6:17pm

I think that people would simply not adopt this [project] table if it were not possible to specify at least the version dynamically at build time, and probably many other things. It would be a pretty serious regression, even if you ignore everyone who wants their git metadata to be the single source of the version number.

I would be somewhat surprised if this significantly improved the situation with regards to the ability to dynamically process metadata. It will probably be convenient when someone happens to use it and happens to not use any dynamic fields, but I think the real solution to the “static metadata” problem is to standardize metadata in sdists. Once setuptools and other build tools start rolling out changes that standardize metadata, huge swathes of the ecosystem will be opted in automatically, and it should be canonical. While I can imagine situations where “parse the metadata from pyproject.toml without a build” might be useful, I don’t think it will be a significant long-term or short-term solution for things like building a resolver (unless we get this done and then really drop the ball on metadata-in-sdist, or it gets super wide adoption immediately).

I say this not to just try to shoot down the idea that we should be parsing metadata from this, but to try and frame the discussion a bit about what we should be optimizing for. With regards to “static metadata”, I think the ordering of our priorities should be:

1. Pave the way for a future where sdists can contain useful and canonical static metadata files.
2. Make it easy for back-ends to adopt this without any loss of functionality.
3. Design it in such a way that tools can know when it’s safe to parse static metadata directly from the pyproject.toml.

I think the PEP as it stands does a good job of this: fields that are tool-provided are marked as such, and no existing backends care about the few fields that are not allowed to be provided dynamically (as far as I know).

brettcannon · June 26, 2020, 7:07pm

I would be quite happy to start the discussion of standardizing sdists after this PEP is done as that is something I would like to see dealt with in some form. Maybe this will inform that work by saying anything in dynamic must be provided in some supplementary form. Maybe it doesn’t come into play at all and it simply acts as a way to encourage users and build back-ends to specify as much as they can statically upfront and it’s more useful for tools analyzing projects from their source code.

And so I’m happy to have that general goal of dealing with the sdist standardization in the back of our head, but I don’t know if it will necessarily dictate how this PEP turns out.

dstufft · June 26, 2020, 10:50pm

This might be wrong, but I kind of feel like this PEP should probably be targeted specifically at build tools as the consumer of this static metadata for producing builds, and non build tools should still be expected to continue to go through the existing hooks.

I worry that the current PEP gives the impression that given an sdist, you should start reading from this file, and I think that’s the wrong path to go down. There are things that are OK to be dynamic when you’re in development or producing packages, but that should no longer be dynamic once you’re in an sdist. Version is a big one that comes to mind. This PEP doesn’t explicitly tell people to start doing that, but it might be good to explicitly call this out if people agree?

brettcannon · June 27, 2020, 5:11pm

If you look in the Motivation section there’s only a single bullet point that doesn’t say “build back-end”.

I think both you and @pganssle have some notions about sdists which are not written down anywhere and thus have not been fully communicated, since sdists are obviously not fully static by definition since there is no definition.

If it will make you and @pganssle more comfortable with this PEP then I am fine with explicitly stating in the Motivation that this is meant for people working from a source checkout, either for analysis purposes or for a build back-end to produce an artifact, at which point the build artifact’s metadata is considered canonical. In the eyes of this PEP, an sdist is a build artifact and not equivalent to a source checkout.

brettcannon · June 27, 2020, 5:25pm

@dstufft @pganssle and co-authors: I opened https://github.com/python/peps/pull/1474 to clarify how what this PEP proposes shouldn’t be considered the metadata for an sdist. As usual I won’t merge until I have co-author sign-off.

brettcannon · June 28, 2020, 10:42pm

And I got the sign-off, so the change in the Motivation section has been made.

hlovatt · June 29, 2020, 7:59am

It seems like a lot of information at the top of a file. Can it be put at the end of the file as standard? In ‘normal’ writing you would put acknowledgements, references, etc. at the end to save clutter.

FFY00 · June 29, 2020, 2:45pm

Hi, I finally got a chance to properly look over this PEP. I am a bit skeptical.

First of all, I agree it would be great to have a common standardized section to specify core metadata. I do, however, worry about the implications this will have due to the “escape hatch” mechanism.

My main worry is that external tools will start relying on this metadata to get information about the project, or something similar. That would not work well due to the dynamic mechanism. This might not be the intent of the PEP, but I think it is one very possible side-effect.

Actually, there is one sentence that seems to recommend this.

Finally, this PEP is meant for (…) those doing analysis on a source checkout.

But I am not sure my interpretation is correct.

There is no way you can rely on this metadata. Any mechanism that makes use of it will just implement heuristics, which obviously cannot be relied on, and will turn out to be a long-term issue (let me clarify: it might work well short term, with a small sample size, but as you scale it, the bad design will start to show). I think this will be a very attractive option to people getting started designing their tools, and that’s what worries me. We can’t police every tool and let authors know this is a bad idea.

Although I agree it would be good to have a standardized place for the core metadata, I do not see a need for it. And I am not sure it is worth the possible issues introduced by this.

A possible way to mitigate the metadata being misused by external tools would be to introduce PEP 517-like hooks to fetch it. This would also allow the backend to do all the required normalizations; e.g., right now the only field you can rely on is name, and it needs to be normalized as per PEP 503.
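For readers unfamiliar with the normalization mentioned here, the following is a minimal sketch of the name rule from PEP 503. The function name `normalize_name` is just an illustrative choice, not part of any proposal in this thread.

```python
import re


def normalize_name(name: str) -> str:
    """Normalize a project name using the rule given in PEP 503."""
    # Runs of "-", "_" and "." collapse into a single "-", then lowercase.
    return re.sub(r"[-_.]+", "-", name).lower()


print(normalize_name("My_Project.Name"))  # my-project-name
```

Any tool comparing the name field against an index or an installed distribution would presumably need to apply this before comparing.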

My goal with this reply is not to oppose the PEP but rather to question if this is indeed the best step forward? I do not believe there is any right solution, at least not at this moment. This is something that might work out well, or not, we will only see the results in the future. With all that said, I would advise you to be careful when proceeding, and to make sure you think everything through.

pf_moore · June 29, 2020, 2:57pm

You 100% can rely on the data. At least, in the sense that if a field is not mentioned in dynamic you know the final value, and if it is you know that only the backend can tell you. People writing tools that don’t take that into account are simply not following the spec, and yes, that may happen, but we shouldn’t worry too much about that.
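The rule described above can be sketched roughly as follows. This is only an illustration under assumptions: `static_value` is a hypothetical helper, the `pyproject` dict stands in for an already-parsed pyproject.toml, and treating a missing [project] table as "ask the back-end" is one reading of the discussion rather than wording from the PEP.

```python
from typing import Any


def static_value(pyproject: dict, field: str) -> Any:
    """Return a [project] field only when it is safe to read statically.

    Raises LookupError when the value can only come from the build back-end.
    """
    project = pyproject.get("project")
    if project is None:
        # Assumption: without a [project] table nothing can be concluded.
        raise LookupError("no [project] table; ask the build back-end")
    if field in project.get("dynamic", []):
        raise LookupError(f"{field!r} is dynamic; ask the build back-end")
    # Not listed in `dynamic`: what is written (or absent) is the final value.
    return project.get(field)


# Hypothetical parsed pyproject.toml with a dynamic version.
pyproject = {
    "project": {
        "name": "spam",
        "dependencies": ["requests"],
        "dynamic": ["version"],
    }
}

print(static_value(pyproject, "name"))  # spam
try:
    static_value(pyproject, "version")
except LookupError as exc:
    print("fallback needed:", exc)
```

A tool that skips the `dynamic` check is the case described above as "simply not following the spec".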

PEP 517 already has this hook, it’s prepare_metadata_for_build_wheel. People wanting to reliably introspect metadata should use that - at least until there’s a standardised sdist format, after which introspecting the sdist is another option.
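As a rough sketch of what using that hook looks like: the helper below, `prepare_metadata`, is a hypothetical name, and it assumes the back-end is already importable in the current environment and that the current working directory is the source tree. A real front-end would also install the build requirements in an isolated environment, which is omitted here.

```python
import importlib
from pathlib import Path


def prepare_metadata(build_backend: str, metadata_directory: str) -> Path:
    """Call a back-end's PEP 517 metadata hook and return the .dist-info path.

    `build_backend` uses the "module.path" or "module.path:object" form
    found in the [build-system] table.
    """
    module_path, _, object_path = build_backend.partition(":")
    backend = importlib.import_module(module_path)
    for attr in filter(None, object_path.split(".")):
        backend = getattr(backend, attr)

    hook = getattr(backend, "prepare_metadata_for_build_wheel", None)
    if hook is None:
        # The hook is optional in PEP 517; the fallback of building a wheel
        # and reading its metadata is not shown in this sketch.
        raise NotImplementedError("back-end does not provide the metadata hook")

    # The hook returns the basename of the .dist-info directory it created.
    dist_info_name = hook(metadata_directory)
    return Path(metadata_directory) / dist_info_name
```

For example, from the root of a setuptools-based project one might call `prepare_metadata("setuptools.build_meta", some_temp_dir)` and then read the `METADATA` file inside the returned directory.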

As we’ve tried to indicate in the rationale, this PEP is defining how users write the metadata, so that backends can read it in a uniform manner. We acknowledge that people might use the data for other reasons, but it’s not the core purpose of the spec. The rationale section was recently revised to try to make that clearer - I don’t know if you’re reading the latest version, but if not check to see if that addresses your concerns at all.

FFY00 · June 29, 2020, 4:15pm

Yeah, of course, what I meant is that you cannot rely on the data being there.

Ah, right! I had forgotten that it could be used to achieve the same.

Yes, I am looking at the latest version. I am not opposed to how the current text is written, but I would like it to be clearer that most fields can be dynamic. If someone does not read the PEP in its entirety, I think it’s easy for them to miss that. Again, the current text is fine, I would just like for it to be a little clearer on this.

brettcannon · June 29, 2020, 10:41pm

This is structured how all PEPs are structured.

I tried to clarify this via PEP 621: clearly specify that metadata specified is static, but it's … · python/peps@7e4d254 · GitHub (I didn’t bother co-authors on this since it was just a clarification point of something the PEP already said).

brettcannon · June 29, 2020, 10:52pm

Due to how long this topic has already gotten, I have started PEP 621: how to specify dependencies? to explicitly discuss the one open issue in the PEP.

dustin · June 29, 2020, 10:59pm

Are you talking about the PEP, or the pyproject.toml file?

FFY00 · June 30, 2020, 3:00pm

Thanks! This addresses my concerns

brettcannon · July 1, 2020, 12:41am

I just committed a change where we reintroduce maintainers so there can be a separate discussion and potential PEP to deal with what the true differences between Author and Maintainer are (if any).

takluyver · July 1, 2020, 10:32am

I share the concerns other people have mentioned that this invites people to consume static metadata and ignore the possibility of it being dynamically generated. Yes, that’s technically against the spec, but given the broad confusion @pganssle pointed out, there’s an excellent chance people won’t know that. And it’s such an attractive shortcut - never mind about hooks, environment setup, installing build system dependencies; just read the information from a static file and you’re done!

So I can imagine that anything that 90% of projects specify statically becomes effectively mandatory as lots of smaller tools rely on it. Version would probably be safe from this because quite a lot of projects want to read it from somewhere else, but pretty much any other field could be affected.

I don’t know if this is necessarily bad overall. Many things would be easier with static metadata. But I think there are cases where you need to determine e.g. more specific runtime dependencies from the build step, and I don’t know how to ensure that when it’s probably only 0.1% of packages.

ncoghlan · July 1, 2020, 11:07am

One meta-comment on the PEP structure: it’s using the deprecated style where it tries to use the PEP itself as a living specification.

Instead of doing that, the PEP should point to a PR that adds the proposed specification for tooling developers to use to https://packaging.python.org/specifications/, while the PEP focuses on the meta-commentary role of explaining not only what’s included in the specification (and why), but also what you’ve deliberately chosen to leave out of the specification.

Trying to have one document that both provides a readable specification for tooling developers and also provides the rationale for why that specification is the way it is for the benefit of reviewers forces trade-offs that we don’t actually need to make anymore.

ncoghlan · July 1, 2020, 11:44am

Since I didn’t mention it earlier, I should note that I’m broadly in favour of this idea, but I share the concern raised by others that there are practical issues we need to work through to avoid encouraging the creation of artifact analysis tools that adopt an introspection approach that is simple, easy, and wrong.

I think the main way to tackle this would be for the PEP to explicitly allow build backends to mutate pyproject.toml when creating the sdist. I’m less worried about it for source directories, as there’s a simple self-selection process:

for use within a project, established projects simply won’t adopt tools that don’t support the metadata input format that they use, while fans of a particular tool are likely to be willing to adapt their metadata input practices to conform to its limitations
for broad analysis across multiple projects, tools already have to deal with all kinds of malformed input, so their authors aren’t likely to be tempted by attractive shortcuts when specs clearly spell out why the shortcut isn’t enough to cover the general case

In this initial iteration of the PEP, that could take the form of the following statement:

When build tools are constructing an sdist from a source directory they MUST delete the [project] table (if present) from pyproject.toml. A future PEP will cover a standardised mechanism that allows inclusion of static project metadata in an sdist when that metadata will be identical across all wheels and local package installations derived from the sdist.

As my current expectation is that any such future PEP would allow sdists to include metadata in a format that looks more like wheel and installation DB metadata, requiring build tools to delete the [project] table eliminates the potential for that table to become an attractive nuisance to authors of code that looks at sdists rather than source directories.

If we change our mind about that later, “build tools don’t need to delete the [project] table from pyproject.toml any more” is a much more manageable policy change than “ouch, there are all these already published sdists with confusing [project] tables that it’s now too late for us to do anything about”.