PEP 723: Embedding pyproject.toml in single-file scripts

ofek · August 6, 2023, 6:07pm

The PEP has been rewritten to stand on its own rather than build atop 722.

PR: PEP 723: Embedding metadata in single-file scripts by ofek · Pull Request #3264 · python/peps · GitHub
Rendering: PEP 723 – Embedding pyproject.toml in single-file scripts | peps.python.org

AA-Turner · August 6, 2023, 6:20pm

jeanas · August 6, 2023, 7:03pm

I have a minor nit on “If a single-file script is not the sole input to a tool then behavior SHOULD NOT be altered based on the embedded metadata. For example, if a linter is invoked with the path to a directory, it SHOULD behave the same as if zero files had embedded metadata.” “Sole input” is a little ambiguous considering $linter file1.py file2.py. Should this read: “If the file is processed by a tool due to being part of a larger directory, …”?
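The two readings can be made concrete. This is a minimal sketch, assuming a hypothetical linter that receives its command-line paths as a list; neither function name comes from the PEP.

```python
from pathlib import Path


def sole_input(args: list[Path]) -> bool:
    """Strict reading: metadata counts only when one file is the entire input."""
    return len(args) == 1 and args[0].is_file()


def named_explicitly(args: list[Path], script: Path) -> bool:
    """Alternative reading: metadata counts when the file itself was passed,
    rather than discovered by walking a directory argument."""
    return script.is_file() and script in args
```

For `$linter file1.py file2.py`, `sole_input` is false while `named_explicitly` is true for each file, which is the gap the nit points at.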

jamestwebber · August 6, 2023, 8:05pm

I was kind of waffling on this for a bit but the current version is really nice and it feels clean to me.

The TOML document MUST NOT contain multi-line double-quoted strings, as that would conflict with the Python string containing the document. Single-quoted multi-line TOML strings may be used instead.

I’m a little unclear on what you are trying to avoid here. The terminology is a little confusing, but a standard double-quoted string (i.e. "foo") is totally fine inside of a triple-double-quoted ("""bar""") string. Of course trying to nest a triple-double-quoted string in there wouldn’t work, but it’s not useful toml and would break the script anyway. So what’s the danger here?

To me, any string that’s both a) a valid python string and b) valid TOML should be fine, and I’m not sure you need to restrict things more (except in terms of the actual contents).

edit: maybe you are worried about implicitly-concatenated strings, like the pathological

__pyproject__ = ("""[project]
"""
    """
dependencies = ["numpy"]
"""
)

But I think you can prevent this by not allowing the parentheses (a single string shouldn’t need them, and multiple strings are silly). I would prefer that restriction to “no double-quotes”, since I’d use them in my pyproject.toml files.

On that note, you should probably make it explicit that the value must be a string literal, i.e. no dynamic formatting is allowed.
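One way a tool could enforce "a single plain string literal, no parentheses, no formatting" is to look at the token stream instead of the evaluated value. The sketch below is an assumption about how such a check might look, not something the PEP specifies; `is_plain_literal_assignment` is a hypothetical name.

```python
import io
import tokenize


def is_plain_literal_assignment(source: str) -> bool:
    """True if the first column-0 `__pyproject__` is followed by exactly
    `=`, one triple-double-quoted string token, and the end of the statement."""
    tokens = [
        tok
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type not in (tokenize.COMMENT, tokenize.NL)
    ]
    for i, tok in enumerate(tokens):
        if tok.type == tokenize.NAME and tok.string == "__pyproject__" and tok.start[1] == 0:
            following = tokens[i + 1 : i + 4]
            return (
                len(following) == 3
                and following[0].type == tokenize.OP
                and following[0].string == "="
                # An f/r/b prefix or an f-string start token fails this check.
                and following[1].type == tokenize.STRING
                and following[1].string.startswith('"""')
                and following[2].type == tokenize.NEWLINE
            )
    return False


PLAIN = '''__pyproject__ = """
[project]
dependencies = ["numpy"]
"""
'''

PARENS = '''__pyproject__ = ("""[project]
"""
    """
dependencies = ["numpy"]
"""
)
'''

FORMATTED = '''name = "demo"
__pyproject__ = f"""
[project]
name = "{name}"
"""
'''

for sample in (PLAIN, PARENS, FORMATTED):
    print(is_plain_literal_assignment(sample))
```

Only `PLAIN` passes; the parenthesized concatenation and the f-string are both rejected.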

Non-script running tools MAY choose to read from their expected [tool] sub-table. If a single-file script is not the sole input to a tool then behavior SHOULD NOT be altered based on the embedded metadata. For example, if a linter is invoked with the path to a directory, it SHOULD behave the same as if zero files had embedded metadata.

I’m not convinced this is the right idea. Imagining the “directory full of one-off scripts” scenario, I think I’d want to run e.g. black and have it respect the metadata embedded in each of the scripts. Maybe I tweaked a formatting rule for one script because it made things easier to read, for instance.

Possibly the answer is “don’t run black on a bunch of unrelated tools” but in that case perhaps it should refuse to process my file, rather than applying defaults? I don’t know if this specific behavior should be specified in the PEP, or maybe it should be left undefined for tools to experiment with.

Note that this example used a library that preserves TOML formatting. This is not a requirement for editing by any means but rather is a “nice to have” especially since there are unlikely to be embedded comments.

Per recent discussions in the PEP 722 thread, I think embedded comments are going to be a common request. But writing a comment-preserving automatic TOML editor isn’t required for this PEP, I think.

BrenBarn · August 6, 2023, 8:07pm

I guess for me this has some of the same problems as PEP 722, and fixes some problems, and adds some new ones.

At the most basic level I still remain unconvinced that it is better to put the dependency or metadata information directly in the file rather than in a separate file. (I’m not convinced by the /usr/bin argument as I think there are usually better ways to do things than copying files in there.) And it is much easier to avoid complications if the specification is about a single line in the script file that points to an external metadata file (or some kind of automatic deriving of the latter from the former) than if the script file has to have a sub-language inside it (whether that’s TOML or the ad-hoc format of PEP 722).

On the plus side, I do prefer using TOML for this as it leverages what seems the best current way of specifying dependencies, and crucially allows adding other metadata. It also leaves things much more open for tools to use various kinds of metadata in the future, rather than narrowly focusing just on dependencies.

I’m a bit confused by the part of the PEP that specifies how the embedded metadata TOML differs from regular pyproject.toml. Specifically prohibiting the build-system table and saying we may standardize it later seems a bit odd. It seems to preclude the possibility that tools may decide how to handle it and then that behavior could later be standardized. Was that the intent (i.e., to ensure any later standard isn’t constrained by tool behavior in the wild)?

The bit about a script being the “sole input” to a tool is also a bit murky for me. I think I get the intent of it, but I’m not sure the way it’s phrased covers all cases or leaves out all cases that shouldn’t be covered. Are there other kinds of tools besides linters for which it might make sense to always take account of embedded metadata, or never do so?

Finally, I think there are some potential ambiguities in the syntax. The text spec says the script “may assign a variable”, but the regex includes the additional stipulation that the __pyproject__ name appear directly in the first column on the line. The spec does say that the text takes precedence, but this kind of thing may warrant a bit of thought. For instance, I can easily foresee people putting the embedded metadata inside an if __name__ == "__main__" block (whether on purpose or just without thinking about it much). It could be simpler to explicitly say the assignment must occur at the beginning of a line.

Similarly, what happens if someone (deliberately or accidentally) assigns twice to the name __pyproject__? To do so is legal Python but the PEP doesn’t explain how it would be handled.
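Both questions are easy to reproduce with a line-anchored pattern. The pattern below is a hypothetical one written in the spirit of the PEP's regex, not a copy of it, and rejecting duplicates is an assumed policy since the PEP text quoted here does not define one.

```python
import re

# Hypothetical pattern: assignment must start in column 0, closing quotes alone on a line.
PATTERN = re.compile(r'(?ms)^__pyproject__ *= *"""$(.+?)^"""$')


def extract(source: str):
    """Return the embedded TOML text, None if absent, or raise on duplicates."""
    blocks = [m.group(1) for m in PATTERN.finditer(source)]
    if len(blocks) > 1:
        raise ValueError("more than one __pyproject__ assignment")  # assumed policy
    return blocks[0] if blocks else None


TOP_LEVEL = '''__pyproject__ = """
[project]
dependencies = ["requests"]
"""
'''

INDENTED = '''if __name__ == "__main__":
    __pyproject__ = """
    [project]
    dependencies = ["requests"]
    """
'''

print(extract(TOP_LEVEL) is not None)  # found
print(extract(INDENTED) is None)       # legal Python, invisible to the pattern
```

Passing `TOP_LEVEL + "\n" + TOP_LEVEL` raises `ValueError` under this policy; a tool could just as well take the first or the last match, which is why the spec needs to say.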

jeanas · August 6, 2023, 8:17pm

Certainly this: Addendum for PEP 722 to use TOML - #6 by abravalheri

jamestwebber · August 6, 2023, 8:38pm

Today I learned you only need a single backslash to escape a triple-quote! That makes sense then. I guess I was misusing terminology anyway, as a double-quoted string that uses an escape to span multiple lines isn’t a “multi-line double-quoted string”. That’s what I was concerned about.
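A small check of both cases, as a sketch: it asks Python itself what the assignment evaluates to. The helper names are hypothetical, and the triple quote is built indirectly only to keep the samples readable.

```python
import ast

TQ = '"' * 3  # three double quotes, built indirectly so the samples stay readable


def embedded_value(source: str):
    """Return the string assigned to __pyproject__, or None if the script fails to parse."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return None
    for node in tree.body:
        if isinstance(node, ast.Assign) and any(
            isinstance(t, ast.Name) and t.id == "__pyproject__" for t in node.targets
        ):
            return ast.literal_eval(node.value)
    return None


def script(toml_body: str) -> str:
    """Wrap a TOML body in a module-level __pyproject__ assignment."""
    return f"__pyproject__ = {TQ}\n{toml_body}\n{TQ}\n"


# TOML multi-line basic string written as-is: its quotes end the Python string early.
unescaped = script(f"[project]\ndescription = {TQ}\nfirst line\nsecond line\n{TQ}")
# Same TOML with one backslash in front of each inner triple quote.
escaped = script(f"[project]\ndescription = \\{TQ}\nfirst line\nsecond line\n\\{TQ}")

print(embedded_value(unescaped))  # None: the script is a SyntaxError
print(embedded_value(escaped))    # the TOML text, with plain triple quotes inside
```

Note that a tool reading the source text with a regex, rather than evaluating it, would see the backslashes in the escaped variant, so the escape helps Python but not necessarily such a tool.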

ntessore · August 6, 2023, 9:02pm

On the point of restricting the contents of the embedded metadata, I think the proposal is right in focussing on the “run scripts with dependencies” angle as an established use case, but should not close the door on other use cases, which I imagine will crop up almost immediately should the PEP be accepted. If the PEP states e.g. that build system metadata “must not” be embedded, how will we find use cases for a potential further PEP in that direction, mentioned in that same breath? I think it would be better to say that any behaviour not described in the PEP is undefined for the time being, but not flat out disallowed.

Anecdata

For example, I have used the exact format being proposed here for a plugin system that used entry points and single source files. The embedded metadata was parsed in a first step, and extracted into a regular pyproject.toml file for installation with standard tools in a second step as needed. This very effectively solved the constraints of that particular situation, but would have been disallowed by the proposed restrictions.

jeanas · August 6, 2023, 9:15pm

In your specific plugin system use case, I guess you could just let the tool add [build-system] while converting to a source tree, just like you might already have to add name and version?
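A conversion step along those lines could look roughly like this. It is a sketch under stated assumptions: setuptools is picked as the backend purely for illustration (the thread does not name one), the embedded text is read via `ast` for brevity rather than with the PEP's regex, and the function names are hypothetical.

```python
import ast

# Assumption: setuptools as the build backend; any PEP 517 backend could stand here.
BUILD_SYSTEM = """\
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
"""


def read_embedded(source: str) -> str:
    """Return the string assigned to a module-level __pyproject__."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign) and any(
            isinstance(t, ast.Name) and t.id == "__pyproject__" for t in node.targets
        ):
            value = ast.literal_eval(node.value)
            if isinstance(value, str):
                return value
    raise ValueError("no __pyproject__ string found")


def to_pyproject_toml(source: str) -> str:
    """Prepend a [build-system] table to the script's embedded metadata."""
    return BUILD_SYSTEM + "\n" + read_embedded(source).strip() + "\n"
```

The result can be written to `pyproject.toml` next to a copy of the script. The sketch does not guard against an embedded document that already contains `[build-system]`, which would produce a duplicate table.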

If you wanted to experiment with building PEP 723 style scripts and you wanted the build backend to be specifiable, you could just use a [tool.$tool] table where you put the same information. [1]

Edit: I think the advantage of this point of the PEP (“MUST NOT contain a build-system table”) is that there is a strong expectation that metadata outside the tool table is standardized. That seems like a useful aspect to keep.

[1] I doubt that would provide much value though, because there aren’t 100 ways to build a single-file project. The major build backends for pure Python code differ in things like how they specify included/ignored files, or whether they allow plugins. Ok, the latter could be useful in the abstract, but I can’t imagine a use case where there is also a strong motivation to keep the project single-file.

pf_moore · August 6, 2023, 9:17pm

But this does make me wonder - is the following valid?

__pyproject__ = """
[project]
""" \
"""
dependencies = ["requests"]
"""

And if not, then where precisely in the PEP does it say it’s not allowed? Yes, I know nobody should ever do this. It’s a pathological edge case. But every time we’ve accepted a PEP on the basis that “people should be sensible” we’ve had problems, because someone hasn’t been sensible.

My point here is simply that the spec as written isn’t sufficiently precise. It’s not a fatal problem, and in fact it could be fixed by the simple expedient of declaring the regex given in the PEP as the formal definition of how to parse the script for the data. (Well, no, for example it still needs to state what happens if there are two valid assignments to __pyproject__, but that’s a separate issue…) But it does need to be fixed if the PEP is to be usable as a specification (IMO).
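To illustrate the imprecision, the sketch below compares what Python evaluates for that script with what a line-anchored regex captures. The pattern is a hypothetical stand-in written in the spirit of the PEP's regex, not a copy of it.

```python
import ast
import re

# Hypothetical line-anchored pattern, assumed for illustration.
PATTERN = re.compile(r'(?ms)^__pyproject__ *= *"""$(.+?)^"""$')

# The pathological script from above, kept verbatim in a raw string.
SCRIPT = r'''__pyproject__ = """
[project]
""" \
"""
dependencies = ["requests"]
"""
'''

# What Python sees: the two literals are concatenated into one string.
python_value = ast.literal_eval(ast.parse(SCRIPT).body[0].value)

# What the regex sees: everything up to the first line that is exactly `"""`.
regex_value = PATTERN.search(SCRIPT).group(1)

print(repr(python_value))
print(repr(regex_value))
```

Python yields a document that includes the `dependencies` line, while the regex capture stops early and contains a stray `""" \` line, so the two definitions disagree on this input.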

jeanas · August 6, 2023, 9:32pm

Another comment:

Non-script running tools MAY choose to read from their expected [tool] sub-table.

I wonder if it would make sense to add language like:

“Tools serving purposes unrelated to packaging (such as linters or code formatters, but not build frontends) which accept both single files and directories, and which can be configured through the [tool.toolname] table of a pyproject.toml file when invoked on a directory, SHOULD also be configurable via [tool.toolname] in the __pyproject__ of a single file.”

Basically strengthening MAY to SHOULD, excluding build frontends.

Otherwise, it’s not fully clear from the PEP whether Black, Mypy, Ruff, Pylint, etc., are officially encouraged to read inline __pyproject__ config (I personally think they should be), or whether it’s just an option at their discretion.

pf_moore · August 6, 2023, 9:50pm

Has anyone asked any of those tools whether they support this idea? It seems rather important to make sure that the intended users of this option are interested. As well as asking them how they would feel about the question of “only use the embedded data if you’re looking at just the file containing it and not if it’s part of a directory”. If I were a tool maker, I’d be very reluctant to support something like that, so assuming it’s fine just because the PEP says so seems optimistic at best.

ofek · August 6, 2023, 10:32pm

Yes, maintainers will respond soon!

thejcannon · August 6, 2023, 10:39pm

Speaking on behalf of Pex and Pantsbuild, we will support whatever decision is accepted, and neither seems technically infeasible.

This includes if later on, additional facilities are introduced on top of these (e.g. other metadata to treat single scripts as packages, or tool configs embedded in the toml, or replacing the dependencies with a locked set).

hauntsaninja · August 6, 2023, 11:37pm

(as a maintainer of mypy)

For mypy, I’d be supportive of using embedded metadata to configure mypy for a single script. I think this would create a more consistent configuration experience for mypy. The set of CLI options or inline comments is not as expressive as what you can do with a config file (and you can’t check CLI options into the same file).

Like others, I was surprised by the “only use the embedded data if you’re looking at just the file containing it and not if it’s part of a directory”. This would solve some issues for mypy (there are some things you can only configure globally in a single mypy invocation), but will certainly cause surprises for users and limits the applicability to just a single script case. On net, I think this prescription is probably undesirable — in the code quality ecosystem, people seem to really like having a single invocation of a tool running on all the things, and integrations like pre-commit expect this.

I definitely like having more structured per-file configuration though. The current state of the art is special comments and I think this could work better. It’s often not expressive enough. It’s not special enough to avoid common mistakes, for instance, if you accidentally delete some code and end up with a module level # type: ignore, you’re going to be unhappy. Here’s a recent example of a similar issue in ruff: Ruff v0.0.281

(as a maintainer of Black)

I’m new to maintaining Black and Black is a project that actively dissuades its users from configuration, so I need to think more before I say anything.

(general comments)

Overall, I was surprised by how much people like putting tool-specific configuration in pyproject.toml. Reducing the number of configuration dialects and extra files seems to have been valuable; I’m not sure the authors of PEP 518 anticipated this. PEP 518 prescribes almost nothing, and certainly did not coordinate with non build tools in the manner we’re doing now, but it created a Schelling point and the community ran with it.

Since I already prefer the embedded TOML format for the core dependencies use case, I think it’s a bonus that PEP 723 allows for further serendipity in this space.

FRidh · August 7, 2023, 6:37am

Thanks for writing this draft so quickly, I like it! My preference is still somewhat for a block comment, as I thought it would be easier for tools to store their lock files that way as well, but I understand the motivation against it. Tools that want to store their lock file could support doing that in the same pyproject.toml as well, although they would lose a level of nesting.

The risk here is part of the functionality of the tool being used to run the script, and as such should already be addressed by the tool itself. The only additional risk introduced by this PEP is if an untrusted script with embedded metadata is run, when a potentially malicious dependency might be installed. This risk is addressed by the normal good practice of reviewing code before running it.

It may be worth mentioning here that further locking could be done by specific tools that would use additional (optional) metadata.

pf_moore · August 7, 2023, 3:48pm

Two other pyproject.toml fields that might need a rethink for the embedding case are readme and license, both of which (in the existing case) refer to external files. If scripts choose to store this data, they will almost certainly want it to be embedded in the script (for all the same reasons they don’t want a separate pyproject.toml file).

How will this proposal support embedded readme and license data?

jeanas · August 7, 2023, 3:59pm

readme = {text = "Foo bar baz read me.", content-type = "text/markdown"} and license = {text = "BSD 3-clause license"} are valid per the spec.

pf_moore · August 7, 2023, 4:07pm

Ah, I’d missed that (I did check, honest!)

Even so, readme text is often substantial, and I’d imagine people would either want to reference the script docstring, or use TOML triple-quoted strings, to include substantial blocks of text.

Licenses (if present) are typically added as a big chunk of boilerplate comment - especially if this is some sort of corporate environment (“All rights reserved, you can’t use this for other than the stated purpose without permission, …”) I don’t imagine a legal department would be too happy with summarising that as a one-liner, and even reformatting as anything other than a comment block might be problematic.

Certainly in the environments I worked in, I’d be very wary of adding a license like this.

jamestwebber · August 7, 2023, 4:25pm

They can use single-quoted string literals, at least. If their README is getting so complicated that they want to nest multiline strings inside it, or do something else that isn’t allowed in a string literal, that is probably a sign that they shouldn’t be trying to keep everything in one file.

More broadly on that point, it’s probably good if any solution for single-file metadata has some idea of when it’s potentially harmful to use it. It would be a shame if enabling this reduced the use of real packaging tools in favor of ten-thousand-line monstrosities.

This seems like something that’s totally up to the user, though, as it is now with pyproject.toml.

For many single-file scripts I’ve seen, the license is embedded in a comment at the top. One could still do that, and then use the OSI shorthand in the toml.