PEP 722: Dependency specification for single-file scripts

Part of the goal for my use case is that, if the tool is fast enough, I don’t need to build an environment with the dependencies myself as the developer. I’m not sure there’s much to gain from mtime caching over just parsing the current block (it takes something like 0.2 ms to actually parse), and it wouldn’t allow me to share environments between scripts with the same dependencies.


I’m using python -X importtime -c "import tomllib", which gives a nice breakdown of all of the modules being pulled in on import. I’m also using hyperfine for rough timing, including the time taken for Python to launch.
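For reference, -X importtime prints its breakdown to stderr in this shape (the numbers below are illustrative, not real measurements):

import time: self [us] | cumulative | imported package
import time:       150 |        150 |     types
import time:       310 |        310 |     string
import time:       680 |       1140 |   tomllib._parser
import time:       120 |       1260 | tomllib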

I’m fairly sure that your command imports it once and then just looks it up in sys.modules for every subsequent iteration.

3 Likes

I was thinking you could parse on any cache miss, and then cache the link from script to env. So if you parse a new script’s requirements and already have a compatible environment, you could re-use it and save that link.

I don’t know if this is a viable design in your case, just seemed like something I’d want to do if startup time was a concern here.
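For illustration, a minimal sketch of that flow (parse_requirements, find_compatible_env and build_env are hypothetical helpers, and the in-memory dict stands in for a persistent cache):

import os

_cache: dict[tuple[str, float], str] = {}  # (script path, mtime) -> env directory

def env_for(script: str) -> str:
    key = (script, os.stat(script).st_mtime)
    if key in _cache:
        return _cache[key]               # hit: no parsing at all
    reqs = parse_requirements(script)    # hypothetical: only runs on a miss
    env = find_compatible_env(reqs)      # hypothetical: reuse a matching env...
    if env is None:
        env = build_env(reqs)            # hypothetical: ...or build a fresh one
    _cache[key] = env                    # save the script -> env link
    return env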

I always do the following (and run it a few times):

python -m timeit -n 1 -r 1 "import ..."

You could simply cache a file path => TOML string dict if that is the approach you choose.
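A minimal sketch of that, keyed on (path, mtime) so edits invalidate the entry (names here are illustrative):

import os

_texts: dict[tuple[str, float], str] = {}

def toml_text(path: str) -> str:
    key = (path, os.stat(path).st_mtime)
    if key not in _texts:
        with open(path, encoding="utf-8") as f:
            _texts[key] = f.read()
    return _texts[key]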

Case in point. After installing pytomlpp in a venv, I can observe an import overhead that is about twice the time to start Python itself, which is in line with your argument.

$ hyperfine --warmup 10 "python -c ''"
Benchmark 1: python -c ''
  Time (mean ± σ):       8.1 ms ±   0.9 ms    [User: 7.8 ms, System: 0.8 ms]
  Range (min … max):     7.0 ms …  11.4 ms    210 runs
(venv) ~/snakerun $ hyperfine --warmup 10 "python -c 'import pytomlpp'"
Benchmark 1: python -c 'import pytomlpp'
  Time (mean ± σ):      23.8 ms ±   1.1 ms    [User: 19.0 ms, System: 4.6 ms]
  Range (min … max):    22.1 ms …  26.6 ms    116 runs

But once I change the start of pytomlpp/_io.py from

import os
from typing import Any, BinaryIO, Dict, TextIO, Union

from . import _impl

FilePathOrObject = Union[str, TextIO, BinaryIO, os.PathLike]

to

from __future__ import annotations
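# annotations are now stored as strings and never evaluated, so the typing names below are unused at runtime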

import os
# from typing import Any, BinaryIO, Dict, TextIO, Union

from . import _impl

#FilePathOrObject = Union[str, TextIO, BinaryIO, os.PathLike]

I get

(venv) ~/snakerun $ hyperfine --warmup 10 "python -c 'import pytomlpp'"
Benchmark 1: python -c 'import pytomlpp'
  Time (mean ± σ):      12.4 ms ±   0.9 ms    [User: 9.6 ms, System: 2.7 ms]
  Range (min … max):    11.2 ms …  16.5 ms    223 runs

So by shipping its type hints separately you would reduce the import time to ~half of the time to start up Python. And I did this without looking deeper into any of what pytomlpp does.

Bottom line: you spent a little time and effort optimizing your code for startup time (e.g., you say you didn’t use regular expressions for that reason). You could also put a little time and effort into contributing import performance improvements to one of these libraries (for example the typing change in pytomlpp, or refactoring tomllib, which, from a cursory glance, uses regular expressions and compiles them on import).
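(On the regex point: the usual trick is to defer compilation to first use. A minimal sketch, using a placeholder pattern rather than tomllib’s actual regexes:)

import functools
import re

@functools.cache
def _pattern():
    # placeholder regex, compiled on first call instead of at import time
    return re.compile(r"\d{4}-\d{2}-\d{2}")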

(Sorry, I originally posted a draft of this by accident.)

2 Likes

It’s fair, but I’m not concerned with performance at the sub-millisecond level for this specific task. I’m actually somewhat concerned that if I implemented this, the extra lookups would take just as long in the end.


I’m perfectly willing to put time and effort into contributing performance improvements where I believe it is appropriate to do so. [example]

I do not believe either of your suggestions is an appropriate recommendation. Both would add an extra maintenance burden for the developers with little[1] or no[2] noticeable performance gain.


  1. Typing is widespread enough that removing import typing from one module probably won’t make a difference for most projects, as it’s being imported somewhere else anyway. Maintaining separate type stub files that stay accurate is a non-trivial maintenance burden for development, though. ↩︎

  2. The tomllib regexes are used in the main function of the library; the import performance boost would be lost as soon as you actually parse a TOML file, which is the whole point of importing the library. ↩︎

I’d suggest that discussions on how to optimise this (or any) particular implementation are off-topic here. The basic message is that “for some applications, TOML parsing has a non-trivial impact on performance”. And that’s relevant because startup time for a simple script is important - this has been noted on many occasions.

Whether script runners can be optimised, or caching can improve performance, is simply demonstrating that “it’s harder to write a performant script runner with TOML data than with a simpler structure”.

And even then, whether the difference matters is something that will ultimately be decided by choosing one proposal over the other. Not by people demonstrating that optimisation is possible.

4 Likes

To lean into the “it’s an optimization thing” angle: if you don’t already cache, then I suspect the TOML reading is minor compared to the network overhead of communicating with PyPI.

4 Likes

(I’m skirting being pedantic here, but) I think there does exist a world where we use TOML for more than just the dependencies, and therefore you’d be forced to re-read the TOML for the equivalent subset of information. [1]


  1. I’m pro TOML, but want to be the most precise here ↩︎

2 Likes

This was from a caching perspective. My hope is to be able to cache based on the specification extracted from the block, rather than on the raw text or some other measure that’s not as directly correlated with the environment that’s going to be created.

Obtaining the specification from PEP 722 is relatively straightforward and fast; using a TOML format requires more work from the parser. Either I accept that performance hit or find some other element to base the caching on[1]. The difference is relatively small in the context of existing tools like pipx, hatch or even pip-run, so perhaps it doesn’t matter.
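For concreteness, the sort of cache key I have in mind might look like this (the normalization is illustrative, not what any existing tool does):

import hashlib

def env_key(requirements: list[str]) -> str:
    # normalize so cosmetic differences between scripts map to the same environment
    normalized = sorted(req.strip().lower() for req in requirements)
    return hashlib.sha256("\n".join(normalized).encode()).hexdigest()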

Anyway, this was mostly intended as a way of expressing a preference for the plain-list-of-dependencies proposal and explaining why. Apologies for the performance tangent this led to.


  1. or write it in another language. ↩︎

3 Likes

OK, I’ve added those two items to the PEP.

Also, I’ve come to the conclusion that @ofek and I are unlikely to reach any sort of compromise on a combined PEP. I think our goals are simply too far apart for us to agree on something that we’ll both be happy with. Also, I think that enough people have commented on this thread in support of a simple comment-based format that it wouldn’t be fair to simply switch to a TOML-based format - and the only way I can see to avoid that is to submit PEP 722 as it stands for approval. So I’ll leave @ofek to finalise PEP 723, and it’ll then be up to @brettcannon to make the final decision.

As I’m going to be away next week, I don’t expect to make any further changes to PEP 722, or even to be following any further discussions, so this can be considered the final yes-actually-he-means-it-this-time version of PEP 722 :slightly_smiling_face:.

7 Likes

Oh, I thought we were close to agreement. No worries I will update 723 over the weekend!

2 Likes

I felt like I was just conceding more and more, and when I stopped to think about it, I had given up more than I was actually happy with: the TOML format, the parallels with pyproject.toml (implied rather than explicit, but still there), not deferring requires-python to a later spec, etc. And none of the objections I raised in PEP 722 had actually been addressed; they’d simply been ignored in the interests of compromise.

And on the level of intention, it felt like I wanted to standardise existing practice, and you wanted to design a new feature, and I think that’s an important distinction which we’d made no progress on resolving.

Ultimately, I guess I felt that I’d moved my position far more than you had, and that didn’t feel right.

But regardless, thanks for being open to the idea of a compromise solution. I hope you can come up with a good final version of 723, one that makes your arguments the way you want them made. And then it’s up to Brett.

8 Likes

I am worried that there’s a chance we miss the forest for the trees when “choosing” between PEP 722/723.

I voiced this in PEP 723: Embedding pyproject.toml in single-file scripts which got split to PEP 723: use a new `[run]` table instead of `[project]`?, but it applies here as well.

Speaking personally, if I had to choose between these two PEPs I would likely settle on PEP 722, yet it might not be as robust or flexible as a solution that finds a way to embed structured metadata into the script (one that isn’t labeled as “pyproject.toml”, as I know you and others have concerns). Then we’re forced either to find a way to “extend” the embedded-metadata approach of PEP 722, or to invent a new one to replace or live alongside PEP 722 [1] once we want to embed more.

I honestly have the capacity and stamina to co-author a PEP laying the groundwork, taking into account everything said here (and in the other 3 or 4 parallel discussions), but I also know that would be somewhat rude and stressful to you (Paul), Ofek, and Brett. So I’ll just voice my concern, and won’t pursue that unless asked by one of y’all.

EDIT: (I’m backpedaling on the “accidental” wording here, and the expectation of this syntax being extended after re-reading the PEP)


Regardless of my concern, I’m personally excited that we’re addressing a gap in support for what I perceive to be a decent chunk of Python’s usage. So :tada:
And professionally, I’ll parrot myself from the PEP 723 discussion

Speaking on behalf of Pex and Pantsbuild, we will support whatever decision is accepted, and neither seem technically infeasible.


  1. and then people lob tomatoes and scream fragmentation. ↩︎

This is the crux I think (not picking on you personally but using this message to point this out):
PEP 722’s goal is not to embed metadata into a script. It just wants to allow script authors to write down their dependencies so that script runners can make them available.

3 Likes

Maybe it’s pedantic, but that most certainly is metadata (whether it’s called that on the tin or not).

From Wikipedia:

Metadata (or metainformation) is “data that provides information about other data”,[1] but not the content of the data

The dependencies of a script is data that provides information about the script.

…Actually, it IS called that on the tin! The PEP itself in the “Rationale” section:

We define the concept of a metadata block that contains information about a script. The only type of metadata defined here is dependency information, but making the concept general allows expansion in the future, should it be needed.

(That’s not even my emphasis, “metadata” is already emphasized in the PEP)


And that’s the entire point of my concern… We accidentally standardized a way to embed metadata.

EDIT: Although, from the PEP I should backpedal the “accidental” part. I’ll own up to that.

3 Likes

(I need to make a dedicated comment for this, apologies for the noise)

I’d like to rephrase my concern (and apologize for my earlier wording), since I see now we’re explicitly standardizing a way of embedding metadata.

My concern is that we standardize this way of embedding metadata through the lens of this use-case and not others. So still the same forest/trees concern, but certainly not accidental (and again, my apologies)

1 Like

I am concerned about that too. However, @brettcannon indicated on a related thread that there will be solicitation of user feedback in the process, and that it remains quite possible that that feedback will indicate a solution that differs from both proposals. That alleviates my worries somewhat.[1]


  1. I do still think it would often be useful - not just for this PEP but for many ideas past and future - to get such user feedback at an earlier stage and use it to inform the initial drafting of a PEP. Still, getting it and taking account of it at any stage is valuable. ↩︎

2 Likes

This is absolutely correct.

To expand a little, PEP 722’s goal is to standardise existing practice in the area of allowing script authors to write down what distributions they need available in order to run.

It seems to me that PEP 723 is trying more to address the question of “how do we expand the idea of metadata (pyproject.toml style) to cover single Python files”. That’s a much bigger question, and while I think it’s potentially worthwhile to address, I don’t think we yet have sufficient evidence that people need that capability[1], and I definitely don’t think we’re even close to a good idea of what such an expansion should look like.

I understand that some people are uncomfortable with small, incremental improvements, as they are concerned that such changes ignore the longer term. I disagree with this, personally. My view is that incremental change is crucial if we are to progress - endlessly debating the “long term” will simply drain everyone’s energy, and responding to every proposal with “but what about the bigger picture” ignores the reality of how volunteer-driven projects tend to progress, which is in terms of small, focused PRs and closely scoped feature developments.


  1. Outside of the limited situation where it addresses the “running a script with its dependencies” case. ↩︎

5 Likes

You’re reading an out-of-date version of PEP 722. The current version removed the concept of a “metadata block”. This was all covered earlier in the thread, although I can understand if you missed it - there’s a lot to keep up with (I should know, I’m exhausted!)

I should’ve known better than to take the link from the OP at the very top.

So I redact my redaction? This is getting confusing :sweat_smile:

I think everyone gets the gist (I hope). I’ll tap out before I hit strike 3.