Please feel free to submit a PR for the editables package to implement editable installs without .pth files. I’ve not been able to come up with a viable approach, and no-one else has offered a solution, so if you have one it would be immensely useful.
It’s software, we can make anything work! But maybe we shouldn’t? Let’s think about it!
The upload mechanism is “clunky” enough that there’s a PEP devoted to what its next evolution might be - and a lengthy DPO thread - so I won’t weigh in on that here.
Today there’s very little “statefulness” evaluated between any given release - in fact warehouse has outstanding issues around distribution file metadata being dissimilar within a given release, much less subsequent ones.
I also think it’s interesting where we’re considering responsibilities to lie, and how that impacts the ecosystem. Adding a policy on PyPI to disallow most projects (but not all, since there are legitimate uses) from uploading a .pth file puts the responsibility on PyPI always being correct, with nothing done about it client-side or in the core implementation.
For my money, I’d much rather see a decision to disallow .pth files completely, in favor of {waves hands} some other safe startup loading mechanism {end wave}, and then we work through the 5-year mission of selectively allowing them on PyPI while the ecosystem works on a solution, after which we stop allowing them completely.
That’s a huge hand-wave there (as I’m sure you realise), since any startup loading mechanism is going to be just as unsafe. There’s nothing inherently unsafe about .pth files, but there is something inherently unsafe about running code that you didn’t explicitly invoke (and an argument about whether installing a package is “explicitly invoking” it - certainly for automated builds, where everyone bypasses confirmation prompts, it usually is).
I expect the only likely place we’ll end up is installers printing a big message “package ‘spam’ just registered to run on startup”, and hopefully we can find wording that makes it sound like a good thing when you replace “spam” with “coverage” and a bad thing when you replace it with something you didn’t expect. Whether the installer is looking for ^import .* in *.pth or some other mechanism doesn’t really change what it’s warning about.
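To make the `^import .*` check concrete, here is a minimal sketch of what an installer-side scan might look like. The function name `find_startup_code` is hypothetical; the only behaviour it relies on is that site.py executes .pth lines that begin with `import` followed by a space or tab, and treats other non-comment lines as path entries. It ignores details such as hidden files and encoding fallbacks.

```python
from pathlib import Path


def find_startup_code(site_dir):
    """Return (pth filename, line) pairs that site.py would execute at startup."""
    hits = []
    for pth in sorted(Path(site_dir).glob("*.pth")):
        text = pth.read_text(encoding="utf-8", errors="replace")
        for line in text.splitlines():
            # Lines starting with "import " or "import\t" are executed;
            # everything else is a comment, a blank line, or a path entry.
            if line.startswith(("import ", "import\t")):
                hits.append((pth.name, line.strip()))
    return hits
```

An installer could run something like this after unpacking a wheel and print the “just registered to run on startup” message for each hit.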
Absolutely - I can wave my hands with the best of them.
I’ve been thinking about this recently as well, in the context of setup.py also being executable code and a common vector for abuse - so I’d hope that whatever we come up with for .pth could also be extended to setup.py execution.
“Your software engineers were so preoccupied with whether or not they could…”
Indeed, the real question is almost always “should we?”
I choose the “and” option here. PyPI can consider steps. And the runtime can. And tools can.
I’m not even sure if pth files are the right place to focus. There are at least a few signals that MAY indicate a problem:
- pth files are added
- build backend changes
- sdist-only release
- introduction of new package dependencies
- release without attestations
- post-version release (maybe this counts?)
- replaces stdlib modules
I’m sure we could get creative and invent a big list of things. But actually doing the analysis to assess all packages would be a sizable new feature set for PyPI.
Also, note that individually these signals are mostly innocuous; not all are equally useful indicators of something fishy.
I press the point on the PyPI side only because it is a solution which doesn’t require handwaving on technical fronts. I’m just handwaving away logistical and funding challenges. You know, the really hard stuff.
Taking that to its conclusion, are you going to flag import loaders? Are you going to flag modifications to APIs? Yes, right now, it’s .pth files, but detecting that is not going to solve the problem. At best, it just shifts where the problem is noticed, while remaining within the same distribution mechanism.
This isn’t the responsibility of PyPI or of installers, and the kneejerk reactions to this don’t seem to be capable of preventing the next package compromise. Nothing about the suggestions here would prevent the next compromise from being successful, as none of the suggestions here improve the security posture of anyone involved. Installing packages without reviewing them is inherently a choice of blind faith, and the malicious code could just as easily be in any code path of the original library that is likely to be encountered.
I kind of doubt it, but that does point the way to something I’ve been thinking about. Much like we’ve replaced the executable setup.py with the declarative pyproject.toml, I think a possible way out of the “pth conundrum” is replacing the semi-executable pth file with a declarative configuration file.
Imagine a TOML file with two types of keys, one that names paths to extend sys.path with and another that names entry points to run at startup. Now, Python can parse these at startup before running any code, and we can layer policies[1] on top of those to decide which and when to enable them.
These two bits can even be designed and implemented separately. Start by working out what the TOML files look like, and implement parsing and processing them in site.py. I bet you could even get something like that into 3.15. Then, over time, work out whatever policy mechanism makes sense.
I really think we’re forgetting that it was a leaked token, not a .pth file, that allowed the malicious upload to PyPI. Given that people usually install packages to run them, making the malware auto-loadable to include the minority of people who install a package to let it sit forever dormant in their environment is nothing more than a cherry on top. I’m sure this malware author wouldn’t have missed that bonus feature if they couldn’t have it.
And I’m also sure that the next release hijacker can read a publicly available list of PyPI/pip/$tool enforced “ways to draw unwanted scrutiny to your release” and simply not do them. None of the criteria being suggested here for suspicious packages (possibly bar attestations being dropped) are things an attacker needs or will have any difficulty avoiding if they know to avoid them.
(That said, I do support wanting a better replacement to .pth files – particularly for the ones used by editable installs – just not from a security standpoint)
Given that the attackers also controlled the associated GitHub account as a result of the compromise, even attestations being dropped wouldn’t be a real barrier.
I wouldn’t mind something better, but I 100% agree that this isn’t something that should be viewed as a security measure, and shouldn’t be rushed. At most, it’s something for those building security scanning tools to take extra note of, and for projects that have valid uses for it to continue to use.
So convert a .pth file into a TOML file which separates the path extending from the code execution? As long as we make it as cheap to discover the file as it is for .pth files, I think that’s a reasonable idea. I’m assuming TOML so it’s easier for a person to audit (versus JSON, which is probably easier to parse overall but potentially harder to read)?
I think one benefit of the entry point approach is that it probably makes malware scanning easier, as I bet most scanners don’t check .pth files. As well, obscure-looking code in a .pth file isn’t unreasonable, since it has to fit on one line, while the same thing in project code is way more suspicious.
Are we serious enough about this idea to keep talking about it? And do we want a separate topic or just do it here?
Effectively yes. In my mind we also separate the processing of these TOML files into two phases: a discovery/parsing phase and an execution phase. That would let us do some interesting things[1] such as apply a policy to path extension and code execution.
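A minimal sketch of the two-phase split, assuming the files have already been parsed into plain dicts. Every name here (`StartupEntry`, `discover`, `apply_policy`, `paths_only`, and the `paths`/`entry_points` keys) is hypothetical; the real policy mechanism is exactly the part still to be worked out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StartupEntry:
    source: str  # file the entry was discovered in
    kind: str    # "path" or "entry_point"
    value: str


def discover(parsed_files):
    """Phase 1: flatten parsed configs into entries. Runs no package code."""
    entries = []
    for source in sorted(parsed_files):
        config = parsed_files[source]
        for path in config.get("paths", []):
            entries.append(StartupEntry(source, "path", path))
        for entry_point in config.get("entry_points", []):
            entries.append(StartupEntry(source, "entry_point", entry_point))
    return entries


def apply_policy(entries, allow):
    """Between phases: split entries into (enabled, skipped) via a policy callable."""
    enabled = [e for e in entries if allow(e)]
    skipped = [e for e in entries if not allow(e)]
    return enabled, skipped


def paths_only(entry):
    """Example policy: extend sys.path but never run startup code."""
    return entry.kind == "path"
```

Phase 2 would then act only on the `enabled` list, so a policy gets to see the complete picture before anything is executed.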
It should be as easy as .pth discovery, since in my mind, this TOML file would sit exactly where the .pth file would sit.
Yep. It would probably be generated by packaging tools, but I do think human readability[2] is important. Now that we have TOML parsing in the stdlib, I think this is a totally reasonable approach. We can even lazy import tomllib.
Yep!
I am! I’m actually working on a pre-PEP and a prototype to explore the schema and semantics, but I think it all falls out pretty naturally[3]. I can create a separate topic once I have something a bit more concrete to share, but reach out if you want to collaborate.
Why overcomplicate it so much? Why not just define a directory where any .py files are run on startup, and let wheels be able to install into it. That’s no better/worse than what we have today[1], but at least it’s nice and obvious and doesn’t require complicated searching/parsing.
[1] Which is why I’m personally in no hurry to change what we have.
Something like site-packages/__startup__/?
With the rules being along the lines of:
- *.pth files in that folder are processed in a new “strict” mode that only allows adding path entries (not code execution)
- *.py files in that folder are implicitly imported on startup (it’s effectively just a namespace package with a special name)
- both kinds of file would be processed in lexical order rather than file system order
- path files would be processed first, then startup modules imported
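A sketch of how those rules could be turned into a startup plan. The function name `plan_startup` is hypothetical, and raising on an `import` line is just one assumed reading of “strict” mode (silently skipping would be another). It only computes what would happen; it does not touch sys.path or import anything.

```python
from pathlib import Path


def plan_startup(startup_dir):
    """Return (path_entries, module_names) for a __startup__ directory."""
    root = Path(startup_dir)

    # Path files first, in lexical order of file name.
    path_entries = []
    for pth in sorted(root.glob("*.pth"), key=lambda p: p.name):
        for line in pth.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if line.startswith(("import ", "import\t")):
                # Assumed strict-mode behaviour: reject executable lines.
                raise ValueError(f"{pth.name}: code execution is not allowed")
            path_entries.append(line)

    # Then startup modules, also in lexical order.
    modules = [
        f"__startup__.{py.stem}"
        for py in sorted(root.glob("*.py"), key=lambda p: p.name)
    ]
    return path_entries, modules
```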
Seems plausible to me.
A couple of thoughts:
- By defining the ordering, you’ll get people wanting some sort of convention to allow packages to pick their priority. Which adds complexity that the current mechanism doesn’t support (and so clearly isn’t needed). It might be better to explicitly state that the order files are processed is undefined.
- One weakness of both this and the current scheme is that it’s difficult to determine which package a given file came from. Maybe have a per-package startup directory, in *.dist-info/startup? The downside is that scanning all those directories is (possibly significantly) slower.
What’s driving this change? Is this really any better than .pth files, for the reasons people are bringing this up now, since it allows exactly the same things, just split into two separate behaviors?
We’re talking about changing what’s allowed and how something works, what’s the intended benefit for a change that’s going to take years?
Changing the path is not necessarily innocuous, even without direct code execution. Examples of this were given above, and it’s possible to use this to override modules that are currently practically always imported (such as sys or re) in non-obvious ways.
Security scanning tools are going to need to handle .pth files either way.
There are various reasons why I do not want to rely on .py files for this purpose, which I am outlining in the PEP that is currently in draft state (and in my reference implementation draft PR). Once I resolve one or two little things I’ll publish it and post a new DPO thread. Stay tuned!
Of course, that doesn’t mean others can’t write a competing PEP!
What’s driving the proposal of “directory that gets run at startup” is Barry’s threat of a PEP that’s going to involve parsing TOML files instead in order to determine whether there’s anything to run and then perhaps not actually running it, which already reeks of complexity (especially for a core feature like this, as opposed to a packaging standard).
Significantly on some platforms and scenarios, for sure. The runtime currently doesn’t even consider *.dist-info directories, since they aren’t importable names, so it potentially doubles the number of directories to look at on startup.
I’m less concerned about determining which package a file came from. If someone needs to know, searching all RECORD files is not a prohibitive way to answer the question (though it would be if we were going to do it every time at startup).
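For reference, the RECORD search is only a few lines. This sketch (function name hypothetical) assumes the usual layout where each `*.dist-info/RECORD` is a CSV file whose first column is the installed path relative to the site directory.

```python
import csv
from pathlib import Path


def owners_of(site_dir, relative_path):
    """Return the dist-info directories whose RECORD lists relative_path."""
    owners = []
    for record in sorted(Path(site_dir).glob("*.dist-info/RECORD")):
        with record.open(newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                if row and row[0] == relative_path:
                    owners.append(record.parent.name)
                    break
    return owners
```

Fine as an on-demand query, e.g. `owners_of(site_packages, "__startup__/spam.py")`, but it reads every RECORD file, which is why it isn’t something to do on every startup.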
The biggest risk is packages inadvertently stomping on each other’s files, but even that I think is easily handled (“make your name unique”).
I really don’t see enough value from making this whole thing drastically more complex. Though it looks like Barry just posted his PEP, so maybe that’ll show the value…
I considered that, but with a nicer startup mechanism defined, it becomes more reasonable to define a startup dependency manager as a third-party utility that looks for its own dedicated entry point metadata rather than the startup components each registering their own startup code directly.
That applies to both the directory based idea and @barry’s TOML based idea (it even applies to the status quo).
I’m not sure what the use case for doing that implicitly on startup would be, but regardless, it doesn’t need to be baked into the core machinery.
Plenty of “systems” like this just end up with files named 00_my_module.<whatever>, 10_after_other_one... (see, e.g. OpenSSL’s build system generator, or the plumber section of the Yellow Pages for us old folk).
Provided you only specify one side of the equation, then the other will adapt to meet it. So if we want to constrain the names that may be chosen, we should unspecify the order, since authors now can’t influence it (I disagree with this approach). Or if we specify the order, we should leave the naming completely unrestricted, so that authors can take advantage of it.