PEP 648: Extensible customizations of the interpreter at startup

mariocj89 · December 30, 2020, 1:02pm

Hi All,

After a quiet conversation in python-ideas, I’ve submitted a new PEP, sponsored by @pablogsal. In short, it allows Python installations to be customized in an extensible manner.

Please provide feedback in this thread.

PR with the PEP: https://github.com/python/peps/pull/1752

Thanks!

thystra · December 31, 2020, 1:49am

Are there any security implications here for malware startup code injection?
Is it worthwhile to have a “Security” section to discuss this explicitly? Even if the answer for the section is “No implications for security”?

ldorigo · December 31, 2020, 8:34am

Hi! I think this is a good idea and don’t have any feedback on the PEP per se, just wanted to note that in a couple of places you wrote “sitecustomze” rather than “sitecustomize”, which was slightly confusing.

Also,

ldorigo · December 31, 2020, 8:38am

*…also, the section “Script naming convention” seems to be missing something (“even if they might likely” what?).

mariocj89 · December 31, 2020, 9:04am

the section “Script naming convention” seems to be missing something

Thanks! I’m fixing that.

a couple of places you wrote “ sitecustomze ”

Wow! That was silly. Thanks for pointing it out, I’m fixing it right now.

Are there any security implications here for malware startup code injection?

Really good point indeed. I’d like some guidance as I don’t have a lot of expertise here. I agree that we should add something. Let me check some other PEPs and take inspiration. In short, I believe this is not lowering the security given that we have pth files and sitecustomize already, but we should properly evaluate it as this is a new feature and potentially offer options to minimize any potential threat.

steve.dower · December 31, 2020, 10:46am

On my phone, so copy paste from github is terrible, but here are some initial thoughts.

Security wise, it will make things worse. Namespace packages are merged with all matching names on sys.path, and any actual package will hide the whole thing. Better to just make it a directory of scripts and avoid relying on sys.path to find it. (The directory needs an entry in sysconfig, or to be defined relative to an entry in there.)
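To make the namespace-package concern concrete, here is a minimal sketch. The package name `demo_startup_hooks` and the directory layout are made up for illustration. It builds two namespace portions and, optionally, a regular package of the same name placed later on sys.path, then lists which submodules are visible.

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path


def make_tree(root: Path, layout: dict) -> None:
    """Create files under root; layout maps relative paths to file contents."""
    for rel, text in layout.items():
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text)


def visible_modules(with_regular_package: bool) -> list:
    """Return the submodule names reachable through the demo package."""
    pkg = "demo_startup_hooks"  # hypothetical name, not from the PEP
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Two directories without __init__.py: namespace portions.
        make_tree(root / "a", {f"{pkg}/one.py": ""})
        make_tree(root / "b", {f"{pkg}/two.py": ""})
        entries = [str(root / "a"), str(root / "b")]
        if with_regular_package:
            # A regular package (has __init__.py), placed AFTER the portions.
            make_tree(root / "c", {f"{pkg}/__init__.py": "", f"{pkg}/other.py": ""})
            entries.append(str(root / "c"))
        saved_path = list(sys.path)
        sys.path[:0] = entries
        importlib.invalidate_caches()
        try:
            package = importlib.import_module(pkg)
            return sorted(m.name for m in pkgutil.iter_modules(package.__path__))
        finally:
            # Undo the changes so the demo leaves the interpreter as it was.
            sys.path[:] = saved_path
            sys.modules.pop(pkg, None)


print(visible_modules(False))  # portions from both directories are merged
print(visible_modules(True))   # the regular package hides both portions
```

Note that the regular package wins even though it comes later on sys.path than the namespace portions, so any writable sys.path entry could replace the whole set of hooks.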

Also, it would be nice to have a way for sitecustomize.py to suppress importing the namespace package. Probably something to call or set in site.py. That way controlled Python installs can lock down a single file, still allow package installs, and not risk arbitrary code execution.

Pip has concerns about code that is imported on startup, because it can lock files and prevent update or uninstall. Using -S isn’t an option, because that would omit pip itself, so making sure that “allow site but without customizations” is relatively easy to achieve will probably be necessary.

You keep saying “import the namespace package and execute any scripts found”. If you change to a normal directory, ignore this, but there are subtle differences between importing a module and executing a file, particularly for relative imports from that module. Please be clear whether each file will be imported or merely have its code executed (I would vote for the latter).
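One of the differences between importing and executing can be shown with a small sketch. The package and file names below are hypothetical. The same file, which uses a relative import, is first imported as a submodule and then executed as a plain script with `runpy.run_path`.

```python
import importlib
import runpy
import sys
import tempfile
from pathlib import Path


def compare_import_and_execute() -> dict:
    """Import a hook as a package submodule, then execute the same file."""
    pkg = "demo_hooks_pkg"  # hypothetical package name
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / pkg
        pkg_dir.mkdir()
        (pkg_dir / "__init__.py").write_text("")
        (pkg_dir / "prereq.py").write_text("VALUE = 1\n")
        (pkg_dir / "hook.py").write_text(
            "from . import prereq\nRESULT = prereq.VALUE + 1\n"
        )

        # 1. Import semantics: the relative import resolves inside the package.
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module(f"{pkg}.hook")
            results["import"] = module.RESULT
        finally:
            sys.path.remove(tmp)
            for name in [n for n in sys.modules if n == pkg or n.startswith(pkg + ".")]:
                del sys.modules[name]

        # 2. Execute semantics: the file has no parent package, so the
        #    relative import cannot be resolved.
        try:
            runpy.run_path(str(pkg_dir / "hook.py"))
            results["execute"] = "ok"
        except ImportError as exc:
            results["execute"] = type(exc).__name__
    return results


print(compare_import_and_execute())
```

Under "execute" semantics each script is standalone, which is what rules out `from . import ...` between hooks.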

I’d also dispute namespace packages making this easier to teach. I regularly teach very smart engineers about them, and they ALWAYS start with misconceptions and finish with different misconceptions. They are complicated and get messy, and the sys.path resolution behaviour is never what anyone really expects (particularly the security implications). At best, this addition might make namespace packages easier to teach, but then, the things that need teaching are not things that we want this feature to have.

I’ve written enough now that I forget the rest. Once I’m back at work and a proper PC I’ll take another look.

mariocj89 · December 31, 2020, 3:12pm

Thanks for the feedback!

You touched some points that are indeed quite important to discuss:

On namespace package vs site path:

On “being easier to understand”, I agree it is quite subjective, and namespace packages are indeed not that trivial.
I like the idea of it being a package as it provides a nice way of declaring dependencies should that ever be needed. I agree it is not a killer feature, and that if it were, we could find other ways, but I do not see any benefit in it not being a package. If your argument is that it’ll make it safer and you have some example use-cases, that might be fair and it might be worth moving towards the “folder with scripts” approach.

Please be clear whether each file will be imported or merely have its code executed (I would vote for the latter).

I was indeed thinking of import, as that allows one script to say it depends on another by importing it. They can indeed use relative imports and just do from . import prereq_sitecustomize. Do you see any issue with the implementation importing rather than executing the script?

Using -S isn’t an option, because that would omit pip itself, so making sure that “allow site but without customizations” is relatively easy to achieve will probably be necessary.

If we need a specific flag or option in site, I’m OK with it. I just felt that there were already enough options to disable/enable things at startup, but if there is the need and the wish, I’m fine with it.

pf_moore · December 31, 2020, 6:06pm

However, it encourages the anti-pattern of having modules that do significant work on import. I feel that the idea of import xxx being a cheap operation that can be executed at any time is an important one, and we shouldn’t dismiss that lightly.

Also, using import to handle dependency management seems like a bad fit, as it implies that we could end up with

__sitecustomize__.a:

import b

# some other work, that doesn't use b

The idea that import b is used for side effects only, and the name b never gets used, seems wrong to me. (Many linters will complain about an “unused import”).

So I’m -1 on having the feature use a package, and import semantics. IMO, it should be a named directory (with well-defined locations) and “run the script” semantics.

uranusjr · December 31, 2020, 7:46pm

Also, depending on other __sitecustomize__ modules is likely to create complicated load-time dependency issues people will want Python to “fix” for them. Site-customising scripts tend to have side effects, and allowing a to import b would encourage users to rely on brittle import ordering. Does this import make b get executed before a? Is b guaranteed to be executed only once? This will get extremely messy fast, and it’s better to avoid the problem altogether.
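The "executed only once?" question can be illustrated with a sketch under stated assumptions: a hypothetical runner that executes every script in a hooks directory in name order, with that directory also importable. The script names are made up, and nothing here is the PEP's actual behaviour.

```python
import importlib
import runpy
import sys
import tempfile
from pathlib import Path


def count_prereq_runs() -> int:
    """Run every hook script once and count how often the prerequisite ran."""
    with tempfile.TemporaryDirectory() as tmp:
        hooks = Path(tmp)
        # The prerequisite records each execution in a log file next to it.
        (hooks / "demo_prereq_hook.py").write_text(
            "from pathlib import Path\n"
            "with open(Path(__file__).with_name('runs.log'), 'a') as log:\n"
            "    log.write('ran\\n')\n"
        )
        # This hook imports the prerequisite purely for its side effect.
        (hooks / "demo_main_hook.py").write_text("import demo_prereq_hook\n")

        sys.path.insert(0, tmp)  # assumption: the hooks directory is importable
        importlib.invalidate_caches()
        try:
            # Hypothetical runner: execute each script in name order.
            for script in sorted(hooks.glob("*.py")):
                runpy.run_path(str(script))
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_prereq_hook", None)
        return len((hooks / "runs.log").read_text().splitlines())


print(count_prereq_runs())
```

In this setup the prerequisite runs twice: once when `demo_main_hook` imports it and once when the runner executes it as a script. The import cache only deduplicates imports, not script executions.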

ncoghlan · January 1, 2021, 2:22am

Definite +1 from me for replacing the pth file hack with a properly defined startup customisation mechanism.

However, as others have noted, while using a namespace package for this has its attractions, it creates new problems that even the “pth file with side effects” trick doesn’t encounter. In particular, site.py doesn’t scan the entirety of sys.path for pth files, it only scans “site package directories”. Using my Fedora system Python as an example:

$ python3 -c "import site; print(site.getsitepackages()); print(site.getusersitepackages())"
['/usr/local/lib64/python3.9/site-packages', '/usr/local/lib/python3.9/site-packages', '/usr/lib64/python3.9/site-packages', '/usr/lib/python3.9/site-packages']
/home/ncoghlan/.local/lib/python3.9/site-packages

That’s a shorter list than the full default sys.path. Most notably, even when running in non-isolated mode, the inferred sys.path[0] is never scanned for pth files, and it shouldn’t be scanned for __sitecustomize__ directories either.

Since we don’t want to scan the entirety of sys.path, that means “Python scripts in a specially named subdirectory of site package directories” is a better fit for the problem than an importable namespace package. To block the “subdirectories are namespace packages by default” behaviour, the interpreter should ship with an __init__.py in the default system and user site packages directories that raises an ImportError that states that __sitecustomize__ is not for importing, it’s for code that runs at startup.

If folks want to depend on other things having already been run from their customisation scripts, then those things need to be moved out of the customization scripts and into ordinary importable modules (which shouldn’t be a major burden, as the existing pth file hack has the same limitation).
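The approach described in this post could be sketched roughly as follows. This is only an illustration: the name-based ordering, the error handling, and the helper names (`default_site_dirs`, `find_hook_scripts`, `run_hooks`, `demo`) are assumptions, not something the PEP or site.py defines.

```python
import importlib
import runpy
import site
import sys
import tempfile
from pathlib import Path

HOOK_DIR_NAME = "__sitecustomize__"  # directory name used in this thread


def default_site_dirs() -> list:
    """Site-packages directories, the same ones scanned for .pth files."""
    return [*site.getsitepackages(), site.getusersitepackages()]


def find_hook_scripts(site_dirs) -> list:
    """Collect hook scripts, sorted by file name within each directory."""
    scripts = []
    for site_dir in site_dirs:
        hook_dir = Path(site_dir) / HOOK_DIR_NAME
        if hook_dir.is_dir():
            scripts.extend(
                sorted(p for p in hook_dir.glob("*.py") if p.name != "__init__.py")
            )
    return scripts


def run_hooks(site_dirs) -> list:
    """Execute (not import) each script and return the names that ran."""
    ran = []
    for script in find_hook_scripts(site_dirs):
        try:
            runpy.run_path(str(script))
        except Exception as exc:  # assumption: a failing hook must not abort startup
            print(f"hook {script.name} failed: {exc!r}", file=sys.stderr)
        else:
            ran.append(script.name)
    return ran


def demo() -> tuple:
    """Exercise the sketch against a throwaway site directory."""
    with tempfile.TemporaryDirectory() as tmp:
        hook_dir = Path(tmp) / HOOK_DIR_NAME
        hook_dir.mkdir()
        # Guard file that blocks "import __sitecustomize__".
        (hook_dir / "__init__.py").write_text(
            "raise ImportError('__sitecustomize__ holds startup scripts; "
            "it is not for importing')\n"
        )
        (hook_dir / "20_second.py").write_text("x = 2\n")
        (hook_dir / "10_first.py").write_text("x = 1\n")
        ran = run_hooks([tmp])

        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module(HOOK_DIR_NAME)
            importable = True
        except ImportError:
            importable = False
        finally:
            sys.path.remove(tmp)
            sys.modules.pop(HOOK_DIR_NAME, None)
    return ran, importable


print(demo())
```

Because the scripts are run by path rather than imported, the guard `__init__.py` does not interfere with them. It only stops the directory from being picked up as a package.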

steve.dower · January 1, 2021, 10:35am

This PEP has the advantage of having been written already, but another idea that’s been kicking around is to formalise entry points (I believe @jaraco has said that was always a goal).

Along with all the other benefits, that would make startup customisation just a special entry point (with a set of registered import/functions to call).

mariocj89 · January 1, 2021, 10:57am

I’ll update the PEP to then just search for files within __sitecustomize__ in the site paths. Thanks both for the feedback!

pf_moore · January 1, 2021, 11:25am

This raises the question - should the new mechanism only use “site package directories” or the full sys.path? If it’s a replacement for executable code in .pth files, then it should use the site package directories. But if it’s intended to work like sitecustomize/usercustomize, it needs to use the wider sys.path (but not all of it - the customize modules are imported early in the startup sequence, before the current directory is added to sys.path).

Currently the rationale suggests that the new mechanism replaces both of these - but as they don’t work the same, it can’t be a complete replacement for both…

ncoghlan · January 3, 2021, 2:37pm

While the interpreter would only look for __sitecustomize__ hooks in site-package directories, imports from those hooks would use the full sys.path like normal.

That said, we may want to reconsider programmatic deprecation of sitecustomize and usercustomize, and go with a documented deprecation instead.

That way if folks are using those for a purpose that the new hooks don’t cover, they can just keep using them. Only folks relying on the “executable code in pth files” hack would face having their code eventually break if they didn’t migrate to the new mechanism.

mauve · January 5, 2021, 3:44pm

I can see some excellent use cases for this but I also think it could be a huge source of pain for my firm and many end users.

The part I have a problem with is opting into start-up behaviour only as a result of having taken a dependency. One problem with this is transitivity: I will get the __sitecustomize__ behaviours of not just packages that I’ve selected for their __sitecustomize__ effect - but I will also get the __sitecustomize__ behaviours of the dependencies of libraries I’ve selected as dependencies, all their dependencies - everything in the dependency graph. If one of my dependencies’ dependencies’ dependencies thinks that betterexceptions is cool then I get it with no good way of turning it off.

Perhaps well-behaved packages will split their distributions into two, the functional part and the __sitecustomize__ hook, and make the __sitecustomize__ hook an extra, so that if I want the __sitecustomize__ behaviour I can depend on betterexceptions[sitecustomize], for example. But splitting a package in two seems like a lot more effort in terms of packaging and releasing so I can’t see it happening extensively.

Internally to my firm we have very deep dependency graphs. Each application developer doesn’t have exclusive control over what is added to their sys.path - there may be 20 other developers whose choices get unioned into the dependency set. Practically this will result in substantial negotiation between owners of libraries and applications about whether to keep or discard a dependency with __sitecustomize__ behaviours, which adds friction to the development process.

This problem also exists in entry points but most of the common use cases for entry points as plugins have some ability to manage which plugins are in effect. pytest requires you to both install a plugin and pass --with-plugin iirc. flake8 enables all plugins by default but you can disable them via config. Both Django and Sphinx don’t automatically discover plugins using entry points: they require adding them to INSTALLED_APPS and extensions respectively, and there are good reasons for it to work like that.

I’d be happy with any way to opt out of “automagic” effects and switch to an explicit list of sitecustomize hooks to enable. Actually, let me throw out a proposal:

If sitecustomize can be imported, import it.
Otherwise, do this __sitecustomize__ namespace package logic.

Then my opt-out is obvious: write a sitecustomize.py, and have it import the __sitecustomize__.xxx packages I actually want enabled.

This would also imply that the __sitecustomize__ behaviour SHOULD be optional: a package should still work when imported even if its __sitecustomize__ hook has not yet been run.
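As a rough sketch of the proposal above: the `run_hooks` callable stands in for whatever the __sitecustomize__ mechanism ends up being, and the `module_name` parameter exists only so the sketch can be tried without a real sitecustomize module. Both are hypothetical.

```python
import importlib
import importlib.util


def startup_customize(run_hooks, module_name: str = "sitecustomize") -> str:
    """Prefer an explicit sitecustomize module; otherwise run the hooks."""
    if importlib.util.find_spec(module_name) is not None:
        # Opt-out path: the module decides which hooks, if any, to enable.
        importlib.import_module(module_name)
        return "module"
    # Default path: no sitecustomize present, so run every discovered hook.
    run_hooks()
    return "hooks"


calls = []
print(startup_customize(lambda: calls.append("ran"), "no_such_module_for_this_demo"))
print(calls)
```

With this shape, an installation that ships its own sitecustomize.py never runs the automatic hooks unless that file asks for them.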

mariocj89 · January 6, 2021, 10:26am

Thanks for the feedback @mauve, but I think what you are proposing serves a different use-case. The kind of customization you mention probably fit more as entry-points indeed. I’d expect that if a library injects a file into __sitecustomize__ it is because it needs some customization at load time or because it needs to change something about the interpreter that I’d expect to be a key feature of the library (see virtualenv or betterexceptions).

Whilst I understand the concern of “what if I depend on something transitively”, I’d be surprised to find out that one of your dependencies depends on something that changes the interpreter like betterexceptions.

You also mention things like plugins for django, I’d expect those will not place a script within __sitecustomize__, as they don’t really need to change in any way the interpreter at startup, those are customizations for the application itself.

Lastly, apps can today add things at startup via pth files, we are providing a better way to do it. I agree that enhancing this can result in more people writing this kind of script, but I’d be surprised if things like django plugins start to do it. We can reinforce the purpose of this feature in the documentation though.

Also, note that the whole site can be disabled at any time should you need that in your application when you are packaging things up.

mauve · January 6, 2021, 11:10am

The kind of customization you mention probably fit more as entry-points indeed.

I mentioned entry points as an example of a problem we already experience: packages having an effect simply by being installed. I’m not arguing for using sitecustomize hooks as a replacement for entry points.

it needs to change something about the interpreter that I’d expect to be a key feature of the library (see virtualenv or betterexceptions).

The problem is that it is possible to depend on a library but not use it. In my company’s monorepo this turns out to be very easy, essentially because of transitive dependencies and the fact that humans like to organise their source tree around their mental models of a problem domain, rather than by carefully thinking about the effect they will have on projects that depend on them.

What is the use case in virtualenv? I don’t see that in the PEP.

Also, note that the whole site can be disabled at any time should you need that in your application when you are packaging things up.

By adding -S? We actually cannot do that; it breaks certain things. We would have to patch site.py to just not do this, which would be problematic to maintain.

Lastly, apps can today add things at startup via pth files, we are providing a better way to do it.

We’ve had to fight with libraries that dropped .pth files in the past. Besides, there is an admonition in the docs not to use this feature to change the interpreter:

Note: An executable line in a .pth file is run at every Python startup, regardless of whether a particular module is actually going to be used. Its impact should thus be kept to a minimum. The primary intended purpose of executable lines is to make the corresponding module(s) importable (load 3rd-party import hooks, adjust PATH etc). Any other initialization is supposed to be done upon a module’s actual import, if and when it happens. Limiting a code chunk to a single line is a deliberate measure to discourage putting anything more complex here.
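For readers unfamiliar with the mechanism that note describes, here is a small sketch. The file name `demo_hook.pth` and the environment variable `DEMO_PTH_RAN` are made up. It writes a .pth file containing an executable line and processes its directory with `site.addsitedir`, the same function site.py uses for site-packages directories.

```python
import os
import site
import sys
import tempfile
from pathlib import Path


def pth_demo() -> str:
    """Process a .pth file with an executable line and report whether it ran."""
    with tempfile.TemporaryDirectory() as tmp:
        # Lines starting with "import " are executed; other lines are
        # treated as paths to append to sys.path.
        Path(tmp, "demo_hook.pth").write_text(
            "import os; os.environ['DEMO_PTH_RAN'] = 'yes'\n"
        )
        saved_path = list(sys.path)
        try:
            site.addsitedir(tmp)
            return os.environ.pop("DEMO_PTH_RAN", "no")
        finally:
            sys.path[:] = saved_path  # addsitedir also appended tmp to sys.path


print(pth_demo())
```

The whole customization has to fit on a single line beginning with `import`, which is the deliberate limitation the documentation note refers to.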

mauve · January 6, 2021, 11:43am

The problem is that it is possible to depend on a library but not use it.

In fact, to generalise this objection, the problem is that it is possible to have a package installed but not use it; it doesn’t have to come from a dependency.

This affects OS distribution packages particularly. For example Ubuntu installs all Python packages into /usr/lib/python3/dist-packages/ and everything that runs with /usr/bin/python3 would share the same sitecustomize hooks.

Say we have two independent Python apps, “photobunny” and “gitzone”. Say additionally that gitzone uses a dependency called lazyimports which provides a sitecustomize hook to modify the import system. gitzone was developed to require lazyimports, but lazyimports breaks Python programs that weren’t developed to use it. Then the Debian packaging conventions will have gitzone install its dependency python3-lazyimports into /usr/lib/python3/dist-packages. Then photobunny will run the sitecustomize hook for lazyimports at start, and photobunny will break.

pf_moore · January 6, 2021, 11:55am

… in which case, gitzone should never be installed to use a shared Python interpreter, because as you say it breaks all other Python programs using that interpreter.

Either gitzone should be packaged with its own embedded interpreter, or it should be built in such a way that it only enables its breaking import changes for itself. Remember, you can do this already with .pth files, so this is not new behaviour. If Debian packaging conventions prevent gitzone being deployed in a way that doesn’t break other Python programs, then that’s a failure in Debian’s packaging conventions, not in Python (or the library initialisation mechanisms it provides).

In reality, though, I’d say that lazyimports (as you describe it) is broken, and should be fixed to not break the Python installation before any project depends on it.

steven.daprano · January 6, 2021, 12:08pm

Daniel:

"The problem is that it is possible to depend on a library but not use it."

If you have a dependency that is not actually used anywhere, and could be safely replaced by an empty file, then is it really a dependency?