PEP 829: Structured Startup Configuration via .site.toml Files

Are you as tired of .pth files as I am[1]? Here’s my proposal for a more structured replacement. Thanks to @brettcannon and @emmatyping for their early feedback.


  1. and have been for years ↩︎


I recommend either getting specific about the TOML version or specifying that the version is “the latest supported by tomllib”.

We’re missing this in packaging specs and I’m still puzzling over what (if anything) to do about it.


First feedback - this can’t be a Packaging PEP if it’s going to modify site.py. It’s got to be Standards Track.

  • As with .pth files, packages may optionally install a single <package>.site.toml, just like the current .pth file convention.

None of this is specified anywhere - the common case is that packages usually don’t install more than one, but there’s absolutely no other constraints here. Packages can install as many as they like, and name them anything they like.

The wording here (and in many other places) implies a convention that doesn’t exist for .pth files, which misrepresents the status quo. That bothers me.

  • The discovery rules for <package>.site.toml files is the same as <package>.pth files today. File names that start with a single . (e.g. .site.toml) and files with OS-level hidden attributes (UF_HIDDEN, FILE_ATTRIBUTE_HIDDEN) are excluded.

These aren’t really “discovery rules”, more like reasons to ignore certain files (and they aren’t particularly sensible reasons either… why did we add these checks?)
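For readers who haven't looked at these checks, the exclusions quoted above can be sketched roughly as follows. This is an illustration only, not the actual site.py code; `is_ignored` is a hypothetical helper name, and the file names are made up.

```python
import os
import stat
import tempfile


def is_ignored(path):
    """Return True if a startup file would be skipped (hypothetical helper)."""
    # Names with a leading dot are excluded.
    if os.path.basename(path).startswith("."):
        return True
    st = os.lstat(path)
    # st_flags exists on macOS/BSD; st_file_attributes exists on Windows.
    flags = getattr(st, "st_flags", 0)
    attrs = getattr(st, "st_file_attributes", 0)
    return bool(flags & stat.UF_HIDDEN) or bool(attrs & stat.FILE_ATTRIBUTE_HIDDEN)


with tempfile.TemporaryDirectory() as d:
    visible = os.path.join(d, "foo.site.toml")
    dotted = os.path.join(d, ".site.toml")
    for p in (visible, dotted):
        open(p, "w").close()
    results = {os.path.basename(p): is_ignored(p) for p in (visible, dotted)}

print(results)
```

On platforms without either attribute, only the leading-dot rule has any effect in this sketch.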

  • The processing order is alphabetical by filename, matching .pth behavior.

Strictly speaking it’s just an OS-defined order. You might want to take the opportunity to specify an actual order (presumably case-sensitive ordinal, as that’s simplest and handles all possible characters).

Relative paths are anchored at the site-packages directory (sitedir)

Slightly more convenient to say “the directory that contains the .toml file”.

While only {sitedir} is defined in this PEP, additional placeholder variables (e.g., {prefix} , {exec_prefix} , {userbase} ) may be defined in future PEPs.

Did you consider using the current sysconfig scheme’s paths here?

This PEP improves the security posture of interpreter startup

Not without removing .pth processing, but even then, it seems like the only improvement would be to avoid a denial-of-service attack where a broken .pth file is replaced by a safely-handled broken .site.toml file.

  • io.open_code() is used to read <package>.site.toml files, ensuring that audit hooks (PEP 578) can monitor file access.

There’s no need for this, since the TOML file can’t contain executable code. The existing hooks for imports and use of open_code by the standard importers will cover anything that actually runs here.
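As background on what audit hooks already see: to my understanding, in CPython a plain `open()` and `io.open_code()` both raise the `open` audit event when no custom open-code hook has been installed through the C API. A small experiment along these lines can confirm it on a given build (the file name is made up, and the hook stays installed for the life of the process):

```python
import io
import os
import sys
import tempfile

opened = []


def hook(event, args):
    # Record the path of every "open" audit event. Hooks cannot be removed.
    if event == "open":
        opened.append(args[0])


sys.addaudithook(hook)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "foo.site.toml")
    with open(path, "wb") as f:
        f.write(b"[metadata]\nschema_version = 1\n")

    opened.clear()
    with open(path, "rb") as f:      # ordinary open
        f.read()
    seen_plain = path in opened

    opened.clear()
    with io.open_code(path) as f:    # open intended for executable code
        f.read()
    seen_open_code = path in opened

print(seen_plain, seen_open_code)
```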


It seems to me that most of the improvements offered here (apart from the structured file format) could be pretty safely applied to .pth files already, without any compatibility impact. There’s no reason we couldn’t read them all before processing, or apply policy (at a later date), or add structured metadata (in comments, for compatibility). Even path substitutions could be added - at worst, we check that the path with literal braces in it doesn’t exist before making replacements. IOW, we could have the practical benefits in either approach, leaving this proposal as only the difference between file formats.

As mentioned above, I wouldn’t want to add a naming convention based on packaging to a core definition of .pth files, but I guess if you want to introduce that here then feel free. I’d probably move all that stuff to a “how to teach this” section as a recommended good practice but leave the definition allowing arbitrary filenames as today.


Finally, I’m concerned about the performance impact. Startup time is one of our biggest “issues”, and for all the work we’ve done to improve it, this is exactly the kind of thing that will immediately regress it. At least you’re not looking in subdirectories for the files, as was ~suggested previously! But for the limited benefit available, I think a performance regression on startup of any amount is a very big deal.


The startup cost for using toml is huge. On my machine I get

> time /e/git/cpython/python -c ''
Executed in   37.97 millis
> time /e/git/cpython/python -c 'import tomllib'
Executed in   89.24 millis
> time /e/git/cpython/python -c 'import tomllib; tomllib.load(open("foo.site.toml", "rb"))'
Executed in  140.84 millis

I also found the use of the distribution name rather jarring. Surely if there’s going to be a convention then it should be the import name?


I looked at /usr/lib/python3.14/site-packages/*.pth files on my Fedora 43 and I found:

  • abrt3.pth
  • beaker_client-29.2-py3.14-nspkg.pth
  • distutils-precedence.pth
  • google_cloud_core-2.3.3-py3.14-nspkg.pth
  • Paste-3.10.1-py3.14-nspkg.pth
  • protobuf-3.19.6-py3.14-nspkg.pth
  • straight.plugin-1.5.0-py3.14-nspkg.pth

abrt3 and distutils-precedence can be converted to a PEP 829 entry point: they execute arbitrary code, so there is nothing interesting to standardize.

beaker_client, google_cloud_core, Paste, protobuf, straight.plugin use code generated by setuptools to create a namespace module. Example with protobuf (reformatted):

import sys, types, os
p = os.path.join(sys._getframe(1).f_locals['sitedir'], *('google',))
importlib = __import__('importlib.util')
__import__('importlib.machinery')
m = sys.modules.setdefault('google', importlib.util.module_from_spec(importlib.machinery.PathFinder.find_spec('google', [os.path.dirname(p)])))
m = m or sys.modules.setdefault('google', types.ModuleType('google'))
mp = (m or []) and m.__dict__.setdefault('__path__',[])
(p not in mp) and mp.append(p)

Would it make sense to add a declarative way to create such a namespace module in TOML as well? I see these parameters:

  • path relative to sitedir
  • sys.modules key (package name)

The google_cloud_core pth file registers two namespaces: google, and then google.cloud. To register google.cloud, there is an additional line:

m and setattr(sys.modules['google'], 'cloud', m)

Are those nspkg pth files even still needed? We’ve had native namespace packages since 3.3 (PEP 420), aren’t these pth files just legacy namespace packages that never got cleaned up?
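For comparison, here is a minimal sketch of what PEP 420 gives you with no .pth file at all. The package name `nsdemo829` and the two "distribution" directories are made up for the demo; the only requirement is that the shared top-level directory has no `__init__.py`.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Two separate "distributions" each contribute one module to the namespace.
    for dist, mod in (("dist_a", "alpha"), ("dist_b", "beta")):
        pkg_dir = os.path.join(root, dist, "nsdemo829")
        os.makedirs(pkg_dir)
        # Deliberately no __init__.py inside nsdemo829/.
        with open(os.path.join(pkg_dir, mod + ".py"), "w") as f:
            f.write(f"NAME = {mod!r}\n")
        sys.path.append(os.path.join(root, dist))

    importlib.invalidate_caches()
    alpha = importlib.import_module("nsdemo829.alpha")
    beta = importlib.import_module("nsdemo829.beta")
    names = (alpha.NAME, beta.NAME)
    # The namespace package's __path__ lists one portion per directory found.
    portions = len(sys.modules["nsdemo829"].__path__)

print(names, portions)
```

Whether the existing nspkg .pth files can simply be dropped is a separate question, since they also cover cases such as older interpreters that this sketch does not model.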

Agreed, and you can take that as a formal statement by the packaging PEP-delegate.

By tying these new .site.toml files to the package name, you make it impossible to replace other uses of .pth files. For example, a user (or distributor) simply adding a .pth file to customise sys.path. Basically, not all Python code in site-packages can be assumed to be part of a “package” in the packaging/wheel sense.

Also, the editables project, which is used to implement editable installs for build backends, injects a .pth file into a package. This needs to work even if the project has its own .pth file. So we’d need a mechanism for packages to have more than one .site.toml file.

And finally, if these files were required to use the package name, wouldn’t that artificially penalise packages based on their name, with there being no way to ensure that one package’s initialisation happens before another, except by carefully naming the packages themselves? While I doubt it would ever be common, I can imagine two closely related packages wanting to co-ordinate their startup like this.

From the PEP:

init – a list of strings specifying entry point references to execute at startup. Each item uses the standard Python entry point syntax: package.module:callable.

Is it not more accurate to say that the strings are object references, as defined in the entry point spec? That avoids the need to exclude extras.
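For reference, resolving an object reference of this shape takes only a few lines. This is a sketch modelled on the approach described in the entry points specification, not code from the PEP's reference implementation; `resolve` is a hypothetical name.

```python
import importlib


def resolve(ref):
    """Resolve 'package.module:attr.path' or a bare 'package.module'."""
    modname, _, qualname = ref.partition(":")
    obj = importlib.import_module(modname)
    # A bare module name (no colon) means "just import the module".
    for attr in filter(None, qualname.split(".")):
        obj = getattr(obj, attr)
    return obj


join = resolve("os.path:join")
mod = resolve("json")
print(join.__name__, mod.__name__)
```

Nothing here needs the "extras" part of the entry point syntax, which is why "object reference" may be the more precise term.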

Although maybe you don’t want to reference packaging standards at all here? If the packaging community were to extend the definition of entry points to allow new syntax[1], would the core implement the new standard?

Build backends and installers should generate <package>.site.toml files alongside or instead of <package>.pth files

Build backends and installers don’t generate .pth files (with the already noted exception of the .pth file used for editable installs). Such files are simply package files like any other, created by the developer and copied unmodified by packaging tools.

Build backends SHOULD ensure that the <package> prefix matches the package name.
Installers MAY validate or enforce that the <package> prefix matches the package name.

As I’ve already noted, requiring that there’s only one .site.toml file per package is probably not going to work, so I don’t think tools should try to enforce these rules.


  1. Unlikely, but possible ↩︎


Being direct, and I hope not too blunt in my feedback here: if this is the direction, I would end up recommending that all use of Python include disabling site customization (e.g. python -S). The startup-time cost of relying on the standard library’s TOML implementation doesn’t seem worth the benefit of a more modern-looking format for the information.

I also am not swayed by any claim that this is any more secure. If it would be beneficial to elaborate on this, I can do so, but I also don’t want to spend time retreading those discussions if unnecessary. When it comes to auditability, while it might be slightly easier for some people to reason about, anything more than simple actions in existing .pth files should already be a reason to be alarmed, or a reason to move those actions to the library’s own initialization at runtime, at import, in a context manager, or by some other means that ties the cost to actual use.

As for deterministic ordering if this is implemented, I would go with treating the paths as bytes as reported by the underlying filesystem, and sorting them as python sorts bytes. The actual order just needs to exist and be reliable, neither users nor library authors should really be relying on the actual order.
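A rough sketch of the bytes-based ordering described above, assuming the names are listed as bytes straight from the filesystem (the file names are invented for the demo):

```python
import os
import tempfile

NAMES = ("beta.site.toml", "Zeta.site.toml", "_early.site.toml", "alpha.site.toml")

with tempfile.TemporaryDirectory() as d:
    for name in NAMES:
        open(os.path.join(d, name), "w").close()
    # Passing a bytes path makes os.listdir return bytes names.
    raw = os.listdir(os.fsencode(d))
    # Python compares bytes objects byte by byte, so the order is fixed.
    ordered = sorted(n for n in raw if n.endswith(b".site.toml"))

print(ordered)
```

The resulting order is stable but not "alphabetical" in the everyday sense: uppercase letters and the underscore sort before lowercase letters.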


I have doubts about your benchmark. You should run Python with -S. Did you build Python from source in debug mode?

On my Fedora 43 with Python 3.15 built in release mode, I measured that importing tomllib takes 9.5 ms (7.77 ms → 17.3 ms). I wrote an optimization to reduce the startup time to 0.9 ms (10.6x faster).

Parsing the TOML file only takes 42.9 us:

$ ./python -m pyperf timeit -s 'import tomllib' 'tomllib.load(open("package.site.toml", "rb"))' 
Mean +- std dev: 42.9 us +- 1.3 us

$ cat package.site.toml 
[metadata]
schema_version = 1

[paths]
dirs = ["../lib", "/opt/mylib", "{sitedir}/extra"]

[entrypoints]
init = ["foo.startup:initialize", "foo.plugins"]

Yeah, I know this issue. I’ll try to tighten up the language here, but it’s also a bit tricky because if the package supports multiple versions of Python, the <package>.site.toml file needs to be readable by the tomllib from the oldest supported Python.


Yep! Fixed.

That won’t change here, but fair enough. I’ll change the language to prefer a single one per package. I believe the PEP is flexible enough that multiple files won’t be necessary.

A little repo archeology uncovers this issue, which seems reasonable and sensible.

Is “alphabetical by filename” not specific enough? The code in the reference implementation is literally:

    # Phase 1: Discover and parse .site.toml files, sorted alphabetically.
    toml_names = sorted(
        name for name in names
        if name.endswith(".site.toml") and not name.startswith(".")
    )
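To make the ordering question concrete: `sorted()` on str names compares Unicode code points, which is deterministic but case-sensitive. The sketch below copies the filter from the snippet above into a hypothetical `discover` function and runs it on an invented directory listing.

```python
def discover(names):
    """Mirror the filter-and-sort shown above (illustrative copy)."""
    return sorted(
        name for name in names
        if name.endswith(".site.toml") and not name.startswith(".")
    )


listing = ["beta.site.toml", "Zeta.site.toml", ".site.toml",
           "alpha.site.toml", "alpha.pth"]
ordered = discover(listing)
print(ordered)
```

Here `Zeta.site.toml` comes first because uppercase letters have lower code points than lowercase ones, so "alphabetical" in the PEP text would be more precisely described as code-point order.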

I did, but it seems like overkill. The PEP does defer additional placeholders for the future.

The PEP is upfront about not claiming to be a complete solution, but I do think it improves things, because a declarative TOML file naming entry points is easier to scan and validate than arbitrary exec’d code.

This also gives us an avenue for more control over exactly what gets executed at start up time, although again the PEP is clear about deferring that entire mechanism to a later date.

Also because the presence of a site.toml file supersedes a parallel pth file, you don’t need to remove pth processing, at least from Python 3.15 forward[1].

Fair enough.

I don’t see how. If you had a package that supported Python 3.14 and 3.15-with-PEP-829, how could you ship an enhanced-pth file that would work on both?

It’s worth measuring that, which admittedly I haven’t done yet. I’m as concerned about startup time as anyone, but let’s see what the impact is, if any, before getting worried.

On a positive note, the reference implementation does use lazy imports, so if there are no TOML files, there should be no additional site.py overhead.


  1. assuming this were to get accepted for 3.15 of course ↩︎

It’s worth spending some time making TOML parsing faster, but the import itself is lazy, and it would be amortized over all site.toml files found. It’s worth breaking down the tomllib.load() vs open()+read parts of that 140ms.

I believe distribution name is the convention for pth files, but as @steve.dower pointed out, there’s nothing on the interpreter side that enforces those names.


Thanks! I have plans to look at some existing pth files to make porting suggestions, and also editable installs.

I have the same question!

As mentioned above, the language in the first draft of the PEP was overly restrictive. There’s nothing preventing those other uses too.

I must have missed this on an admittedly simple grep of the editables repo. Do you have a pointer?

There’s no way to do this with pth files currently, and site.toml files use the same mechanism, so that’s about the only way to do it. As the PEP says, there’s no “external arbiter” for independent packages to coordinate their startup order, except by carefully crafting their pth (soon 😜 site.toml) file names.

I agree it’s likely rare, and I’ve thought about ways it could be done, but I definitely don’t want to tackle that in this PEP!

I’ll change the text to say that the syntax is inspired by packaging entry point syntax (true!), but isn’t explicitly tied to it.

Good point. Do you think any of the tool maker recommendations are worth keeping?

Thanks Victor. I’m happy to dig into this a little more with you, and will take a look at your draft PR. Keep in mind too that as mentioned above, my reference implementation lazy imports tomllib, so unless there actually are site.toml files, tomllib’s import overhead won’t apply anyway.

I’ve outlined why I think it is more secure, although of course there’s no claim of totally secure. I’ve also fleshed out the details in the next draft of the PEP.

With hindsight, the use of the name <project_name>.pth was probably ill-advised. Although given that editable installs (at least as implemented by the editables project) don’t support .pth files, because the core mechanism doesn’t have a way to load them from an arbitrary location, the filename that I use isn’t going to clash with any .pth files supplied by the project in practice.

The .pth file here serves two functions:

  1. To add directories to sys.path as specified by the project.add_to_path() method.
  2. To set up the import hook needed for project.map() by running (importing) a bootstrap module on interpreter startup.

Both of those could be replaced with PEP 829, but to be perfectly honest, I don’t see the value. The .pth file I use is very straightforward, and easily auditable. Yes, people could write more complex .pth files, but if that’s the real concern, why not simply deprecate (and ultimately remove support for) lines in .pth files that start with import and contain anything more than a single (possibly dotted) module name?
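As a rough illustration of how small that restriction could be, the check below accepts an import line only if it names a single dotted module. The regular expression and the `is_simple_import` name are my own sketch, not anything from site.py or the PEP.

```python
import re

# "import", whitespace, then exactly one dotted module name and nothing else.
SIMPLE_IMPORT = re.compile(r"import[ \t]+[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*[ \t]*")


def is_simple_import(line):
    """Return True if the .pth line only imports one named module."""
    return SIMPLE_IMPORT.fullmatch(line.rstrip("\r\n")) is not None


expected = {
    "import foo": True,
    "import foo.bootstrap\n": True,
    "import foo; foo.run()": False,
    "import sys, types, os": False,
}
results = {line: is_simple_import(line) for line in expected}
print(results)
```

Lines that fail the check are exactly the ones that embed code after the import, such as the setuptools-generated namespace lines shown earlier in the thread.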

If you want a more complex example, setuptools uses .pth files in its implementation of editable installs here, although I don’t fully understand the (multiple) methods they use.

Yes, and that was my point. If you require .site.toml files to use the package name, even carefully crafting the filename isn’t an option. But if you’re removing the restriction that the package name must be used, my point is addressed.

No. Tool makers should have no interaction with .pth/.site.toml files, unless they are writing them for their own purposes, in which case they aren’t “tool makers” but “startup configuration writers”, who should know the details of the feature they are using anyway.

One concern I do have is that if the mere existence of a .site.toml file disables processing of a similarly-named .pth file, what’s to stop a package either maliciously or accidentally disabling an existing .pth file? Given that there’s no restriction on .pth file names, naming restrictions on .site.toml files can’t prevent this. At the very least, this should be noted in the “Security Implications” section of the PEP.
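To show the shape of that concern, here is a sketch of a supersede rule. It assumes "similarly-named" means the same stem before `.pth` and `.site.toml`, which is my reading and may not match the PEP's exact rule; the function and file names are hypothetical.

```python
def effective_files(names):
    """Return (toml_files, pth_files) after applying the assumed supersede rule."""
    tomls = sorted(n for n in names if n.endswith(".site.toml"))
    stems = {n[: -len(".site.toml")] for n in tomls}
    # A .pth file is skipped when a .site.toml with the same stem exists.
    pths = sorted(
        n for n in names
        if n.endswith(".pth") and n[: -len(".pth")] not in stems
    )
    return tomls, pths


listing = ["editable_hook.pth", "other.pth", "editable_hook.site.toml"]
tomls, pths = effective_files(listing)
print(tomls, pths)
```

Under this rule any package that ships `editable_hook.site.toml` silently turns off `editable_hook.pth`, regardless of who installed the .pth file.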

While on the subject of the “Security implications” section, I’ll note that all of the security benefits noted there could be achieved by my suggestion above of simply restricting import ... lines in .pth files to importing a single named module. And in fact, by requiring startup code to be isolated in a separately-importable file, it provides even more transparency and auditability (by avoiding the possibility that the startup code is obscured by being embedded in a larger file).

Furthermore, “Deprecate arbitrary code on import lines in .pth files” seems like it could just be a simple PR, rather than needing a whole PEP and a new file format.

Thanks for writing this up! A TOML file is much cleaner, and much better for introspection by tools.

Can you say some more on the intended use cases of these files? The main case I’m aware we want are editable installs, and they alone would allow for a simpler format.

Related to that, should we have a pre-first-line code execution mechanism? All cases I’ve seen so far would have been better without it, and they come with significant drawbacks. Usually, these mechanisms cause problems with tools. For example, type checkers have documentation specifically for setuptools on how to turn off its execution-based editable mechanism because it breaks static analysis. Similarly, they make package discovery harder: .pth files and their behaviors are a big reason uv doesn’t have complete sys.path handling yet (see astral-sh/uv pull request #11670, “Consider system site packages if activated”, by konstin). I’m well aware that some packages do use them right now, but I would consider a new PEP an opportunity to remove some of the technical debt and migrate users to better mechanisms, where required.

Can we enforce that the <package> must be the same as that of the <package>-<version>.dist-info directory that originated it? This would simplify introspection and avoid clashes (we currently would have to scan every RECORD file in the venv to figure out which package it belongs to).

The <package> prefix should match the package name, but just like with .pth files, the interpreter does not enforce this. Build backends and installers MAY impose stricter constraints if they so choose.

Installers MAY validate or enforce that the <package> prefix matches the package name.

I believe these need to be specified, otherwise we get incompatibilities between installers. More realistically, we’d get no constraints at all because no single tool wants to reject packages the others accept. We have a bunch of cases where tools accept even nominally broken files because pip never validated those, people shipped packages with them, and you don’t want to reject packages from PyPI that have already shipped.

schema_version (integer, recommended)

This key should be mandatory or not exist in the first version at all (and tools would only check that there’s no schema_version = 2). I don’t think we get a benefit from an optional key.

Placeholder variables are supported using {name} syntax. The placeholder {sitedir} expands to the site-packages directory where the <package>.site.toml file was found. Thus {sitedir}/relpath and relpath resolve to the same path with the placeholder version being the explicit (and recommended) form of the relative path form.

While only {sitedir} is defined in this PEP, additional placeholder variables (e.g., {prefix}, {exec_prefix}, {userbase}) may be defined in future PEPs.

From my experience with uv, this is going to cause problems when a user inevitably has a directory with curly braces in its name (let’s say a cookiecutter template), then we’d need an escaping mechanism, which leads to its own problems again. Could you expand on what the use cases are here?
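A small sketch of the expansion rule as quoted, to make the brace question concrete. The `expand` helper is hypothetical and uses plain string replacement of the one known placeholder; posixpath is used only to keep the output the same on every platform.

```python
import posixpath


def expand(entry, sitedir):
    """Expand {sitedir} and anchor relative paths at sitedir (hypothetical)."""
    # Replace only the one known placeholder; other braces are left alone.
    expanded = entry.replace("{sitedir}", sitedir)
    return posixpath.normpath(posixpath.join(sitedir, expanded))


sitedir = "/venv/lib/python3.15/site-packages"
explicit = expand("{sitedir}/extra", sitedir)
relative = expand("extra", sitedir)
braces = expand("/src/{cookiecutter.name}/lib", sitedir)

# By contrast, str.format treats every brace pair as a replacement field.
try:
    "/src/{cookiecutter.name}/lib".format(sitedir=sitedir)
    format_failed = False
except KeyError:
    format_failed = True

print(explicit, relative, braces, format_failed)
```

Literal replacement copes with unrelated braces, but a directory actually named `{sitedir}` still cannot be expressed without some escaping rule, and every placeholder added later widens that gap.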

Continue on error rather than abort
The .pth behavior of aborting the rest of a file on the first error is unnecessarily harsh. If a package declares three entry points and one fails, the other two should still run.

A possible alternative is validating this file on install and refusing to install the package if it is invalid, which should prevent invalid site.toml files from ever getting distributed.
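For reference, the continue-on-error behavior quoted from the PEP above amounts to something like the following. This is a sketch with hypothetical names, not the reference implementation, and the error message format is invented.

```python
import sys


def run_entry_points(callables):
    """Call each (name, func) pair; report failures and keep going."""
    failures = []
    for name, func in callables:
        try:
            func()
        except Exception as exc:
            # Report and continue instead of abandoning the remaining entries.
            print(f"Error in startup entry point {name}: {exc!r}", file=sys.stderr)
            failures.append(name)
    return failures


ran = []
entries = [
    ("first", lambda: ran.append("first")),
    ("broken", lambda: 1 / 0),
    ("third", lambda: ran.append("third")),
]
failures = run_entry_points(entries)
print(ran, failures)
```

Install-time validation would catch malformed files, but not an entry point that parses fine and then raises at startup, so the two approaches cover different failures.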

<package>.pth file processing is not deprecated or removed. Both <package>.pth and <package>.site.toml files are discovered in parallel within each site-packages directory. This preserves backward compatibility for all existing (pre-migration) packages. Deprecation of <package>.pth files is out-of-scope for this PEP.

Can you clarify why we’re not deprecating .pth files if we introduce a mechanism that seems much better than .pth files?


As mentioned above, I wouldn’t want to add a naming convention based on packaging to a core definition of .pth files, but I guess if you want to introduce that here then feel free. I’d probably move all that stuff to a “how to teach this” section as a recommended good practice but leave the definition allowing arbitrary filenames as today.

There’s a core tension here where this is technically a CPython behavior, but from a {user, security, tooling} perspective, this is a packaging behavior, and the user experience comes from that. While CPython (apart from importlib.metadata) is technically not packaging-aware, users almost exclusively interact with this as a packaging feature.

Re startup time, another possible option is to use a subset of TOML for those files: tools that write them must use a specific format (which should come out correctly from most TOML writers, but you can also hand-template it). A CPython parser could be handwritten with just some string matching for fast loads, while other tooling can still parse it as TOML. I’m not sure if we want to do this, but I’d prefer it if the alternative is not using a structured format at all.

And finally, if these files were required to use the package name, wouldn’t that artificially penalise packages based on their name, with there being no way to ensure that one package’s initialisation happens before another, except by carefully naming the packages themselves? While I doubt it would ever be common, I can imagine two closely related packages wanting to co-ordinate their startup like this.

I wouldn’t want to support things this complex in pre-startup routines, this sounds like something that should happen by a different mechanism. I would expect it to at least break static analysis tooling.