Pre-PEP: Import name metadata

As threatened/promised, here is my next planned PEP (I haven’t bothered making it an official PEP yet in case the idea is considered downright bad, which would save me writing the PoC and translating this to reST).


Abstract

This PEP proposes extending the core metadata specification for Python packaging to include a new, repeatable field named Import-Name to record the import names that a project owns once installed. This also leads to the introduction of core metadata version 2.5.

Motivation

In Python packaging there is no requirement that a project name match the name that you can import from that project. As such, there is no clean, easy, accurate way to go from import name to project name and vice versa. This makes it difficult for tools that try to help people discover the right project to install when they only know an import name, or that try to report what import names a project will provide once installed.

As an example, a code editor may detect that a user has an unsatisfied import in a selected virtual environment. But with no way to reliably gather the import names that various projects provide, the editor cannot accurately provide the user with a list of potential projects to install to satisfy that import requirement (e.g. it is not obvious that import PIL very likely implies the user wants the Pillow project). This also applies when a user vaguely remembers the project name but not the import name and would have their memory jogged by seeing a list of import names a package provides. Finally, tools would be able to notify users what import names will become available once they install a project.

Various other attempts have been made to solve this, but they all have to make trade-offs. For instance, one could download every wheel for every project release and look at what files are provided, but that’s a lot of CPU and bandwidth for something that is static information (although tricks can be used to lessen the data requests, such as using HTTP range requests to read only the table of contents of the zip file). This sort of calculation is also currently repeated by everyone independently instead of having the metadata hosted by a central index server like PyPI.

Rationale

This PEP proposes extending the packaging core metadata so that build back-ends can specify the highest-level import names that a project provides and owns if installed. Having the back-ends provide the information increases the chances it will be specified and makes adoption easier. It also allows for quick pick-up by people’s toolchains.

By keeping the information to import names a project would own (i.e. not implicit namespace packages, but modules, regular packages, and submodules), it makes it clear which import names are provided exclusively by the project once installed.

By keeping it to the highest-level names that are owned, the data stays small and the implicit namespace packages a project contributes to can still be inferred. It also helps the build back-end record accurate import names when import semantics are the default ones (i.e. the import-related attributes in the sys module have not been manipulated). This should minimize the need for users to provide this information manually for it to be accurate, as regular packages and modules that manipulate import details typically do so for import names below them. Admittedly, this does mean that if someone accidentally releases a single implicit namespace package that only contains submodules then all of those submodules would be individually listed.

Because this PEP introduces a new field to the core metadata, it bumps the latest core metadata version to 2.5.

Specification

The Import-Name field is a “multiple uses” field. Each entry of Import-Name represents an importable name that the project provides. The names provided MUST be importable via some artifact the project provides for that version, i.e. the metadata MUST be consistent across all sdists and wheels for a project release to avoid having to read every file to find variances. It also avoids having to declare this field as dynamic in an sdist due to the import names varying across wheels.

The names provided MUST be one of the following:

  • Highest-level, regular packages
  • Top-level modules
  • The submodules and regular packages within implicit namespace packages

provided by the project. This means the vast majority of projects will need only a single Import-Name entry, representing the top-level, regular package the project provides. But it also allows projects contributing to an implicit namespace package to differentiate among themselves (e.g., it avoids all projects contributing to the azure namespace via an implicit namespace package having azure as their entry for Import-Name, and instead gives a more accurate entry like azure.mgmt.search).

The names provided in Import-Name MUST NOT be filtered based on what is considered private to the project, i.e. it must be exhaustive for names that an import statement would succeed in using. This is because even “private” names can be imported by anyone and can “take up space” in the namespace of the environment.

Build back-ends SHOULD set Import-Name on behalf of users when they can infer the import names a project would provide.
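As an illustration only (the helper below is hypothetical and not part of any back-end’s API), inferring these names for a conventional src/ layout under default import semantics might look roughly like this:

from pathlib import Path

def infer_import_names(src_dir: Path) -> list[str]:
    """Guess a project's Import-Name entries from a conventional src/ layout (sketch only)."""
    names = []
    for entry in sorted(src_dir.iterdir()):
        if entry.is_dir() and (entry / "__init__.py").exists():
            names.append(entry.name)   # a highest-level, regular package
        elif entry.suffix == ".py":
            names.append(entry.stem)   # a top-level module
        # Directories lacking __init__.py are implicit namespace packages; a
        # real back-end would recurse into them to record the regular packages
        # and submodules the project owns (e.g. azure.mgmt.search).
    return names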

Examples

In httpx 0.28.1 there would be only a single entry for the httpx package, as it’s a regular package and there are no other regular packages or modules at the top level of the project.

In pytest 8.3.5 there would be 3 entries:

  1. _pytest (a top-level, regular package)
  2. py (top-level module)
  3. pytest (a top-level, regular package)

In azure-mgmt-search 9.1.0, there would be a single entry for azure.mgmt.search as azure and azure.mgmt are implicit namespace packages.
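To make the field concrete, the pytest example above could be rendered in the resulting METADATA roughly as follows (an illustrative excerpt; the surrounding fields are abbreviated):

Metadata-Version: 2.5
Name: pytest
Version: 8.3.5
Import-Name: _pytest
Import-Name: py
Import-Name: pytest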

Backwards Compatibility

As this is a new field for the core metadata and a new core metadata version, there should be no backwards compatibility concerns.

Security Implications

The information provided by build back-ends may not be accurate (either accidentally or on purpose), and so tools should NOT make security-related decisions based on the information provided in an Import-Name entry.

How to Teach This

Project authors should be taught that build back-ends can now record what namespaces their project provides. They should be told that if their project’s namespaces are not obvious from its file structure, they should specify the appropriate information manually. They should also be told to use the shortest names possible that appropriately capture what the project provides (i.e. what the specification requires to be recorded).

Users of projects don’t necessarily need to know about this new metadata. While they may be exposed to it via tooling, the details of where that data came from aren’t critical. It’s possible they may come across it if PyPI exposed it (e.g., listed the values from Import-Name and marked whether the file structure backed up the claims the metadata makes), but that still wouldn’t require users to know the technical details of this PEP. Users may need to learn that if their package leads to all of its submodules being listed, they may have wanted a regular package instead.

Reference Implementation

XXX

Rejected Ideas

Re-purpose the Provides field

Introduced in metadata version 1.1 and deprecated in 1.2, the Provides field was meant to record similar information, except it covered all names provided by a project instead of only the distinguishing names this PEP proposes. Based on that difference, and the fact that Provides is deprecated and thus could be ignored by preexisting code, the decision was made to go with a new field.

Name the field Namespace

While the term “namespace” is technically accurate from an import perspective, it could be confused with implicit namespace packages.

Open Issues

N/A

Acknowledgments

Thanks to Josh Cannon (no relation) for reviewing drafts of this PEP and providing feedback. Also thanks to everyone who participated in a previous discussion on this topic.

Copyright

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.

29 Likes

Seems like a reasonable idea.

I apologise, I can’t work out how I can make this proposal as interesting for you as the lockfile one was. I hope you don’t mind :slightly_smiling_face:

8 Likes

My main question is why is metadata better than scraping and keeping the list (even putting it on PyPI, if that’s critical for people to trust it)? That’s still going to be required, even with the metadata, and honestly the difference is just the download size when updating the list. And I bet if someone puts up a list then that’ll attract most of the downloads - nobody really wants to be scraping PyPI if they don’t have to.

The main thing I can see this metadata being used for is for PyPI to reject your upload if the contents of your wheel don’t match. Obviously we can’t just allow the metadata to vary from reality, but if we’re going to check that at any point then that’s also a perfectly good point to collect the data, and no metadata is required.

But if there’s general consensus that having the metadata is (somehow) better than reading the top level names in the package, the details of the proposal seem fine to me.

3 Likes

Because I like this proposal and can’t resist the urge to bikeshed, I’ll suggest Import-Path instead. To me import flufl.i18n provides a path through the import system similar to a file system path, rather than specifically a name. There’s also the fact that the imported module can be bound to a different name[1] through import ... as.

OTOH, if you think Import-Path is easily confused with sys.path which has a different purpose, or if you think Import-Name rhymes better with the documentation for the first argument to importlib.import_module() or __import__() then I won’t argue too much.

One of the things which always seems unfortunate is the distinction between distribution package and import path in importlib.metadata. For example:

# create and enter a venv
(venv) $ pip install --quiet pillow
(venv) $ python
>>> from importlib.metadata import *
>>> import PIL
>>> version('PIL')
Traceback (most recent call last):
...
importlib.metadata.PackageNotFoundError: No package metadata was found for PIL
>>> version('pillow')
'11.2.1'

That makes me sad! I know we have packages_distributions(), and I’m not sure whether your proposal would help, but I am wondering if you’ve thought about how this new metadata would be exposed at runtime to Python.
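For reference, the reverse mapping is roughly what packages_distributions() computes today by scanning installed distributions, so in the venv from above I’d expect something like:

>>> from importlib.metadata import packages_distributions
>>> packages_distributions()["PIL"]
['pillow']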


  1. in the current namespace ↩︎

2 Likes

There’s no way to specify namespaces manually, though, right? I take it from the text and from the lack of a new project.<whatever> section that there’s no way to specify Import-Name (Import-Path) and it’s only a backend thing?

What about SDists vs. wheels? I see no mention of Dynamic here, but a backend might not know what will be created during the wheel step when it’s making the SDist. For scikit-build-core, Python package discovery is automatic, but a user can write out a top-level compiled module or even a top-level folder during the wheel build step. The backend can’t actually know that it will not be Dynamic without some user specification. Once the wheel is made, though, it could be produced reliably by any backend.

I agree that this is a big problem and as such I’m in support of this as a motivation. Overall the proposal seems like a good idea. :slight_smile: However, as usual I’d like to think a bit about what we see as an overall solution to this problem and how this proposal fits into that. Basically what I mean here is that if some packages provide this and some don’t, and/or some provide incorrect data, then in many of the situations where we don’t currently have a solution, we still won’t have a solution, and thus everyone who needs this information will still have to have all the other hacks they currently have, plus this one.

In particular I’m uncertain about how useful this metadata is without a concurrent effort to provide it on PyPI in a way that addresses the use cases mentioned in the proposal. For instance, as you mention, some important use cases actually want the reverse lookup of “what package provides this name” rather than “what names does this package provide”; and as you also mention, a current painful solution is to download all the wheels in the world and see what names they provide. Now with this metadata the solution will be… query PyPI for the metadata of every package and see what names they provide? There is no question that that is better, but the practical improvement for most users will be limited unless PyPI is actually going to provide the reverse lookup.

Maybe more basic is that if PyPI isn’t going to actually verify this (via some kind of post-upload processing) then again the utility is diminished. This gets into a whole host of other concerns I have about whether it’s possible to get to where we want to be without having PyPI do a lot more “gatekeeping” than it currently does, but, well, it’s something I think is worth thinking about when thinking about this proposal.

So I agree with @steve.dower that “what is in the package metadata” may not be the important question. Like maybe the more important questions are “is there something that can tell me what names a given package provides and tell me what packages provide a given name? and is that something PyPI? and does PyPI check that it’s giving me the correct info?” The metadata proposal itself seems fine but I’m not sure how much we gain by adding this to the package metadata.

3 Likes

I don’t have strong opinions about the implementation but I want to say I would be interested in having that feature.

To give a practical example, I’m working with a lot of people in data science. They have the tendency to use conda environments which they often reuse and which do not differentiate properly between development and production dependencies. As a result, we built a tool that goes into each environment and looks at what dependencies are used or not in a project (given they’re not a requirement of a direct dependency). This is done by going through the site-packages of the environment and reading the AST of the project to find the import statements.
Since there’s no clear mapping between a package and its import name, we had to build (and maintain) a dictionary of package names vs import names.
Quite tedious and this PEP would fix this easily.
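For the curious, the scan is roughly the following kind of AST walk (a simplified sketch, not our actual tool):

import ast
from pathlib import Path

def top_level_imports(source_file: Path) -> set[str]:
    """Collect the top-level names referenced by a file's import statements."""
    tree = ast.parse(source_file.read_text(encoding="utf-8"))
    names: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names

Mapping those names back to the projects that provide them is exactly the part that currently needs the hand-maintained dictionary.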

Another issue is that it’s not possible to know the import name of a package without installing it first. So anyone trying to do something similar to what I mentioned has to hit PyPI and download packages when only accessing the metadata would’ve been sufficient. Here’s a relevant issue under the deptry project.

This is less of a problem now that uv is out there though, because it’s much easier to separate the dependencies of a project’s lifecycle, but I don’t think relying on an external tool is the way to go.

5 Likes

What is the security story for this? Let’s say someone uploads a package named djangho that declares django in its Import-Name and also installs a malicious django package. If a user types import django in their IDE without having installed Django into the Python environment, would the IDE helpfully tell them:

The package django couldn’t be found in this environment, but it can be installed using pip install djangho

5 Likes

In that case I’d imagine the editor would suggest the options (likely ordered by popularity).

Because there is a world where djangho is a legitimate package as well. In which case it is correct to suggest it.

As a cherry-picked example I automatically publish dozens of botocore packages, that are the upstream botocore package, but sharded to a single package. Same name(s), but legitimate.

That’s assuming there are other options. But what if the Django project hasn’t uploaded a distribution with Import-Name metadata, yet?

Please re-read. I’m talking about package djangho publishing an Import-Name: django, with a malicious installable payload.

That sounds… weird to do, if you’re not part of the botocore maintainers.

1 Like

That’s a problem once, though, right? As soon as django uploads their next release with the metadata the problem is solved.

I understood. I’m counterpointing with “and what about djangho, which isn’t malicious and is legitimate?” Is it a bit uncouth? Sure. Is it forbidden? Nah.


I hear you though, I think in a more organized world this wouldn’t have been a problem to start. But here we are.

1 Like

I think this proposal would be great and is a lot better than the current state.

I would like to hear if PyPI would validate this metadata. Otherwise, as others have mentioned, the value is diminished because it can never truly be relied upon. But I acknowledge things need to start somewhere.

1 Like

How would validating work for packages like setuptools which adds its setuptools/_vendor directory to sys.path? The vendored packages would need to be declared as top level imports but they would appear to just be submodules to any kind of static analysis.

Looks like a very solid, useful, extra piece of metadata, no complaints about how it is declared or specified from me.

While in an ideal world we would also have indexes validate this, I don’t think it’s possible without significant changes to the import system to restrict the behavior more. I’m fine with the note about security impact and that it shouldn’t be relied upon for security purposes. There are a lot of ways to modify module search path, forcing indexes to detect all of them accurately seems like a challenge not worth requiring. There’s also nothing here forbidding this from being validated by indexes in the future if someone thinks this is valuable enough to work through all the possible ways this can happen, though I’d warn anyone thinking this is reasonably solvable in python that it’s possible to modify module search path from native extension code. This would be somewhat strange, but it’s possible.

1 Like

Nice proposal! Can we use this metadata to warn or error on cases where packages would overwrite each others’ modules? We had several cases where user imports were not working, and it turned out this was because two installed packages occupied the same directories (we had reports for the different opencv distributions, but this happens with other packages, too).

4 Likes

It does not seem like the change proposed in this PEP is necessary for installers to warn when they are about to overwrite files or directories. Or maybe this warning could happen at (or right after) dependency-resolution time, and that could indeed be an improvement.

2 Likes

It’s been a long time coming, great job finally taking a crack at solving this issue!

Can you please be specific here and state “implicit namespace packages” i.e. PEP 420? I understand the other mechanisms are considered legacy but it’s possible some new mechanism comes along in future.

Does this also apply to top-level non-.py modules, i.e. extension modules? For example, Mypyc under the default configuration produces a top-level shared library; the latest version of Black has a file 30fcd23745efe32ce681__mypyc.cpython-313-x86_64-linux-gnu.so at the root of the wheel: black-25.1.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl | black | PyPI Browser

1 Like

I’m at least at a +0, but I’m with @steve.dower in wanting to see the full PEP explaining the limitations of the status quo more clearly.

For wheels and installed packages, this info can be inferred from the RECORD files (hence packages_distributions being feasible; a rough sketch of that inference follows the list below), so the main advantages I see myself are:

  • being able to do the same for sdists without building them first
  • potentially closing some gaps around editable install support when calculating the mapping at runtime
  • index servers being able to publish this info via the main metadata API, without having to go poking at RECORD files, and without having to define separate inferred metadata sharing APIs
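For reference, that RECORD-based inference is roughly the following kind of scan over an installed distribution (a simplified sketch using the existing importlib.metadata API, not exactly what packages_distributions does):

from importlib.metadata import distribution

def record_inferred_names(dist_name: str) -> set[str]:
    """Approximate a distribution's top-level import names from its installed files."""
    names: set[str] = set()
    for path in distribution(dist_name).files or []:
        top = path.parts[0]
        if top.endswith((".dist-info", ".data")) or top == "..":
            continue  # skip metadata directories and scripts installed outside site-packages
        names.add(top.split(".")[0])  # "pkg/..." -> "pkg", "mod.py" -> "mod"
    return names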

I’m not too worried about the “What if the claimed metadata is wrong?” case, since there’s no general answer to that - it depends on the specific use cases and their associated threat models.

I do think that concern means that there will still be reasons beyond supporting older metadata versions for some tools to go poking at RECORD files and even directly at project release archive file listings.

Are you talking about dependency groups? This is now standard and implemented in many tools :slight_smile:

pip is easy to bootstrap thanks to ensurepip in the stdlib, but is otherwise an external tool (from Python) as much as uv

1 Like

So, how should it be used then? Is there any use case that is not exploitable by a malicious package pretending to provide another popular package?

Either it can be trusted, or it cannot. If it cannot, what is the user even supposed to do with it?

2 Likes