Record the top-level names of a wheel in `METADATA`?

Why is that an edge case? If installing the wheel would introduce a top-level namespace with nothing else installed, then list the name.

That’s specifically why I said “repeatable” (“multiple use” in Core metadata specifications - Python Packaging User Guide); just list all of them.

Probably build back-ends.

To be honest, I was more concerned about whether the metadata made sense before I worried about the UX. But for an initial proposal:

  1. Introduce an import-name field to pyproject.toml; an array of strings.
  2. If import-name is not supplied, take name, normalize it, and use that as the one-item array value for import-name (whether METADATA would still get an explicit Import-Name entry in this case or it would just be the implicit assumption, I don’t know).
  3. If import-name is supplied, use its value (name is not used).
  4. If it’s dynamic then the back-end will calculate the value.

I think this covers the common case of “normalized name is what people import”, lets a back-end be fancy if desired, and lets projects like pywin32 statically list all of their names.
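For illustration, here is a minimal sketch of the default in step 2, assuming the rule is “PEP 503 normalisation with runs of `-`, `_` and `.` collapsed to a single underscore” (the exact rule would need to be nailed down in a spec):

```python
import re

def default_import_name(project_name: str) -> list[str]:
    # Sketch only: derive the fallback import-name value from the
    # project name by normalising (PEP 503 style) and using an
    # underscore as the separator, since "-" and "." are not valid
    # in import names.
    normalized = re.sub(r"[-_.]+", "_", project_name).lower()
    return [normalized]

print(default_import_name("pywin32"))            # ['pywin32']
print(default_import_name("typing-extensions"))  # ['typing_extensions']
```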

As mentioned above, I don’t think multiple wheels listing the namespace as what is exposed would be useful, and I don’t think we can in all cases infer the subpackage within that namespace.

pywin32 was actually the example I had in mind of a project that exposes many things and whose listing would therefore not be as useful to users; or, in the case of Black, as I mentioned, 2 of the 3 packages installed should not be exposed to users / are not meant for direct usage.

So are you saying the info isn’t useful at all, or just in this specific instance?

I’m personally not bothered by submodules/subpackages.

I would still find this information useful for my use case.

But they are still exposed, so whether or not users are meant to import them, the names still exist.


Your use case mentioned in the original post would require subpackages for certain packages, from the trivial azure.cosmos to the difficult sphinxcontrib.matlab.

Interesting question! I suppose if someone finds this information useful then I wouldn’t have much motivation to be opposed other than technical imposition on build backends.

What is your use case exactly?

I wouldn’t say it’s “required”, but it would be helped. I also realize that top-level names are way easier to list accurately than submodules, hence my specific scoping.

I don’t need a 100% solution, but something better than assuming “package name == import name”.

Helping users resolve import errors. Consider a beginner who copied some code from the internet and didn’t bother installing the appropriate dependencies. Anything we can do to help them solve that would be great. This ties into A look into workflow tools - package management in the Python extension for VS Code where we are trying to get things in VS Code so that beginners just have a better chance of succeeding and get them following good practices. So in this instance we could have a code action in VS Code that says, “install missing dependency from PyPI”, which could provide a list of packages on PyPI that match that top-level name (which I suspect is usually of length 1, but if it’s more, then they can pick the best fit). And if they find a package that seems to work and accept it, we can then save the pinned version to a requirements.txt file (until we can install only dependencies from pyproject.toml), making their code easier to share.
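To make that concrete, here is a minimal sketch of the editor-side step, assuming all we need from the failed import is its top-level name (the PyPI lookup itself is left out, since there is currently no standard index API for searching by import name):

```python
def missing_top_level(exc: ModuleNotFoundError) -> str:
    # exc.name is the dotted module that failed to import, e.g.
    # "azure.cosmos"; its first component is the top-level name we
    # would match against whatever the metadata ends up exposing.
    return exc.name.split(".")[0]

try:
    import yaml  # noqa: F401
except ModuleNotFoundError as exc:
    # Prints "yaml" if PyYAML is absent; PyYAML would be the candidate project.
    print(missing_top_level(exc))
```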

I can also see this helping with code analysis where dependencies aren’t listed somewhere.


Assuming that we are not talking about adding an API to the package index for searching for top-level/importable names, is it the case (for the ideas mentioned in the thread) that there would still be some guessing involved in the process when creating such tools? (E.g., the tool would have to guess a potential list of candidates first[1], then use an index API to retrieve metadata and filter the initial list to reduce the number of options, and finally the user would have to pick one.)

If I understood things correctly and these assumptions hold, it seems that we can kind of implement some of this feature nowadays, by skipping the filtering step and relying on the user’s best judgment for choosing which package to install. I also suspect that the initial “guessed list” would have length 1 most of the time, so maybe the result would not be too far off? How much better would the workflow be (and in which circumstances) if we expose a new METADATA field?

In the case that we are interested in exploring an API for searching top-level names in the index (which I don’t know is viable), and/or developing code analysis tools that work with distributions already installed on the user’s machine, does it make sense to add a new field to METADATA, or instead to standardise and improve practices that we already have in place (e.g. build on top of the top-level.txt file)?

The pros of standardising top-level.txt and improving the availability of its information are that a good percentage of packages on PyPI already have top-level.txt and that it is something other APIs already make use of (e.g. importlib-metadata). This would mean that at least we would have some initial/partial backward compatibility.
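For what it’s worth, reading the file from an installed distribution already works today via importlib.metadata; a quick sketch (pip is used here only as an example of a setuptools-built project that ships the file; many wheels don’t):

```python
from importlib.metadata import distribution

# Sketch: read top_level.txt from an installed distribution, if present.
content = distribution("pip").read_text("top_level.txt")
if content is None:
    print("no top_level.txt in this distribution")
else:
    print(content.split())  # e.g. ['pip']
```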

Alternatively (and I don’t know how viable it would be), we could also consider how the information in RECORD could be exposed/queried without having to download the package…


  1. Which is somewhat non-obvious to make exhaustive, e.g. Pillow and the namespace packages mentioned before. ↩︎

Even if we choose not to address this by standardising top-level.txt, we cannot ignore the prior art here. I don’t know how reliable top-level.txt is in practice - @abravalheri I believe it’s a setuptools-specific feature at the moment, can you explain how it works and what limitations it has?

But for those projects that do have a top-level.txt, what (apart from “it’s non-standard”) makes it unsuitable here? If the answer is simply “PyPI doesn’t let you search on what’s in that file”, surely that’s just a feature request for Warehouse?

For anyone else who has never heard of top_level.txt until now, I found https://svn.python.org/projects/sandbox/trunk/setuptools/doc/formats.txt as a reference.

Typically, sure, but we all know users don’t always care about “correct most of the time” when it’s holding up their work. :wink: I mean, we will very likely do this, maybe with a handful of very common special cases (looking at you, Pillow and scikit-learn), but obviously having more accurate information to cover more cases would be appreciated.

I wouldn’t call it unsuitable, but METADATA has the perk of being available via PEP 658 without downloading the whole sdist or wheel.


Let’s not have PEP 658 be a reason for dumping everything into Metadata…

I’m not saying this shouldn’t go into Metadata, but I think it needs a good reason beyond this. PyPI can index top-level.txt at the same time it extracts metadata, and there is precedent in direct_url.json for using separate files, so I think we should look at each case on its merits. For me, existing usage is a significant advantage (it’s already supported by importlib.metadata, for example).


Sorry, a small correction on the information I gave previously: the correct file name is top_level.txt (with the “underscore” character, like entry_points.txt).

It should be fairly reliable and it seems it has been produced since setuptools 0.5a9, so a good chunk of published packages should have it. Its implementation does not seem to have varied significantly over the years (I can only see formatting changes), so it should also be fairly stable.

setuptools gets its value by iterating through its packages, py_modules and ext_modules configuration fields (either explicitly given by the user or “auto-discovered”) and yielding only the top-level names.
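A simplified sketch of that derivation (not the actual setuptools code; extension names are passed as plain strings here for brevity):

```python
def derive_top_level(packages, py_modules, ext_modules):
    # Keep only the first dotted component of every configured
    # package, module and extension name, de-duplicated.
    names = {dotted.split(".")[0] for dotted in [*packages, *py_modules, *ext_modules]}
    return sorted(names)

print(derive_top_level(
    packages=["pkg", "pkg.sub"],
    py_modules=["helper"],
    ext_modules=["pkg._speedups"],
))  # ['helper', 'pkg']
```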

As far as I know, the limitations would be the following:

  • No nested module/package is listed (so in the case that multiple distributions share the same namespace, their top_level.txt files would be the same)[1]
  • If any plugin or build customisation bypasses setuptools mechanisms to add files directly to the distribution archives, they will not be considered when deriving top_level.txt
  • If a regular build is customised to include a .pth file to inject modules from another location, it is very likely those are also not considered when deriving top_level.txt.

importlib.metadata also prefers to use top_level.txt information when it is available (instead of processing RECORD), when creating the importlib.metadata.packages_distributions mapping.
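For reference, that mapping looks roughly like this on Python 3.10+ (the output obviously depends on what is installed):

```python
from importlib.metadata import packages_distributions

# Maps importable top-level names to the distributions providing them,
# using top_level.txt where available and falling back to RECORD.
mapping = packages_distributions()
print(mapping.get("yaml"))           # e.g. ['PyYAML'], if installed
print(mapping.get("pkg_resources"))  # e.g. ['setuptools']
```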


  1. If fully matching the name of the module being imported is desirable, maybe getting the RECORD file from the index could be considered. ↩︎

That sounds pretty good, and I’d support standardising this so that tools can reasonably expect this to be present regardless of build backend. (I suppose we shouldn’t be surprised at the possibility that setuptools solved this problem years ago - there were a lot of good ideas implemented in setuptools before PEP 517 that aren’t immediately obvious when you’re writing a new backend from scratch.)


Is the file produced for sdists, or only for wheels?

After inspecting the following setuptools files, I would say that top_level.txt should be available for sdists too (although in a different location, i.e. inside the *.egg-info directory that setuptools includes in sdists).

I looked into a few examples as evidence (via inspector), and the file is indeed available in: build-0.9.0.tar.gz, pip-23.2.tar.gz, boto3-1.28.5.tar.gz, python-dateutil-2.8.2.tar.gz, matplotlib-3.7.2.tar.gz, pandas-2.0.3.tar.gz and others.
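If it helps, here is a rough sketch of pulling the file out of such an sdist; the *.egg-info location is a setuptools convention rather than a guarantee, hence the None fallback:

```python
import tarfile

def sdist_top_level(sdist_path: str) -> list[str] | None:
    # Look for <something>.egg-info/top_level.txt inside the sdist tarball.
    with tarfile.open(sdist_path) as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".egg-info/top_level.txt"):
                return tar.extractfile(member).read().decode().split()
    return None

print(sdist_top_level("build-0.9.0.tar.gz"))  # e.g. ['build']
```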

However, it is possible to prevent the content of the .egg-info folder from being included in the sdist with customisations (e.g. via MANIFEST.in or by changing command implementations via cmdclass)[1].


  1. Probably the reason why I cannot find it in numpy-1.25.0.tar.gz ↩︎

The sdist spec suggests that it should be in the PKG-INFO directory. But as that spec only applies for sdists which use Metadata 2.2, that’s more of a “what things will be like in the future” comment. I’d say that the general principle (at least as applied by pip) is that nothing in a[1] sdist is currently reliable, so YMMV.

@abravalheri’s answer is more practical right now, though :wink:


  1. Metadata < 2.2 ↩︎

Are we happy enough with this idea to want a PEP for it?

With what idea precisely? Standardising top_level.txt, or adding a new metadata field? Personally, I’m OK with the former, but not really comfortable with the latter at this point.

I personally feel like adding it to METADATA makes sense; I don’t see a particular reason to make it a separate file. It’s not like it’s going to be particularly large, etc.


My preference is for the latter, but I will take the former.

Metadata arguably already has/had a place for this information: the Provides field - added in metadata version 1.1, deprecated in 1.2 (2010), but never actually removed as far as I can see.

Flit has been putting the top-level import name in Provides for ages. With namespace packages, it should supply a name like namespace.my_bit, i.e. the subpackage that this distribution provides. :slightly_smiling_face:

From PEP 314:

Each entry contains a string describing a package or module that will be provided by this package once it is installed. These strings should match the ones used in Requirements fields.

And the Requires field that it’s referencing:

The format of a requirement string is identical to that of a module or package name usable with the ‘import’ statement, optionally followed by a version declaration within parentheses.

Strictly that doesn’t say that it is an import name, only that it’s in the same format (plus a version number), but all the examples are import names, and it’s the obvious thing to do with it.

In practical terms, I guess it might be easier to specify a new field than to revive the old one and deal with whatever people might have put there over the last 20 years. But I thought I’d give a gentle plug for making use of what we’ve already got. :wink:
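For anyone curious what that looks like in practice, reading the field back from an installed distribution is straightforward; flit_core is used below on the assumption that it is installed and carries the field, since it is built with Flit:

```python
from importlib.metadata import metadata

# Provides is deprecated but still permitted, so get_all() may return
# None for distributions whose build backend never writes it.
print(metadata("flit_core").get_all("Provides"))  # e.g. ['flit_core']
```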
