Record the top-level names of a wheel in `METADATA`?

I had another instance where I wished I could map a project name to the modules and packages it provides. I also suspect people come to PyPI looking for a project by its import name and are not always successful (look at sklearn · PyPI as a redirect example of this), so having a reverse lookup at pypi.org might be useful.

Would it make sense for `METADATA` to have a repeatable Import-Name field that lists a top-level name the wheel provides? I’m fine with this being optional, but for those that provide it I can see it being useful and a relatively cheap bit of metadata to provide (obviously requiring it would be nice :grin:). This could also be validated against the wheel contents, so it shouldn’t inherently lead to folks pulling out The massive bug at the heart of the npm ecosystem against this.

12 Likes

Would the top-level names be sufficient? I’m thinking about packages that add to a namespace package. The top-level would be the same for all of them, it’s the second level in that case that would be relevant.

2 Likes

Would it be easy to pull it out of RECORD, or is there some nasty catch there I’m not thinking of?

I guess it could be inaccurate in cases where people are doing .pth file crimes.
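For illustration, a minimal sketch of that extraction, as a heuristic over the file paths listed in RECORD (the helper name and exact filtering are mine, and the `.pth` caveat above applies):

```python
def top_level_from_record(record_paths):
    """Infer the top-level import names a wheel takes up from its RECORD
    file paths. A heuristic only: .pth files and import hooks can make
    this inaccurate, as noted in the thread."""
    names = set()
    for path in record_paths:
        first = path.split("/", 1)[0]
        # Metadata directories never become import names.
        if first.endswith((".dist-info", ".data")):
            continue
        if "/" in path:
            names.add(first)           # a package directory
        elif first.endswith(".py"):
            names.add(first[:-3])      # a top-level single-file module
    return sorted(names)
```

For example, a RECORD containing `requests/__init__.py` and `six.py` would yield `["requests", "six"]`.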

1 Like

It does seem like it could be useful, though I’m not sure about requiring it. Some applications are distributed on PyPI, not just libraries, unless you’re expecting them to list the names of the import packages their CLI entrypoint hooks are found in or something…?

For what I’m thinking about, yes. Even with namespace packages, they take up a top-level name when installed.

Easy *once* you have RECORD. With METADATA you wouldn’t have to download the wheel, thanks to PEP 658 and friends.

That seems reasonable. What I want to capture is which names are taken up when you install a wheel. That somewhat ties into @davidism’s point about namespace packages; multiple projects may contribute to a name, but installing just one of them still makes that import name come into existence.

1 Like

I would avoid pulling things out of RECORD for this; while it will likely do the right thing the majority of the time, the edge cases are what will get you.

In theory the build backends could probably infer this, and provide a way to override it when that inference isn’t correct-- but that feels like something that should happen in the build backends, not in the thing consuming the artifacts.

I wonder if there are other “this thing provides X” facts that make sense to expose in the metadata? The main one I can think of is commands; similar to imports, you can get most of the way by looking at RECORD or entry_points.txt, but there are enough edge cases that putting it declaratively in METADATA seems like a solid idea (and likewise, build backends can probably auto-populate it in most cases).

I think this is not a good idea because of all the edge cases like namespace packages, and the fact that projects can and do ship multiple top-level packages, e.g. Black.

In broad terms this seems reasonable. I’m not sure it can always be validated against the wheel contents, though (things like .pth files or import hooks would mess it up instantly), so we have an awkward situation where it’s sometimes safe, and sometimes not.

Also, are you intending that the project author supplies this data? If so, it needs to be added to pyproject.toml.

This feels very much like a classic 80-20[1] situation where it’s useful, safe, and simple for almost all cases, but the exceptions are the problem. I agree it would be useful, but we do need some way of handling those edge cases (especially if the edge case can be exploited to cause user errors).


  1. More likely, 95-5 or even 99-1… ↩︎

Here’s a proposal: the build backend finds all __init__.py files in the distribution, trivially converts them to a list of (sub)packages, then culls all the deeper subpackages until only the top packages remain (all at the same depth) [1]. This should handle namespaces.
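A minimal sketch of that culling step, assuming paths relative to the distribution root (a real backend would also need to handle single-file modules and other layouts):

```python
from pathlib import PurePosixPath

def top_packages(init_paths):
    """Given the paths of every __init__.py in a distribution, keep only
    the packages that have no package ancestor. Namespace directories
    (which have no __init__.py of their own) survive as dotted prefixes
    of their children, which is how this sketch handles namespaces."""
    pkgs = {PurePosixPath(p).parent for p in init_paths}
    return sorted(
        str(p).replace("/", ".")
        for p in pkgs
        # Cull any package whose ancestor is itself a package.
        if not any(parent in pkgs for parent in p.parents)
    )
```

So a distribution shipping `a/__init__.py`, `a/b/__init__.py`, and `ns/pkg/__init__.py` would report `a` and `ns.pkg`.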

What are the special cases people are thinking of specifically?


  1. or the top N depths ↩︎

1 Like

FWIW, I wrote Figuring out the top-level importable names from a wheel · GitHub some time ago. That doesn’t use RECORD, instead relying on the file listing from the zip itself.

3 Likes

I’d say pywin32 is a classic example: there are a ton of things you can import there, none of them named pywin32. Just yesterday I got sent a complaint that things were broken on that front, here with the reverse problem: they knew the module they wanted, but not the package to install.

```
C:\Users\XXXX>pip install win32com.client
ERROR: Could not find a version that satisfies the requirement win32com.client (from versions: none)
ERROR: No matching distribution found for win32com.client
```

2 Likes

… and if you look at the top-level.txt file in the pywin32 distribution, it bears very little resemblance to the contents of the wheel (thanks to the .pth file supplied in the wheel).

IMO pywin32 is a great example of both the benefits and the problems with this proposal. @brettcannon if you can say how you’d expect pywin32 to be handled, that would be a good test of the design.

1 Like

Why is that an edge case? If installing the wheel would introduce a top-level name even with nothing else installed, then list the name.

That’s specifically why I said “repeatable” (“multiple use” in Core metadata specifications - Python Packaging User Guide); just list all of them.

Probably build back-ends.

To be honest, I was more concerned about whether the metadata made sense before I worried about the UX. But for an initial proposal:

  1. Introduce an import-name field in pyproject.toml: an array of strings.
  2. If import-name is not supplied, take name, normalize it, and use that as the one-item array value for import-name (whether METADATA still gets an explicit Import-Name entry in this instance or it is left as the implicit assumption, I don’t know).
  3. If import-name is supplied, use its value (name is not used).
  4. If it’s dynamic, then the back-end will calculate the value.

I think this covers the common case of “normalized name is what people import”, lets a back-end be fancy if desired, and lets projects like pywin32 statically list all of their names.
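A sketch of the fallback in steps 2–3, assuming (my reading, not stated in the proposal) that “normalize” means PEP 503-style normalization with hyphens mapped to underscores so the result is a valid import name:

```python
import re

def default_import_names(project_name, import_names=None):
    """Sketch of the proposed fallback: if import-name isn't supplied in
    pyproject.toml, derive a one-item list from the project name. The
    normalization here (PEP 503 plus '-' -> '_') is an assumption; the
    proposal itself just says "normalize it"."""
    if import_names is not None:
        return list(import_names)  # explicit import-name wins; name is unused
    normalized = re.sub(r"[-_.]+", "-", project_name).lower()
    return [normalized.replace("-", "_")]
```

Under this reading, `My-Project` would default to `["my_project"]`, while pywin32 would override the default and statically list its many names.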

As mentioned above, I don’t think multiple wheels listing the namespace as what they expose would be useful, and I don’t think we can infer the subpackage within that namespace in all cases.

pywin32 was actually the example I had in mind of projects that expose many things, and therefore the field would not be as useful to users; or, in the case of Black as I mentioned, 2 of the 3 packages installed are not meant for direct usage and should not be exposed to users.

So are you saying the info isn’t useful at all, or just in this specific instance?

I’m personally not bothered by submodules/subpackages.

I would still find this information useful for my use case.

But they are still exposed, so whether users are meant to import them or not, the names still exist.

1 Like

Your use case mentioned in the original post would require subpackages for certain packages, from the trivial azure.cosmos to the difficult sphinxcontrib.matlab.

Interesting question! I suppose if someone finds this information useful then I wouldn’t have much motivation to be opposed other than technical imposition on build backends.

What is your use case exactly?

I wouldn’t say it’s “required”, but it would be helped. I also realize that top-level names are way easier to list accurately than submodules, hence my specific scoping.

I don’t need a 100% solution, but something better than assuming “package name == import name”.

Helping users resolve import errors. Consider a beginner who copied some code from the internet and didn’t bother installing the appropriate dependencies. Anything we can do to help them solve that would be great. This ties into A look into workflow tools - package management in the Python extension for VS Code, where we are trying to get things into VS Code so that beginners have a better chance of succeeding and are nudged towards good practices. So in this instance we could have a code action in VS Code that says, “install missing dependency from PyPI”, which would provide a list of packages on PyPI that match that top-level name (a list I suspect is usually of length 1, but if there’s more than one they can pick the best fit). And if they find a package that seems to work and accept it, we can then save the pinned version to a requirements.txt file (until we can install dependencies straight from pyproject.toml), making their code easier to share.

I can also see this helping with code analysis where dependencies aren’t listed somewhere.

5 Likes

Assuming that we are not talking about adding an API to the package index for searching top-level/importable names: is it the case (for the ideas mentioned in the thread) that there would still be some guessing involved when creating such tools? (E.g., the tool would have to guess a potential list of candidates first[1], then use an index API to retrieve metadata and filter the initial list to reduce the number of options, and finally the user would have to pick one.)

If I understood things and these assumptions are correct, it seems that we can already implement some of this feature today by skipping the filtering step and relying on the user’s best judgement when choosing which package to install. I also suspect that the initial “guessed list” would have length 1 most of the time, so maybe the result would not be too far off? How much better would the workflow be (and in which circumstances) if we expose a new METADATA field?
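To make that guessing step concrete, here is one naive shape such a tool could take (entirely hypothetical: the candidate heuristics are mine, and the index filtering step is left out):

```python
def candidate_projects(import_name):
    """Generate candidate PyPI project names for a failed import, in the
    absence of a reverse-lookup index. Purely a guess list; a real tool
    would filter these against index metadata and let the user choose.
    The 'python-' prefix is just one common naming convention."""
    seen, candidates = set(), []
    for cand in (
        import_name,
        import_name.replace("_", "-"),
        "python-" + import_name,
    ):
        if cand not in seen:
            seen.add(cand)
            candidates.append(cand)
    return candidates
```

For `dateutil` this would guess `dateutil` and `python-dateutil` (the latter being the actual project), but it would miss cases like `sklearn`/`scikit-learn`, which is where a real Import-Name field would help.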

In the case we are interested in exploring an API for searching top-level names in the index (which I don’t know if it is viable), and/or developing code analysis tools that work with distributions that have already been installed at the user’s machine, does it make sense to add a new field to METADATA or instead standardise and improve practices that we already have in place (e.g. build on top of the top-level.txt file)?

The pros of standardising top-level.txt and improving the availability of its information are that a good percentage of packages on PyPI already have a top-level.txt, and that other APIs already make use of it (e.g. importlib-metadata). This would mean we would at least have some initial/partial backward compatibility.

Alternatively (and I don’t know how viable it would be), we could also consider how the information on RECORD could be exposed/queried without having to download the package…


  1. Which is somehow non-obvious to make exhaustive, e.g. Pillow and the namespace packages mentioned before. ↩︎

Even if we choose not to address this by standardising top-level.txt, we cannot ignore the prior art here. I don’t know how reliable top-level.txt is in practice - @abravalheri I believe it’s a setuptools-specific feature at the moment, can you explain how it works and what limitations it has?

But for those projects that do have a top-level.txt, what (apart from “it’s non-standard”) makes it unsuitable here? If the answer is simply “PyPI doesn’t let you search on what’s in that file”, surely that’s just a feature request for Warehouse?