Record the top-level names of a wheel in `METADATA`?

I think this would be a nice addition that would have a couple of additional benefits:

  • It could help clearly eliminate one class of typosquatting if utilized on PyPI, perhaps by automatically setting up the kinds of redirects you described for sckit-learn. Users are often more aware of the import name of a package than of the actual package name.
  • If this data were collected across packages, it could be used to create something like the command-not-found tool for imports, which suggests installation candidates for apt given an unknown command:
$ pyt

Command 'pyt' not found, did you mean:

  command 'py' from deb pythonpy
  command 'iyt' from deb python3-yt
  command 'pat' from deb dist
  command 'yt' from deb python3-yt
  command 'pyp' from deb pyp
  command 'pytr' from deb pytrainer

While it sounds tempting, you can’t really block “typosquatting” of import names or set up redirects. In fact there are very good reasons for multiple packages on PyPI to provide modules which use the same import names, so collisions over those should be expected. Any sort of search based on an import name should be prepared to return multiple matches. It will presumably be up to the user to decide which one they’re looking for in that circumstance.


Would this standard help more packages/installation methods provide a declaration of import names somewhere? I’m interested in this sort of thing, as I once wrote a personal hobby tool that depends on mapping imported names to distributions, currently using importlib.metadata.packages_distributions.

That seemed to work fine on the packages/installation methods I tried (including conda which is an important use case for me) but it’s not as satisfying to tell users “this should usually work but there’s no official reason to expect it to work for all of the packages you’re using”. AFAIK the RECORD file is purely optional.
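For reference, here’s a minimal sketch of the reverse lookup my tool does today (packages_distributions is the real stdlib API on Python 3.10+; everything around it is just illustrative):

from importlib.metadata import packages_distributions

# Maps top-level import names to the distributions providing them,
# built from the installed distributions' top_level.txt / RECORD.
mapping = packages_distributions()

# One import name can be provided by several distributions, so each
# value is a list, e.g. {"sklearn": ["scikit-learn"], ...}.
for module in ("sklearn", "yaml", "doesnotexist"):
    dists = mapping.get(module)
    if dists:
        print(f"{module} is provided by: {', '.join(dists)}")
    else:
        print(f"{module}: no installed distribution found")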


That depends on how the PEP is worded: whether it requires or merely suggests that the metadata be provided.

I like Thomas’ idea of reusing Provides.

By directory do you mean the directory that file is stored in?

I’m very +1 on this (mostly on the “reverse-lookup” part).


For Pantsbuild, we have to map import names to code (either third-party packages or first-party code).

Our approach is to first look up the name in a map of “known” packages whose import names don’t match their package names, with a fallback assumption that the module name matches the package name. Users can add their own mappings to help us fill in the blanks.
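Roughly, the lookup looks like this (a simplified sketch; the names here are illustrative, not Pants’ actual internals):

# Simplified sketch of the lookup described above; not Pants' real code.
KNOWN_MODULE_TO_PACKAGE = {
    "sklearn": "scikit-learn",
    "yaml": "PyYAML",
    "PIL": "Pillow",
}

# User-supplied overrides fill in the blanks.
USER_OVERRIDES: dict[str, str] = {}

def package_for_import(module: str) -> str:
    if module in USER_OVERRIDES:
        return USER_OVERRIDES[module]
    if module in KNOWN_MODULE_TO_PACKAGE:
        return KNOWN_MODULE_TO_PACKAGE[module]
    # Fallback: assume the package is named after the module.
    return module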

So a server we could ping and cache the result would be immensely helpful.


In another world, it’d be useful to map imports to packages as per the discussion in PEP 722 (specifically the thought experiment of what we could do if we already had this kind of mapping).


Out of curiosity, what approach do you take when a drop-in replacement package for an older abandoned project starts to gain in popularity (specifically in the case where both the old and new distribution packages provide the same import package name)? Does the official mapping get updated and all projects relying on that migrate at the same time, or do projects have the ability to override the mapping locally on a one-by-one basis?

Similarly, I’ve had some projects go through a rename, updating the distribution package name while keeping the original import package name (either entirely or as a backward-compatible alias), though I suppose those transitions are less problematic as long as there’s not a corresponding Python version requirement shift. Or do your mappings have the ability to vary by Python version, like how environment markers work?

I don’t think anyone is suggesting a one-to-one mapping. That’s not realistic given all the existing packages, and has the issues you mentioned. If multiple packages provide the same import name, a tool could list the options.

Yup, that one, although I haven’t heard that this has really been an issue for anyone (not to say it isn’t an issue, just that if it was, people weren’t vocal about it).

Yes, so far I’ve been assuming that tools which care to have a single result for the reverse mapping would simply prompt the user or require some other disambiguation signal when multiple results are returned.

Then I guess I don’t understand what you were asking about. If there’s a replacement package for an older abandoned package and they use the same name, they’ll both show up in the list. What needs to be updated?

I think the distinction here is that nobody should be relying on this lookup to install their dependencies, because these names don’t uniquely specify dependencies and likely never will. But it would still be helpful.

Sorry if I was unclear. I wasn’t asking about the proposal, I was asking Joshua how (or whether) the mapping system he described for Pantsbuild handles those specific situations, in case there are solutions to the choosing problem I hadn’t considered.

This forum doesn’t handle discussion threading very well, so it probably wasn’t obvious what I was responding to. I’m switching from my normal E-mail replying to using a Web browser to see if that helps at all.


To help move this along, here’s a poll to see where people sit.

(How) should we record the top-level module namespaces a package introduces?
  • top_level.txt file
  • The Provides field in METADATA
  • A new field in METADATA
  • Nowhere; this information isn’t useful enough
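For concreteness, the two METADATA-based options might look something like this in a scikit-learn wheel (Provides is the existing, long-deprecated Metadata 1.1 field; Import-Name is a made-up name for the new-field option):

Name: scikit-learn
Provides: sklearn

Name: scikit-learn
Import-Name: sklearn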

Which answer fits best for “I don’t think I care which field in METADATA, so long as it’s there” (because I can query that from PyPI for newer wheels)?

Probably “a new field”, but honestly I would take the combined votes for “a new field” and “The Provides field” as guidance that it belongs in METADATA, as compared to top_level.txt.

I would love if we had better handling of namespace packages than what top_level.txt does.
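For example, if I understand the setuptools behavior correctly, every distribution in the zope namespace ends up with the same single line in top_level.txt:

zope

so at that level you can’t tell whether a wheel provides zope.interface, zope.event, or something else entirely.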


I want to explain my thinking here, and in the context of my use case. That way others can agree or disagree or, better yet, correct my fallacies.

[Edit I was just repeating things I said above, so I’m removing it for the sake of brevity]

So being able to start generating the mapping of imports → packages would be (chef’s kiss). I somewhat know that @brettcannon has a similar-enough use case in mind for VS Code.

Now on to observations…

  • Having the new info in metadata means PEP 658 (oh man I WORSHIP PEP 658 at this point) makes it easy to scrape; see the sketch after this list
  • buuuuut this only helps for new uploads, so historical data will be missing (for at least a period of several years), meaning there is a reasonable demand for backporting
  • Buuuut I’m not sure what the expectation for that is. Should PyPI (et al) be modifying METADATA? (Can they even? I think changing the metadata hash is likely a bad idea.) Maybe we need another PEP to allow package hosts to expose a “backported and inferred” METADATA?
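To make the first point concrete, here’s roughly what that scraping could look like against the JSON simple API (PEP 691 plus PEP 658/714); a rough sketch with no error handling, not production code:

# Fetch a wheel's METADATA via PEP 658/714 without downloading the wheel.
import json
import urllib.request

def fetch_core_metadata(project: str) -> str | None:
    req = urllib.request.Request(
        f"https://pypi.org/simple/{project}/",
        headers={"Accept": "application/vnd.pypi.simple.v1+json"},
    )
    with urllib.request.urlopen(req) as resp:
        index = json.load(resp)
    for file in index["files"]:
        # A truthy "core-metadata" key (PEP 714; older responses spell it
        # "dist-info-metadata") means <file URL>.metadata is available.
        if file["filename"].endswith(".whl") and file.get("core-metadata"):
            with urllib.request.urlopen(file["url"] + ".metadata") as meta:
                return meta.read().decode()
    return None

print(fetch_core_metadata("scikit-learn"))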

+1 to the idea, but we may need to take it further in another PEP so we don’t have to sit on our thumbs for a few years. Or at least, that’s how I see it from my (maybe narrow?) worldview.


That’s true to an extent and for a while. But over time more projects will release new wheels and things will slowly backfill. And the projects that don’t are much less likely to be the ones people are using, so the lack of their metadata becomes less important.

PyPI cannot change the METADATA, it has to match what is inside the wheel itself.

I think I’m -1 on things where PyPI is asked to provide information that wasn’t provided by the projects themselves. Any sort of inferring process is going to have false positives, and that puts PyPI in a tricky place. We’ve also been bitten in the past by trying to “helpfully” change metadata for people and having them come back and get really mad at us for doing so.

It’s an unfortunate reality that anything new we add will have some kind of delayed implementation.

We could maybe implement something that would let people backfill for their own projects, some sort of METADATA patching solution? That would possibly solve a number of other long standing issues as well.


Seems like the best middle-ground scenario, although I’d still suggest a sidecar “backported METADATA”, so the provided METADATA stays written in stone forever. But ultimately, as the client, I’m happy with anything I can query without downloading :smile: