I think this would be a nice addition that would have a couple of additional benefits:
It could help clearly eliminate one class of typosquatting if utilized on PyPI, perhaps by automatically setting up the kinds of redirects you described for sckit-learn. Users are often more aware of a package's import name than its actual distribution name.
If this data were collected across packages, it could be used to create something like the command-not-found tool for imports, which suggests installation candidates for apt given a mistyped command:
Command 'pyt' not found, did you mean:
command 'py' from deb pythonpy
command 'iyt' from deb python3-yt
command 'pat' from deb dist
command 'yt' from deb python3-yt
command 'pyp' from deb pyp
command 'pytr' from deb pytrainer
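A cross-package version of that lookup could be sketched like this; the mapping data and function names here are invented for illustration, since no such aggregated dataset exists yet:

```python
import difflib

# Hypothetical import-name -> distribution-name mapping; in practice this
# data would be aggregated from package metadata across an index like PyPI.
IMPORT_MAP = {
    "sklearn": ["scikit-learn"],
    "yaml": ["PyYAML"],
    "cv2": ["opencv-python", "opencv-python-headless"],
}

def suggest(import_name, cutoff=0.6):
    """Return (import, distributions) candidates for a possibly-misspelled import."""
    if import_name in IMPORT_MAP:
        return [(import_name, IMPORT_MAP[import_name])]
    # Fall back to fuzzy matching, command-not-found style
    close = difflib.get_close_matches(import_name, IMPORT_MAP, n=3, cutoff=cutoff)
    return [(name, IMPORT_MAP[name]) for name in close]

for name, dists in suggest("sklern"):
    print(f"import '{name}' is provided by: {', '.join(dists)}")
```

Note that `suggest` deliberately returns a list: as pointed out below, multiple distributions can legitimately provide the same import name.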
While it sounds tempting, you can’t really block “typosquatting” of
import names or set up redirects. In fact there are very good
reasons for multiple packages on PyPI to provide modules which use
the same import names, so collisions over those should be expected.
Any sort of search based on an import name should be prepared to
return multiple matches. It will presumably be up to the user to
decide which one they’re looking for in that circumstance.
Would this standard help more packages/installation methods provide a declaration somewhere of import names? I’m interested in this sort of thing as I once wrote a personal hobby tool that depends on mapping imported names to distributions, currently using importlib.metadata.packages_distributions.
That seemed to work fine on the packages/installation methods I tried (including conda which is an important use case for me) but it’s not as satisfying to tell users “this should usually work but there’s no official reason to expect it to work for all of the packages you’re using”. AFAIK the RECORD file is purely optional.
Out of curiosity, what approach do you take when a drop-in
replacement package for an older abandoned project starts to gain in
popularity (specifically in the case where both the old and new
distribution packages provide the same import package name)? Does
the official mapping get updated and all projects relying on that
migrate at the same time, or do projects have the ability to
override the mapping locally on a one-by-one basis?
Similarly, I’ve had some projects go through a rename, updating the
distribution package name while keeping the original import package
name (either entirely or as a backward-compatible alias), though I
suppose those transitions are less problematic as long as there’s
not a corresponding Python version requirement shift. Or do your
mappings have the ability to vary by Python version, like how
environment markers work?
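For what it's worth, a version-qualified mapping could be sketched with a made-up data shape like the following; none of this reflects an actual standard, it just mimics the scoping that environment markers provide:

```python
import sys

# Hypothetical mapping where each entry can be qualified by a Python
# version range, loosely mimicking how environment markers scope deps.
MAPPING = {
    "dataclasses": [
        # The backport distribution is only relevant below Python 3.7
        {"distribution": "dataclasses", "python_max": (3, 6)},
    ],
    "tomllib": [
        # tomli fills in until tomllib landed in the stdlib in 3.11
        {"distribution": "tomli", "python_max": (3, 10)},
    ],
}

def resolve(import_name, python_version=sys.version_info[:2]):
    """Return the distributions that apply for the given Python version."""
    out = []
    for entry in MAPPING.get(import_name, []):
        lo = entry.get("python_min", (0, 0))
        hi = entry.get("python_max", (99, 0))
        if lo <= python_version <= hi:
            out.append(entry["distribution"])
    return out
```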
I don’t think anyone is suggesting a one-to-one mapping. That’s not realistic given all the existing packages, and has the issues you mentioned. If multiple packages provide the same import name, a tool could list the options.
Yes, so far I’ve been assuming that tools which care to have a
single result for the reverse mapping would simply prompt the user,
or require some other disambiguation signal, when multiple results
are returned.
Then I guess I don’t understand what you were asking about. If there’s a replacement package for an older abandoned package and they use the same name, they’ll both show up in the list. What needs to be updated?
I think the distinction here is that nobody should be relying on this lookup to install their dependencies, because these names don’t uniquely specify dependencies and likely never will. But it would still be helpful.
Sorry if I was unclear. I wasn’t asking about the proposal, I was asking Joshua how (or whether) the mapping system he described for Pantsbuild handles those specific situations, in case there are solutions to the choosing problem I hadn’t considered.
This forum doesn’t handle discussion threading very well, so it probably wasn’t obvious what I was responding to. I’m switching from my normal E-mail replying to using a Web browser to see if that helps at all.
I want to explain my thinking here, and in the context of my use case. That way others can agree or disagree or, better yet, correct my fallacies.
[Edit I was just repeating things I said above, so I’m removing it for the sake of brevity]
So being able to start generating the mapping of imports → packages would be (chef's kiss). I somewhat know that @brettcannon has a similar-enough use case in mind for VS Code.
Now on to observations…
Having the new info in metadata means PEP 658 (oh man I WORSHIP PEP 658 at this point) makes it easy to scrape
buuuuut this only helps going forward; it does nothing for the historical data (which will dominate for at least a period of several years). Meaning there is a reasonable demand for backporting
Buuuut I’m not sure what the expectation for that is. Should PyPI (et al) be modifying METADATA? (Can they even? I think changing the metadata hash is likely a bad idea). Maybe we need another PEP to allow package hosts to expose a “backported, and inferred METADATA”?
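To make the scraping side concrete: PEP 658 serves a wheel's Core Metadata at the wheel URL plus a `.metadata` suffix, and that file parses as RFC 822-style headers. The `Import-Name` field below is purely hypothetical, standing in for whatever field a future spec might actually define:

```python
from email.parser import HeaderParser

# Example Core Metadata as it might be served via PEP 658's
# <wheel-url>.metadata endpoint. "Import-Name" is a made-up field,
# not part of any accepted metadata spec.
RAW_METADATA = """\
Metadata-Version: 2.1
Name: scikit-learn
Version: 1.3.0
Import-Name: sklearn
"""

def import_names(metadata_text):
    """Extract the hypothetical Import-Name entries from Core Metadata."""
    msg = HeaderParser().parsestr(metadata_text)
    return msg.get_all("Import-Name", [])

print(import_names(RAW_METADATA))
```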
+1 to the idea, but we may need to take it farther in another PEP to not have to sit on our thumbs for a few years. Or at least, that’s how I see it from my (maybe narrow?) worldview.
That’s true to an extent and for a while. But over time more projects will release new wheels and things will slowly backfill. And projects that don’t are much less likely to be the ones people are using, so the lack of their metadata becomes less important.
PyPI cannot change the METADATA, it has to match what is inside the wheel itself.
I think I’m -1 on things where PyPI is asked to provide information that wasn’t provided by the projects themselves. Any sort of inferring process is going to have false positives, and that puts PyPI in a tricky place. We’ve also been bitten in the past by trying to “helpfully” change metadata for people and having them come back and get really mad at us for doing so.
It’s an unfortunate reality that anything new we add will have some kind of delayed implementation.
We could maybe implement something that would let people backfill for their own projects, some sort of METADATA patching solution? That would possibly solve a number of other long standing issues as well.
Seems like the best middle-ground scenario, although I’d still suggest a sidecar “backported-METADATA”, so the provided METADATA stays written in stone forever. But ultimately, as the client, I’m happy with anything I can query without downloading
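As a sketch of what such a sidecar might carry (this shape is entirely invented, not a proposal anyone has written down):

```python
import json

# Hypothetical sidecar document served next to the immutable METADATA.
# The original file stays byte-for-byte unchanged; the sidecar carries
# only the backfilled/inferred fields, plus provenance for each batch.
sidecar = {
    "for": "scikit_learn-1.3.0-py3-none-any.whl",
    "backfilled": {
        "Import-Name": ["sklearn"],
    },
    "provenance": "self-reported by project maintainers",
}

print(json.dumps(sidecar, indent=2))
```

Keeping provenance alongside each backfilled field would let clients decide for themselves how much to trust inferred versus self-reported data.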