Typically, sure, but we all know users don’t always care about “correct most of the time” when it’s holding up their work. I mean we will very likely do this, maybe with a handful of very common special-cases (looking at you, Pillow and scikit-learn), but obviously having more accurate information to cover more cases would be appreciated.
I wouldn’t call it unsuitable, but METADATA has the perk of being available without downloading the sdist or wheel entirely via PEP 658.
Let’s not have PEP 658 be a reason for dumping everything into Metadata…
I’m not saying this shouldn’t go into Metadata, but I think it needs a good reason beyond this. PyPI can index top-level.txt at the same time it extracts metadata, and there is precedent in direct-url.json for using separate files, so I think we should look at each case on its merits. For me, existing usage is a significant advantage (it’s already supported by importlib.metadata, for example).
Sorry, a small correction on the information I gave previously: the correct file name is top_level.txt (with the “underscore” character, like entry_points.txt).
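For anyone who wants to check what an installed distribution records, here is a minimal sketch using importlib.metadata. The helper names (parse_top_level, installed_top_level) are made up for illustration. The file is not guaranteed to exist, so the function returns None when it is missing.

```python
from importlib import metadata


def parse_top_level(text):
    """Return the non-empty lines of a top_level.txt payload."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def installed_top_level(dist_name):
    """Top-level import names recorded for an installed distribution.

    Returns None when the distribution is not installed or when it
    ships no top_level.txt (nothing currently requires the file).
    """
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    text = dist.read_text("top_level.txt")
    return None if text is None else parse_top_level(text)
```

Calling installed_top_level("Pillow") in an environment where a setuptools-built Pillow is installed would be expected to include PIL, but that depends on how the distribution was built.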
It should be fairly reliable, and it seems it has been produced since setuptools 0.5a9, so a good chunk of published packages should have it. Its implementation does not seem to have varied significantly over the years (I can only see formatting changes), so it should also be fairly stable.
setuptools gets its value by iterating through its packages, py_modules and ext_modules configuration fields (either explicitly given by the user or “auto-discovered”) and yielding only the top-level names.
As far as I know, the limitations would be the following:
No nested module/package is listed (so if multiple distributions share the same namespace, their top_level.txt files would be the same)
If any plugin or build customisation bypasses setuptools mechanisms to add files directly to the distribution archives, they will not be considered when deriving top_level.txt
If a regular build is customised to include a .pth to inject modules from another location, it is very likely those are also not considered when deriving top_level.txt.
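To make the derivation and the namespace limitation concrete, here is a rough sketch of the idea. It is not the setuptools implementation: derive_top_level is a hypothetical helper, and it takes extension modules as plain dotted names rather than Extension objects.

```python
def derive_top_level(packages=(), py_modules=(), ext_modules=()):
    """Collect the first dotted component of every configured name."""
    names = set()
    for dotted in (*packages, *py_modules, *ext_modules):
        # "ns.a.sub" contributes only "ns"; nested names are dropped.
        names.add(dotted.split(".", 1)[0])
    return sorted(names)


# Two distributions sharing a namespace end up with identical output,
# which is the first limitation listed above.
dist_a = derive_top_level(packages=["ns", "ns.a"])
dist_b = derive_top_level(packages=["ns", "ns.b"])
```

Files added by plugins or .pth tricks never appear in those three fields, so a derivation like this cannot see them.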
That sounds pretty good, and I’d support standardising this so that tools can reasonably expect this to be present regardless of build backend. (I suppose we shouldn’t be surprised at the possibility that setuptools solved this problem years ago - there were a lot of good ideas implemented in setuptools before PEP 517, that aren’t immediately obvious when you’re writing a new backend from scratch).
After inspecting the following setuptools files, I would say that the top_level.txt should be available for sdists too (although in a different location, i.e., inside the *.egg-info directory that setuptools includes in sdists).
I looked into a few examples as evidence (via inspector), and the file is indeed available in: build-0.9.0.tar.gz, pip-23.2.tar.gz, boto3-1.28.5.tar.gz, python-dateutil-2.8.2.tar.gz, matplotlib-3.7.2.tar.gz, pandas-2.0.3.tar.gz and others.
However, it is possible to prevent the content of the .egg-info folder from being included in the sdist with customisations (e.g. via MANIFEST.in or by changing command implementations via cmdclass)[1].
[1] This is probably why I cannot find it in numpy-1.25.0.tar.gz.
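If you want to repeat that check locally rather than through inspector, something like the following should work on a downloaded sdist. It is only a sketch: sdist_top_level is a hypothetical helper, and it assumes the file sits inside a *.egg-info directory somewhere in the tarball, as described above.

```python
import tarfile


def sdist_top_level(path):
    """Return names from the first *.egg-info/top_level.txt, or None."""
    # Mode "r" (the default) lets tarfile detect the compression itself.
    with tarfile.open(path) as archive:
        for member in archive:
            parts = member.name.split("/")
            if (
                len(parts) >= 2
                and parts[-1] == "top_level.txt"
                and parts[-2].endswith(".egg-info")
            ):
                data = archive.extractfile(member).read().decode("utf-8")
                return [ln.strip() for ln in data.splitlines() if ln.strip()]
    # The sdist does not ship the file (e.g. excluded via MANIFEST.in).
    return None
```

A None result only says the file is absent from that archive; it says nothing about what the built wheel would contain.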
The sdist spec suggests that it should be in the PKG-INFO directory. But as that spec only applies for sdists which use Metadata 2.2, that’s more of a “what things will be like in the future” comment. I’d say that the general principle (at least as applied by pip) is that nothing in an sdist is currently reliable, so YMMV.
@abravalheri’s answer is more practical right now, though
With what idea precisely? Standardising top_level.txt, or adding a new metadata field? Personally, I’m OK with the former, but not really comfortable with the latter at this point.
I personally feel like adding it to METADATA makes sense, I don’t see a particular reason to make it a separate file. It’s not like it’s going to be particularly large, etc.
Metadata arguably already has/had a place for this information: the Provides field - added in metadata version 1.1, deprecated in 1.2 (2010), but never actually removed as far as I can see.
Flit has been putting the top-level import name in Provides for ages. With namespace packages, it should supply a name like namespace.my_bit, i.e. the subpackage that this distribution provides.
Each entry contains a string describing a package or module that will be provided by this package once it is installed. These strings should match the ones used in Requirements fields.
And here is the Requires field that it references:
The format of a requirement string is identical to that of a module or package name usable with the ‘import’ statement, optionally followed by a version declaration within parentheses.
Strictly that doesn’t say that it is an import name, only that it’s in the same format (plus a version number), but all the examples are import names, and it’s the obvious thing to do with it.
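For illustration, this is roughly how a tool could read Provides out of core metadata today. The metadata text below is a made-up sample (example-dist and namespace.my_bit are placeholders), and provided_names is a hypothetical helper built on the standard library's email parser, since core metadata uses RFC 822-style headers.

```python
from email.parser import Parser

# Hypothetical metadata for a distribution that fills in Provides.
SAMPLE_METADATA = """\
Metadata-Version: 2.1
Name: example-dist
Version: 1.0
Provides: namespace.my_bit
"""


def provided_names(metadata_text):
    """Return every Provides entry found in core metadata text."""
    message = Parser().parsestr(metadata_text)
    # The field may appear several times; get_all collects each one.
    return message.get_all("Provides", [])
```

For an installed distribution, importlib.metadata.metadata(name).get_all("Provides") should give the same kind of answer, with the caveat that entries may carry a version in parentheses and that nothing guarantees they are import names.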
In practical terms, I guess it might be easier to specify a new field than to revive the old one and deal with whatever people might have put there over the last 20 years. But I thought I’d give a gentle plug for making use of what we’ve already got.
I think this would be a nice addition that would have a couple of additional benefits:
It could help clearly eliminate one class of typosquatting if utilized on PyPI, perhaps by automatically setting up the kinds of redirects you described for scikit-learn. Users are often more aware of the import name of a package than the actual package name.
If this data was collected across packages, it could be used to create something like the command-not-found tool for imports, which suggests apt installation candidates for a given input:
$ pyt
Command 'pyt' not found, did you mean:
command 'py' from deb pythonpy
command 'iyt' from deb python3-yt
command 'pat' from deb dist
command 'yt' from deb python3-yt
command 'pyp' from deb pyp
command 'pytr' from deb pytrainer
While it sounds tempting, you can’t really block “typosquatting” of
import names or set up redirects. In fact there are very good
reasons for multiple packages on PyPI to provide modules which use
the same import names, so collisions over those should be expected.
Any sort of search based on an import name should be prepared to
return multiple matches. It will presumably be up to the user to
decide which one they’re looking for in that circumstance.
Would this standard help more packages/installation methods provide a declaration somewhere of import names? I’m interested in this sort of thing as I once wrote a personal hobby tool that depends on mapping imported names to distributions, currently using importlib.metadata.packages_distributions.
That seemed to work fine on the packages/installation methods I tried (including conda which is an important use case for me) but it’s not as satisfying to tell users “this should usually work but there’s no official reason to expect it to work for all of the packages you’re using”. AFAIK the RECORD file is purely optional.
I’m very +1 on this (mostly on the “reverse-lookup” part).
For Pantsbuild, we have to map import names to code (either third party packages or first party code).
Our approach is to first look up the name in a map of “known” packages whose module names don’t match their package names, falling back to assuming that the module name is the package name. Users can add their own mappings to help us fill in the blanks.
So a server we could query, caching the results, would be immensely helpful.
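As a rough sketch of that lookup order (not Pants’ actual code; the table contents and the distribution_for name are illustrative only):

```python
# Hypothetical seed data: import name -> distribution name.
KNOWN_MAPPINGS = {
    "PIL": "Pillow",
    "sklearn": "scikit-learn",
}


def distribution_for(import_name, user_overrides=None):
    """Guess which distribution provides an imported module."""
    top_level = import_name.split(".", 1)[0]
    # User-supplied entries win, then the built-in table.
    for table in (user_overrides or {}, KNOWN_MAPPINGS):
        if top_level in table:
            return table[top_level]
    # Fallback: assume the distribution is named after the module.
    return top_level
```

The fallback is the part that published import-name metadata would replace, since it is only a guess.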
Out of curiosity, what approach do you take when a drop-in
replacement package for an older abandoned project starts to gain in
popularity (specifically in the case where both the old and new
distribution packages provide the same import package name)? Does
the official mapping get updated and all projects relying on that
migrate at the same time, or do projects have the ability to
override the mapping locally on a one-by-one basis?
Similarly, I’ve had some projects go through a rename, updating the
distribution package name while keeping the original import package
name (either entirely or as a backward-compatible alias), though I
suppose those transitions are less problematic as long as there’s
not a corresponding Python version requirement shift. Or do your
mappings have the ability to vary by Python version, like how
environment markers work?
I don’t think anyone is suggesting a one-to-one mapping. That’s not realistic given all the existing packages, and has the issues you mentioned. If multiple packages provide the same import name, a tool could list the options.
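A small sketch of what “list the options” could look like, assuming some index already provides top-level names per distribution. The distribution names below are placeholders and build_import_index is a hypothetical helper.

```python
from collections import defaultdict


def build_import_index(top_level_by_dist):
    """Invert {distribution: [import names]} into {import name: [dists]}."""
    index = defaultdict(set)
    for dist_name, import_names in top_level_by_dist.items():
        for import_name in import_names:
            index[import_name].add(dist_name)
    return {name: sorted(dists) for name, dists in index.items()}


# Hypothetical data: an abandoned project and its drop-in replacement
# both ship the same import package.
index = build_import_index({
    "old-project": ["shared_name"],
    "new-fork": ["shared_name", "extra"],
})
candidates = index.get("shared_name", [])
```

Here candidates holds both distributions, and choosing between them is left to the user or to a local override.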