Record the top-level names of a wheel in `METADATA`?

Probably “a new field”, but honestly I would assume the combination of a new field and “Provides” is going to provide better guidance than top_level.txt does.

I would love if we had better handling of namespace packages than what top_level.txt does.

3 Likes

I want to explain my thinking here, and in the context of my use case. That way others can agree or disagree or, better yet, correct my fallacies.

[Edit: I was just repeating things I said above, so I’m removing it for the sake of brevity.]

So being able to start generating the mapping of imports → packages would be (chef’s kiss). I somewhat know that @brettcannon has a similar-enough use case in mind for VS Code.
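
As a point of reference, the installed-environment version of that mapping already exists in importlib.metadata on Python 3.10+. A minimal sketch of what the index-level equivalent would let tools do:

import importlib.metadata

# Maps top-level import names to the distributions providing them in the
# *current environment*, e.g. {"yaml": ["PyYAML"], "PIL": ["Pillow"], ...}.
# The index-level mapping discussed here is the same idea, minus the
# "install it first" requirement.
mapping = importlib.metadata.packages_distributions()

def distributions_for(import_name: str) -> list[str]:
    # Only top-level names are recorded, so trim the dotted path first.
    top_level = import_name.split(".", 1)[0]
    return mapping.get(top_level, [])

print(distributions_for("yaml"))  # e.g. ['PyYAML'] if PyYAML is installed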

Now on to observations…

  • Having the new info in metadata means PEP 658 (oh man I WORSHIP PEP 658 at this point) makes it easy to scrape (see the sketch after this list)
  • buuuuut that only helps once projects publish new releases; historical data won’t have it (for at least a period of several years), meaning there is a reasonable demand for backporting
  • Buuuut I’m not sure what the expectation for that is. Should PyPI (et al.) be modifying METADATA? (Can they even? I think changing the metadata hash is likely a bad idea.) Maybe we need another PEP to allow package hosts to expose a “backported, and inferred METADATA”?
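
To make the “easy to scrape” point concrete, here is a rough sketch, assuming the index implements the PEP 691 JSON simple API and PEP 658/714 (the metadata is served at the file URL plus “.metadata”); error handling is omitted:

import json
import urllib.request
from email.parser import BytesHeaderParser

INDEX = "https://pypi.org/simple"

def fetch_wheel_metadata(project: str):
    # Ask the simple index for the PEP 691 JSON form of the project page.
    req = urllib.request.Request(
        f"{INDEX}/{project}/",
        headers={"Accept": "application/vnd.pypi.simple.v1+json"},
    )
    with urllib.request.urlopen(req) as resp:
        page = json.load(resp)
    for file in page["files"]:
        # "core-metadata" is the PEP 714 key; older responses used "dist-info-metadata".
        has_metadata = file.get("core-metadata") or file.get("dist-info-metadata")
        if file["filename"].endswith(".whl") and has_metadata:
            # PEP 658: the wheel's METADATA is available at "<file-url>.metadata".
            with urllib.request.urlopen(file["url"] + ".metadata") as meta:
                return BytesHeaderParser().parse(meta)
    return None

meta = fetch_wheel_metadata("requests")
if meta is not None:
    print(meta["Name"], meta["Version"])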

+1 to the idea, but we may need to take it farther in another PEP to not have to sit on our thumbs for a few years. Or at least, that’s how I see it from my (maybe narrow?) worldview.

1 Like

That’s true to an extent and for a while. But over time more projects will release new wheels and things will slowly backfill. And projects that don’t are much less likely to be the ones people are using, so the lack of their metadata becomes less important.

PyPI cannot change the METADATA, it has to match what is inside the wheel itself.
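
For illustration, that invariant is easy to check client-side; a small sketch, assuming you have a downloaded wheel and the index-served “<wheel>.metadata” file sitting next to it:

import hashlib
import zipfile
from pathlib import Path

def metadata_matches(wheel_path: str, metadata_path: str) -> bool:
    with zipfile.ZipFile(wheel_path) as whl:
        # The wheel format stores METADATA inside the single *.dist-info/ directory.
        name = next(n for n in whl.namelist() if n.endswith(".dist-info/METADATA"))
        embedded = whl.read(name)
    served = Path(metadata_path).read_bytes()
    # The standalone metadata file must be byte-for-byte the wheel's METADATA.
    return hashlib.sha256(embedded).digest() == hashlib.sha256(served).digest()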

I think I’m -1 on things where PyPI is asked to provide information that wasn’t provided by the projects themselves. Any sort of inferring process is going to have false positives, and that puts PyPI in a tricky place. We’ve also been bitten in the past by trying to “helpfully” change metadata for people and having them come back and get really mad at us for doing so.

It’s an unfortunate reality that anything new we add will have some kind of delayed implementation.

We could maybe implement something that would let people backfill for their own projects, some sort of METADATA patching solution? That would possibly solve a number of other long standing issues as well.

2 Likes

Seems like the best middle-ground scenario. Although, I’d still suggest a sidecar “backported-METADATA”, so the provided METADATA stays written in stone forever. But ultimately as the client I’m happy with anything I can query without downloading :smile:

With roughly 77% of people saying they would like to see something in METADATA, this seems worth pursuing. So, the next question is if we were to follow through with this, should it be a new field or Provides from core metadata 1.1 that was deprecated in 1.2?

1 Like

Looks like Provides is the preferred option.

So what’s next? Since Provides exists, do we just need to extend [project] to support this (and document it)? Do we want to make this a SHOULD for top-level packages or when the top-level package name differs from the distribution name and a MAY for all other cases (PEP 314 doesn’t say if it’s an all-or-nothing proposition for listing what a distribution contains)?
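
For what it’s worth, however the [project] spelling shakes out, consumers would read the result back as a repeatable core metadata field. A minimal sketch (the example in the comment is hypothetical until tools actually emit Provides):

import importlib.metadata

def provided_names(distribution: str) -> list[str]:
    # Provides is a multiple-use field, so get_all() returns every occurrence
    # (or None when the field is absent).
    meta = importlib.metadata.metadata(distribution)
    return meta.get_all("Provides") or []

# e.g. provided_names("some-distribution") might return ["some_toplevel_name"]
# once back-ends start writing the field.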

Can we please say MUST NOT list anything but top-level packages? I don’t want to be pressured into enumerating every sub-package as the current example does.

edit: and of course maybe the only exception would be namespace packages, which I don’t remember how we decided to handle earlier in this discussion

3 Likes

Namespace packages would have to be supported in the obvious way (e.g. azure-storage-blob has Provides: azure.storage.blob).

I can’t think of a situation where a consumer has difficulty finding the most specific Provides value for a given import. However, I could see a security vulnerability here, where a malicious package advertises Provides: cryptography.hazmat.primitives and the consumer automatically installs the malicious package instead of cryptography.
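
To sketch what “most specific Provides value” resolution could look like (the index mapping below is made up for illustration), and why the longest match is exactly where the spoofing risk lives:

# Hypothetical index data: Provides values -> distributions declaring them.
provides_index = {
    "azure": ["azure-common"],
    "azure.storage.blob": ["azure-storage-blob"],
    "cryptography": ["cryptography"],
}

def resolve(import_name: str) -> list[str]:
    # Walk from the full dotted name down to its top-level prefix and return
    # the distributions declaring the longest matching Provides value.
    parts = import_name.split(".")
    for i in range(len(parts), 0, -1):
        prefix = ".".join(parts[:i])
        if prefix in provides_index:
            return provides_index[prefix]
    return []

print(resolve("azure.storage.blob.aio"))          # ['azure-storage-blob']
print(resolve("cryptography.hazmat.primitives"))  # ['cryptography'] -- but a
# malicious "Provides: cryptography.hazmat.primitives" entry would win instead,
# because the longer prefix matches first.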

1 Like

What happens if I want to create a namespace ns and my wheel provides both ns.pkg1 and ns.pkg2? Should we list both ns.pkg1 and ns.pkg2? Should we add ns to the list or exclude it? Or should we just list ns on its own, as the topmost name?

What if I have a namespace ns, but no Python module inside the namespace, only data files? Should ns be listed?

How about when ns is a legacy (non-PEP 420) namespace? (legacy namespaces are trickier to automatically detect).

I am assuming that for “metapackages” (no actual implementation, only defining dependencies/entry-points/etc…) it is OK to leave the field out.

@brettcannon I think it would be nice if we can have some agreement regarding the edge cases.

Also it would be good to amend the core metadata spec to say that the field is no longer deprecated. That might require a PEP.

Another nice thing to have would be a backward-compatibility policy for wheels/sdists created while the field was considered deprecated.

1 Like

The answer to this (and the other questions you ask) depends heavily on how this metadata is expected to be used. With a few exceptions, project metadata is really just documentation for human readers, not intended to be consumed by tools and as a result nowhere near reliable or well-defined enough for tools to do so. If we view this proposal as simply making it easier to write values into the Provides metadata, while not making any new promises about the reliability of that data, then I guess not having answers for these questions is acceptable. And if the only intended use case for this is to provide some sort of “reverse lookup search” capability on PyPI, then maybe that’s OK (searches don’t have to be reliable).

I actually think there’s a disconnect here - @brettcannon said

which suggests that he expects it to be manually entered by users in pyproject.toml. Which would make your questions largely irrelevant, as the data would simply be (in effect) user-provided free text, and as such there’s no guarantees at all.

But I thought what we were talking about was backends calculating this value (as best they could) and that does lead to all of the questions you are asking, and the concerns @ofek expressed. And it also adds the question of what a backend that can’t (or doesn’t want to) work out the exposed import names should do.

I think this needs a PEP. Maybe just to state “this is entered by the user in pyproject.toml and so should be considered precisely as reliable as any other user-supplied data”. But more likely, I think, to clearly set out what the contract is between backends which want to make this data available, and tools that want to consume it.

4 Likes

I feel like MUST NOT is too prescriptive, but perhaps this can be explicitly optional? As an example, matplotlib may or may not wish to add Provides: matplotlib.pyplot to their metadata; while it is fairly clear where to find pyplot, it’s still a little piece of useful info for newcomers.

If people rely on the ImportError to search for their solution, it’ll be for cryptography. If they look at the line of code that raised the error they might try to find the submodule directly.

This is a general problem if packages aren’t listing all their submodules [1]. If a user tries to import scipy.spatial.distance and looks for that specific module on PyPI, it’s not going to be there. Perhaps the solution for this is a smarter search (“did you mean scipy?”) and explicit warnings if it looks like someone is offering a submodule from a different PyPI org than the rest of a namespace [2]


  1. and the alternative seems way too noisy ↩︎

  2. maybe this should be an automatic check run by PyPI generally? ↩︎

Speaking of security risks, and looking at this stick from the other end, is there an upper limit to the number of import package names a single distribution package can declare? Are there any concerns with malicious uploads polluting the index of import names (not just impersonating popular modules but generally complicating or conflating module name lookups)?

2 Likes

“Expect” isn’t quite right if it’s being prescribed to me. I’m very happy with it being calculated and only being the top-level name, since that’s what an installation of a project “claims”. I only brought this up because Provides is currently defined in a PEP to cover any/all names a distribution provides, which no back-end can definitively calculate thanks to things like __path__ manipulation (pywin32, I believe, typically gets mentioned at this point).

I’m personally only interested in the top-level names. If people want a PEP to reclaim Provides for top-level names only, with no pyproject.toml field, under the assumption that back-ends will fill that detail in based on what would get unpacked, then I’m happy to write that PEP. I’m also happy to skip any PEP and say we just document Provides with a SHOULD for top-level names and a MAY for everything else.
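
For concreteness, “based on what would get unpacked” could look roughly like the sketch below. It is deliberately best-effort: it cannot see runtime tricks like __path__ manipulation, and it leaves the data-only-directory question from earlier unanswered:

import zipfile

def top_level_names(wheel_path: str) -> set[str]:
    names = set()
    with zipfile.ZipFile(wheel_path) as whl:
        for entry in whl.namelist():
            first, _, rest = entry.partition("/")
            # Skip wheel bookkeeping directories.
            if first.endswith((".dist-info", ".data")):
                continue
            if rest:                      # a top-level directory (package or data)
                names.add(first)
            elif first.endswith(".py"):   # a top-level single-file module
                names.add(first[:-3])
    return names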

Yes, but that’s up to whoever chooses to index things. I would assume projects would be given some hard cap, and other metrics such as download count would help sort out anyone trying to fool people into using a malicious package.

1 Like

I have another dream which is related to this one. It would be handy to have a mapping of executables to PyPI distributions.

On Ubuntu, here’s what happens if I run a command that is not found:

$ foobar
Command 'foobar' not found, but can be installed with:
sudo snap install foobar  # version 0.12.3, or
sudo apt  install foobar  # version 0.12.2-2

I’m dreaming of a future where PyPI is searched for executables as well, so that you could see output like:

$ cowsay
Command 'cowsay' not found, but can be installed with:
sudo apt install cowsay  # perl implementation
pipx install cowsay  # python implementation

For that feature to work, there would need to be a mapping of executables to PyPI projects. This mapping would preferably be stored offline, if possible. PyPI’s API could also offer this information.
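
As a rough sketch, the installed-environment half of that mapping can already be built from console_scripts entry points (the entry point metadata mentioned below); the missing piece is an offline, index-wide equivalent:

import importlib.metadata

def console_scripts() -> dict[str, str]:
    # Map each console-script name to the distribution that provides it,
    # for everything installed in the current environment.
    mapping = {}
    for dist in importlib.metadata.distributions():
        for ep in dist.entry_points:
            if ep.group == "console_scripts":
                mapping[ep.name] = dist.metadata["Name"]
    return mapping

print(console_scripts().get("cowsay"))  # e.g. 'cowsay' if the Python one is installed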

If people are interested in this, let’s create a new thread, as I don’t want to derail the conversation. I only bring it up here because I can see how the implementations of this mapping and the module mapping might overlap.

1 Like

Is this not already covered by the entry point metadata?

1 Like

I’m not sure. I downloaded the files from pyright to check. When I download the .tar.gz tarball, I can find the executable listed in entry_points.txt under pyright-1.1.337/pyright.egg-info/ in the archive. When I download the .whl file, I can’t find the executable listed in METADATA, but I can find it listed in entry_points.txt.

I also had a quick look at the PyPI API, and I don’t think it returns entry points or console_scripts. There also isn’t an API that will redirect you to a project given a particular executable name.

Entry points are always listed in entry_points.txt, even in a wheel.

The entry points spec and core metadata spec should help.
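
To illustrate that, here is a small sketch that lists a wheel’s console scripts without installing it; the filename and values in the comments are illustrative:

import configparser
import zipfile

def wheel_console_scripts(wheel_path: str) -> dict[str, str]:
    with zipfile.ZipFile(wheel_path) as whl:
        name = next(
            (n for n in whl.namelist() if n.endswith(".dist-info/entry_points.txt")),
            None,
        )
        if name is None:
            return {}  # the wheel defines no entry points at all
        # entry_points.txt is an INI-style file; "=" is the only delimiter and
        # script names are case-sensitive.
        parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
        parser.optionxform = str
        parser.read_string(whl.read(name).decode("utf-8"))
    if not parser.has_section("console_scripts"):
        return {}
    return dict(parser.items("console_scripts"))

# e.g. wheel_console_scripts("pyright-1.1.337-py3-none-any.whl")
#      -> {"pyright": "pyright.cli:entrypoint", ...} (illustrative values)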