Record the top-level names of a wheel in `METADATA`?

Yup, that one, although I haven’t heard that this has really been an issue for anyone (not to say it isn’t an issue, just that if it was, people weren’t vocal about it).

Yes, so far I’ve been assuming that tools which care to have a single result for the reverse mapping would simply prompt the user or require some other disambiguation signal when multiple results are returned.
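Something like this is what I have in mind, purely as a sketch (where the candidate list comes from is left open):

```python
# Sketch of the disambiguation step: when the reverse mapping
# (import name -> providing distributions) returns several candidates,
# prompt the user instead of guessing. `candidates` would come from
# whatever index a tool builds; it's hypothetical here.
def pick_distribution(import_name: str, candidates: list[str]) -> str:
    if len(candidates) == 1:
        return candidates[0]
    print(f"Multiple projects provide {import_name!r}:")
    for i, name in enumerate(candidates, 1):
        print(f"  {i}. {name}")
    choice = int(input("Install which one? "))
    return candidates[choice - 1]
```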

Then I guess I don’t understand what you were asking about. If there’s a replacement package for an older abandoned package and they use the same name, they’ll both show up in the list. What needs to be updated?

I think the distinction here is that nobody should be relying on this lookup to install their dependencies, because these names don’t uniquely specify dependencies and likely never will. But it would still be helpful.

Sorry if I was unclear. I wasn’t asking about the proposal, I was asking Joshua how (or whether) the mapping system he described for Pantsbuild handles those specific situations, in case there are solutions to the choosing problem I hadn’t considered.

This forum doesn’t handle discussion threading very well, so it probably wasn’t obvious what I was responding to. I’m switching from my normal E-mail replying to using a Web browser to see if that helps at all.

1 Like

To help move this along, here’s a poll to see where people sit.

(How) should we record the top-level module namespaces a package introduces?
  • top_level.txt file
  • The Provides field in METADATA
  • A new field in METADATA
  • Nowhere; this information isn’t useful enough
0 voters

Which answer fits best for “I don’t think I care which field in METADATA, so long as it’s there” (because I can query that from PyPI for newer wheels)?

Probably “a new field”, but honestly I would take the combined votes for “a new field” and “Provides” as guidance that something in METADATA is preferred over top_level.txt.

I would love if we had better handling of namespace packages than what top_level.txt does.
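To make that concrete, here’s roughly how the two existing options describe a namespace package like azure-storage-blob (illustrative contents, not copied from the real wheel):

```
# azure_storage_blob-*.dist-info/top_level.txt (undotted names only)
azure

# azure_storage_blob-*.dist-info/METADATA (core metadata 1.1, dotted names allowed)
Provides: azure.storage.blob
```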

3 Likes

I want to explain my thinking here, and in the context of my use case. That way others can agree or disagree or, better yet, correct my fallacies.

[Edit: I was just repeating things I said above, so I’m removing it for the sake of brevity]

So being able to start generating the mapping of imports → packages would be (chef’s kiss). I somewhat know that @brettcannon has a similar-enough use case in mind for VS Code.

Now on to observations…

  • Having the new info in metadata means PEP 658 (oh man I WORSHIP PEP 658 at this point) makes it easy to scrape (see the sketch at the end of this post)
  • buuuuut this doesn’t help with historical data (and won’t for at least a period of several years), meaning there is a reasonable demand for backporting
  • Buuuut I’m not sure what the expectation for that is. Should PyPI (et al.) be modifying METADATA? (Can they even? I think changing the metadata hash is likely a bad idea.) Maybe we need another PEP to allow package hosts to expose a “backported, and inferred, METADATA”?

+1 to the idea, but we may need to take it further in another PEP so we don’t have to sit on our hands for a few years. Or at least, that’s how I see it from my (maybe narrow?) worldview.
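For the first bullet, the scraping step could be as simple as this (untested sketch; it assumes the index actually serves PEP 658’s {file_url}.metadata, and skips the Simple API availability check a real tool should do first):

```python
# Untested sketch: pull a wheel's METADATA via PEP 658's
# "{file_url}.metadata" convention and read out any Provides fields,
# without downloading the wheel itself.
from email.parser import HeaderParser
from urllib.request import urlopen

def provided_names(wheel_url: str) -> list[str]:
    with urlopen(wheel_url + ".metadata") as resp:
        msg = HeaderParser().parsestr(resp.read().decode("utf-8"))
    return msg.get_all("Provides") or []
```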

1 Like

That’s true to an extent and for a while. But over time more projects will release new wheels and things will slowly backfill. And projects that don’t are much less likely to be the ones people are using, so the lack of their metadata becomes less important.

PyPI cannot change the METADATA, it has to match what is inside the wheel itself.

I think I’m -1 on things where PyPI is asked to provide information that wasn’t provided by the projects themselves. Any sort of inferring process is going to have false positives, and that puts PyPI in a tricky place. We’ve also been bitten in the past by trying to “helpfully” change metadata for people and having them come back and get really mad at us for doing so.

It’s an unfortunate reality that anything new we add will have some kind of delayed implementation.

We could maybe implement something that would let people backfill for their own projects, some sort of METADATA patching solution? That would possibly solve a number of other long standing issues as well.

2 Likes

Seems like the best middle-ground scenario. Although I’d still suggest a sidecar “backported-METADATA”, so the provided METADATA stays written in stone forever. But ultimately as the client I’m happy with anything I can query without downloading :smile:

With roughly 77% of people saying they would like to see something in METADATA, this seems worth pursuing. So, the next question is: if we were to follow through with this, should it be a new field, or Provides from core metadata 1.1 (which was deprecated in 1.2)?

0 voters
1 Like

Looks like Provides is the preferred option.

So what’s next? Since Provides exists, do we just need to extend [project] to support this (and document it)? Do we want to make this a SHOULD for top-level packages, or when the top-level package name differs from the distribution name, and a MAY for all other cases (PEP 314 doesn’t say if it’s an all-or-nothing proposition for listing what a distribution contains)?
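For the [project] side, I’m picturing something along these lines (the key name is purely hypothetical; nothing in the current spec defines it):

```toml
[project]
name = "azure-storage-blob"
version = "1.0.0"
# Hypothetical key; a backend would write each entry out as a
# "Provides:" line in METADATA.
provides = ["azure.storage.blob"]
```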

Can we please say MUST NOT list anything but top-level packages? I don’t want to be pressured into enumerating every sub-package as the current example does.

edit: and of course maybe the only exception would be namespace packages, which I don’t remember how we decided to handle in this discussion

3 Likes

Namespace packages would have to be supported in the obvious way (e.g. azure-storage-blob has Provides: azure.storage.blob).

I can’t think of a situation where a consumer has difficulty finding the most specific Provides value for a given import. However, I could see a security vulnerability with this where a malicious package advertises Provides: cryptography.hazmat.primitives and a consumer automatically installs the malicious package instead of cryptography.
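By “most specific” I mean longest dotted prefix wins, roughly (sketch; the index mapping Provides values to project names is hypothetical):

```python
# Sketch of "most specific Provides value" resolution: try the full
# dotted import name first, then progressively shorter prefixes.
# `index` maps Provides values to the set of projects declaring them;
# where it comes from is left open.
def resolve(import_name: str, index: dict[str, set[str]]) -> set[str]:
    parts = import_name.split(".")
    for i in range(len(parts), 0, -1):
        prefix = ".".join(parts[:i])
        if prefix in index:
            return index[prefix]
    return set()

# resolve("azure.storage.blob.aio", {"azure.storage.blob": {"azure-storage-blob"}})
# -> {"azure-storage-blob"}
```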

1 Like

What happens if I want to create a namespace ns and my wheel provides both ns.pkg1 and ns.pkg2? Should we list both ns.pkg1 and ns.pkg2? Should we add ns to the list or exclude it? Or should we just have ns as the top most?
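For concreteness, the layout in that first question looks like this (PEP 420 style, with no __init__.py at the ns level):

```
src/
└── ns/            # implicit namespace package: no __init__.py here
    ├── pkg1/
    │   └── __init__.py
    └── pkg2/
        └── __init__.py
```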

What if I have a namespace ns, but no Python module inside the namespace, only data files? Should ns be listed?

How about when ns is a legacy (non-PEP 420) namespace? (legacy namespaces are trickier to automatically detect).

I am assuming that for “metapackages” (no actual implementation, only defining dependencies/entry-points/etc…) it is OK to leave the field out.

@brettcannon I think it would be nice if we can have some agreement regarding the edge cases.

Also, it would be good to amend the core metadata spec to say that the field is no longer deprecated. That might require a PEP.

Another nice thing to have would be a backward-compatibility policy for wheels/sdists created while the field was considered deprecated.

1 Like

The answer to this (and the other questions you ask) depends heavily on how this metadata is expected to be used. With a few exceptions, project metadata is really just documentation for human readers; it isn’t intended to be consumed by tools, and as a result it’s nowhere near reliable or well-defined enough for tools to depend on. If we view this proposal as simply making it easier to write values into the Provides metadata, while not making any new promises about the reliability of that data, then I guess not having answers for these questions is acceptable. And if the only intended use case for this is to provide some sort of “reverse lookup search” capability on PyPI, then maybe that’s OK (searches don’t have to be reliable).

I actually think there’s a disconnect here: @brettcannon said

which suggests that he expects it to be manually entered by users in pyproject.toml. That would make your questions largely irrelevant, as the data would simply be (in effect) user-provided free text, and as such there are no guarantees at all.

But I thought what we were talking about was backends calculating this value (as best they could) and that does lead to all of the questions you are asking, and the concerns @ofek expressed. And it also adds the question of what a backend that can’t (or doesn’t want to) work out the exposed import names should do.

I think this needs a PEP. Maybe just to state “this is entered by the user in pyproject.toml and so should be considered precisely as reliable as any other user-supplied data”. But more likely, I think, to clearly set out what the contract is between backends which want to make this data available, and tools that want to consume it.

4 Likes

I feel like MUST NOT is too prescriptive, but perhaps this can be explicitly optional? As an example, matplotlib may or may not wish to add Provides: matplotlib.pyplot to their metadata; while it is fairly clear where to find pyplot, it’s still a little piece of useful info for newcomers.

If people rely on the ImportError to search for their solution, it’ll be for cryptography. If they look at the line of code that raised the error they might try to find the submodule directly.

This is a general problem if packages aren’t listing all their submodules [1]. If a user tries to import scipy.spatial.distance and searches for that specific module on PyPI, it’s not going to be there. Perhaps the solution for this is a smarter search (“did you mean scipy?”) and explicit warnings if it looks like someone is offering a submodule from a different PyPI org than the rest of a namespace [2].


  1. and the alternative seems way too noisy ↩︎

  2. maybe this should be an automatic check run by PyPI generally? ↩︎
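For footnote 2, the check could start as something like this (sketch only; `index`, mapping existing Provides values to owning projects, is hypothetical, and real logic would need to allow legitimately shared namespaces):

```python
# Sketch only: flag Provides values that extend a dotted name already
# provided by a *different* project.
def suspicious_provides(project: str, provides: list[str],
                        index: dict[str, str]) -> list[str]:
    flagged = []
    for name in provides:
        parts = name.split(".")
        for i in range(1, len(parts)):
            prefix = ".".join(parts[:i])
            owner = index.get(prefix)
            if owner is not None and owner != project:
                flagged.append(name)
                break
    return flagged

# suspicious_provides("not-cryptography", ["cryptography.hazmat.primitives"],
#                     {"cryptography": "cryptography"})
# -> ["cryptography.hazmat.primitives"]
```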

Speaking of security risks, and looking at this stick from the other end, is there an upper limit to the number of import package names a single distribution package can declare? Are there any concerns with malicious uploads polluting the index of import names (not just impersonating popular modules but generally complicating or conflating module name lookups)?

2 Likes