Pre-PEP: Import name metadata

Ah, sorry, I didn’t read your suggestion as serving the RECORD file directly from the index à la the .metadata URLs used to get the core metadata. That honestly could work if people would rather go that route, since I suspect a lot of back-ends will just calculate the value for Import-Name based on the file structure anyway, and it’s easy enough to do. But it does leave sdists out of the picture since RECORD is wheel-specific (though maybe that’s a good thing, since sdists could materialize who knows what at build time?).

I’m happy to change the example if someone has a different project on PyPI that’s a namespace package.

Not directly, but I also don’t know how much people are worried about discrepancies.

If you take the suggestion from @ncoghlan that I reference above, then answering “what files are actually in the package” becomes a matter of expanding the index API to serve the RECORD file from wheels, although that leaves sdists out (but as I mention above, maybe that isn’t a bad thing). Otherwise, simply serving the file list in a generic way – i.e. like RECORD but with no hashes – could also work if that isn’t too costly for indexes to calculate and cache. Maybe something like a .contents URL (it could be an array of strings in the JSON version of the index API instead of a separate file, but the HTML API would be messy, as Steve pointed out)?
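
For illustration, here’s roughly what a consumer might do against such a hypothetical .contents endpoint (the URL suffix and the JSON shape are invented here for discussion, not part of any spec):

import json
from urllib.request import urlopen

def fetch_file_list(wheel_url: str) -> list[str]:
    # Hypothetical: mirrors the PEP 658 ".metadata" sidecar convention,
    # but serving a hash-less file listing; no index serves this today.
    with urlopen(wheel_url + ".contents") as resp:
        return json.load(resp)  # assumed shape: ["PySide6/__init__.py", ...]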

1 Like

Similarly to the sklearn detail shared before, I’d like to present a use case from a project where I work, which is PySide6 (the Python bindings for Qt). Due to how large the Qt Framework is, at some point we decided to split the wheel in order to avoid requesting more and more space on PyPI. So we split the package into three: one that keeps the name of the package, and two others that share the same import name but add new modules. For example:

pip install PySide6-Essentials
# will enable people to use 'from PySide6.QtCore import X' (Qt Core is part of Essentials)
# will not enable people to use 'from PySide6.QtMultimedia import Y' (Qt Multimedia is an Addon)

This is because QtMultimedia is provided by the PySide6-Addons package.
So we encourage people to run pip install pyside6; that empty package has Essentials and Addons as dependencies, but all of those packages would have the Import-Name PySide6
(there is even another package that installs the examples, but you get the idea).
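
For illustration, if all of those distributions declared the proposed field, an installed environment could report it like this (a sketch only: the Import-Name field doesn’t exist yet, and this assumes the packages are installed):

from importlib.metadata import metadata

for dist in ("PySide6", "PySide6-Essentials", "PySide6-Addons"):
    # Each distribution would report the same import name.
    print(dist, metadata(dist).get_all("Import-Name"))  # -> ['PySide6']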

I’m not sure whether this configuration falls within the scope of the azure.mgmt.search case you mention, because in this particular case there are no packages for the missing submodules; rather, additional packages enable more modules under the same import name.

Would this scenario be invalid? Or could it also work by setting Import-Name to PySide6 in all of those packages?

2 Likes

If one of the packages provides the top level PySide6/__init__.py, that’s the only import name that distribution package would report.

The distribution packages that didn’t include that top level marker file would report the names of the specific submodules that they provided.

The use case does suggest to me that it may make sense to allow projects to publish submodule info if they choose to do so, since explicitly managed namespace packages (where the package initialisation extends its own __path__) are still fully supported, and those may want to list the namespace entries that are provided by default (in the same distribution package that also manages the import package namespace).
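
As a concrete sketch of that explicitly managed style (not necessarily how PySide6 actually implements it):

# PySide6/__init__.py in the distribution that owns the top-level package.
# Extending __path__ lets submodules shipped by sibling distributions
# (e.g. PySide6-Addons) be found under the same import package.
from pkgutil import extend_path

__path__ = extend_path(__path__, __name__)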

I think the only value for this as metadata is to allow projects to put in there any name they want to be discoverable under. Think of it as a specific (almost unique) classifier.

Requiring or in any way forcing it to match the actual contents of the package is pointless, because we can extract reality (while missing numerous edge cases) at the same time as checking.

Thinking of this as a specific search keyword rather than a download optimisation seems far more useful (and opens up the possibility of “what if we just recommended keywords: "import:my_module"” and started searching for those).
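
A rough sketch of that keyword convention, run against installed metadata (the import: prefix is purely hypothetical):

from importlib.metadata import distributions

def declared_imports(dist):
    # Hypothetical convention: "import:<name>" entries in the existing
    # Keywords core metadata field act as declared import names.
    keywords = (dist.metadata.get("Keywords") or "").split(",")
    return [k.removeprefix("import:") for k in keywords if k.startswith("import:")]

for dist in distributions():
    if names := declared_imports(dist):
        print(dist.metadata["Name"], names)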

1 Like

I would also love to see a solution that allows us to map import names to package names.

Whether that solution comes in the form of a wheel metadata extension or a change to PyPI to expose a file list for each wheel in the metadata doesn’t really matter too much to me.

The way I see it, updating PyPI to expose additional metadata scraped from existing wheel contents would likely be easier to implement than a PEP, but would miss out on some edge cases (i.e. sdist-only packages that generate modules and namespaces at build time).

Another use case where something like this would be really useful: the pantsbuild tool has a feature that lets you specify one or more entry points into a Python monorepo. It builds a graph of all the Python modules imported by those entry points, the modules imported by those modules, and so on; identifies all external packages depended on directly or transitively by those entry points; and dynamically constructs an installable wheel file.

They maintain a manual mapping of PyPI packages to import names to make this possible, and I’m sure they’d thoroughly enjoy not having to maintain it anymore.
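
For installed environments, the stdlib can already invert the mapping (Python 3.10+); the hard part in the pantsbuild case is getting this without installing anything:

from importlib.metadata import packages_distributions

mapping = packages_distributions()  # top-level import name -> [distribution names]
print(mapping.get("yaml"))  # e.g. ['PyYAML'], if installed
print(mapping.get("PIL"))   # e.g. ['pillow'], if installed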

Things seem to have settled: people definitely want a solution, but it isn’t clear which solution they prefer.

Options

Import-Name core metadata

(Or Import-Path if you’re the FLUFL.)

The pro of this is that there isn’t any processing needed by consumers of the data, as the names will already be written out. It also doesn’t require index servers to somehow expose a new piece of data.

The con is that it needs build back-end buy-in. It may also lead to incorrect data if build back-ends guess wrong or don’t provide a way to manually specify the information.

Serve RECORD

The pro is that the file already exists in wheels, so there’s nothing new to invent in that regard. You could also argue that being wheel-only is good, since you can’t know which files in an sdist will actually end up in a wheel. This can also be backfilled by indexes.

The cons are no sdist support (which, as I said above, could be viewed as a pro since it prevents incorrect guessing of what files actually make up the package). It also requires every consumer to calculate import names instead of having build back-ends do it once for everyone.
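
For a sense of what that consumer-side calculation looks like, here’s a naive sketch that deliberately ignores edge cases (.pth files, __path__ manipulation, non-importable top-level files):

import csv
import io

def top_level_names(record_text: str) -> set[str]:
    """Naively derive top-level import names from a wheel's RECORD."""
    names = set()
    for row in csv.reader(io.StringIO(record_text)):
        if not row:
            continue
        first = row[0].split("/", 1)[0]
        if first.endswith((".dist-info", ".data")):
            continue  # metadata directories, not importable code
        names.add(first.removesuffix(".py"))
    return names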

Custom file format

Think RECORD but without hashes.

The pro is it could support sdists or any archive format.

The con is that it supports any archive format. :wink: It is also yet another format (even if it is as simple as one file path per line).

Which one?

While people are occupied and maybe having conversations at PyCon US, here’s a poll that I will leave open for 2 weeks so it ends post PyCon US sprints. Hopefully a clear preference will show up.

  • Import-Name core metadata
  • Serve RECORD
  • Custom file format
0 voters

Is there a way to make options 2 and 3 not have the con of being potentially incorrect? Or is there a different reason you listed this con for option 1 but not for options 2 and 3?

In fact, I would argue that the ability for the package author to override it is a pro of option 1 - the package author/build back-end are going to have far better knowledge than an arbitrary tool that has to guess. There also isn’t a good way for these tools to learn when they make a mistake unless the user manually corrects them or the tool authors hard-code fixes into them - and a public list of such corrections somewhere really wouldn’t be better than option 1, IMO.

1 Like

No, because you won’t be inspecting code or having the authors participate in setting the details of how the import names are calculated. That isn’t to say the calculated value won’t be correct the vast majority of the time, but it may be off on occasion, or incomplete if __path__ manipulations are involved.

If it goes into RECORD or core metadata, does that introduce the possibility that the data varies amongst the wheels for a package?

(Not saying I want to do that! I don’t!)

1 Like

I changed my mind at some point during this discussion but didn’t have time to post. I was very eager for a solution originally, and although the proposal puts us in a much better spot, I now view the metadata field as entirely superfluous since we already have the RECORD file. I think not supporting source distributions is correct, because one does not install those directly, and it simply wouldn’t make sense to define the spec in terms of anything but the final directory structure. Another bonus is that no build back-end or other ecosystem-level changes are required.

I would like to note that although I chose the poll option to serve the RECORD file, I think more people would choose that option if you make clear that indices could serve some endpoint where the top-level paths in that file would be exposed rather than consumers all doing the same logic. The way it reads now might be interpreted as only serving that raw file directly and that’s it.

2 Likes

FWIW, that’s exactly how I interpreted that option. And I didn’t like the idea of every consumer having to parse that data, which is why I didn’t vote for that option.

I think if indexes serve some other representation, that counts as a custom/new format for the data, doesn’t it?

I interpreted the custom file format option to mean a literal file that a build backend is required to embed in a distribution.

I think we should do a new poll with option 2 stating that indices may serve the top level import paths from parsing the RECORD file in wheels. I’m pretty sure that would significantly change the votes.

1 Like

I agree the metadata field is superfluous,[1] but I still voted for it because I’d rather have a labelled, customisable field than purely machine driven data.

There are two reasons to want to obtain this information directly from the index (without downloading/installing the package), and both look like search.

A metadata field can answer “if I want to import x, should I install this package?” far more accurately than mechanically extracted information, simply because non-trivial packages can choose what to put there (or what not to put there).

Installing setuptools to satisfy import distutils, for example, is not going to be handled by looking in RECORD (you’ll find _distutils_hack that way, which nobody should be importing). But if the maintainers can specify metadata that says “if you want to import distutils, install us” then they can start showing up in (effectively) search results.
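
That gap is easy to demonstrate with the stdlib against an installed setuptools:

from importlib.metadata import files

# Mechanically derive top-level names from setuptools' installed file list
# (files() can return None if RECORD is missing, hence the "or []").
top_level = {
    f.parts[0]
    for f in (files("setuptools") or [])
    if not f.parts[0].endswith(".dist-info")
}
# Typically includes 'setuptools', 'pkg_resources' and '_distutils_hack'
# (plus stray top-level files), but never an importable 'distutils',
# which setuptools provides via an import hook instead.
print(sorted(top_level))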

So my assumptions about the answers were that the metadata fields would be entirely customisable[2], and the served RECORD would be a literal, per-distribution artifact (much the same as doing the ranged-download trick to get only the dist-info out of a wheel). But if the metadata field won’t be specified by the packager, I’m -1 on all the options.


  1. We could easily standardise a keyword prefix or define a classifier instead. ↩︎

  2. Maybe some backends may choose to infer it for you, but that’s neither a requirement nor recommended. ↩︎

2 Likes

Perhaps the spec could include a statement that packages SHOULD / MUST have the same value for all published distributions of a given version?

(Although come to think of it, that may be nontrivial for build backends to implement since they don’t know the full set of wheels being built?)

Not only would it be non-trivial, but it’s not true for all packages that each built distribution contains the same importable names.

I’m +1 on metadata if and only if it’s not required to be exhaustively correct, only informative of the intended supported import names, which covers the search use case. (This wouldn’t include something like name-mangled imports done by a back-end for vendoring reasons.)

Otherwise, I’m -1 on all options here without more detail actually resolving this core problem, as we’re going to exacerbate the current issue of metadata being accurate to the project’s intent vs. accurate to the distributions.

2 Likes

Personally, I’m mostly against any automated calculation of the value. I’d much rather the default be for the developer to specify the import names their package exposes manually, because that declares the intent rather than the implementation details. I’d like the default to be explicitly stated as “if the import name metadata is missing, tools can assume that the import name is the same as the package name”. Obviously, that’s not always true, but I don’t think that we should be aiming for 100% reliability here.
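
A sketch of that fallback rule (the dash-to-underscore normalisation is my own guess at what tools would do, not part of any proposal):

import re

def assumed_import_name(project_name: str) -> str:
    # If the metadata is absent, assume the import name is the project
    # name with runs of '-' and '.' collapsed to '_'. Often right
    # ("typing-extensions" -> "typing_extensions"), sometimes wrong
    # ("beautifulsoup4" is imported as "bs4").
    return re.sub(r"[-.]+", "_", project_name)

print(assumed_import_name("typing-extensions"))  # typing_extensions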

I don’t mind allowing the import name to be declared as dynamic, with the build backend calculating it, but I think that should be opt-in, and not something that build backends are required to provide.

5 Likes

(I voted 1 but would be thrilled if the solution became 2, since the underlying problem gets way more solvable with any solution)

In case I missed it or it went unsaid, option 2 would be immediately backwards compatible with existing wheels, yeah?

I’ve recently learned more about the license metadata problem (PEP 639 but also the current effort to follow that up with something more), and it has given me more perspective about this proposal.
I’d summarize it as:

We need to declare whether this is nominally project metadata or distribution metadata.

This is basically a +1 to what @mikeshardmind said, and dovetails well with some points others have made in-thread.

Project vs Distribution is the key issue with licenses right now, so let’s tread carefully.

The field may be stored in a distribution, but describe the project to which that distribution belongs. We only have the distributions as a place where build pipelines can even put the metadata – we could talk about the possibility of changing that someday, but it’s way out of scope for this proposal.

If the data describes the project, which is what I think it should do, then

  • even if it goes in the dists, we can safely say it should be the same for every dist
  • it’s clear that indexes can serve the data directly and it will be accurate to the packager’s intent
  • deriving it from RECORD or otherwise can be defined as an operation across many dists

That’s not an exhaustive list, just putting some color to what it would mean.

(Aside: this would change my vote in the poll to a metadata field, but oh well. :slightly_smiling_face:)

2 Likes

Yes.

That’s a different format than RECORD, so I consider that option 3. Think of option 2 as the indexes just have to extract the RECORD file and serve it. Option 3 is anything more complicated than that, and changing the file counts as “more complicated”. :wink:

It does:

I honestly thought I had already said that in the proposal, but I seem to have missed it. I expect the info to be accurate, but not exhaustive.

Correct.

OK, but that leads into …

So does that mean you’re arguing for a key in [project], @pf_moore? The license discussion spooked me enough to punt on that and go the route of defining only the core metadata field. But if people want a key, I can put that in and go with your suggestion that if the key is left out, it’s okay to assume the project name is the import name.

I never expected this to be required metadata.

1 Like

I had assumed it would be, simply because at this point having a new metadata field that can’t be specified in pyproject.toml would be weird. But I didn’t check the proposal, so I missed that you hadn’t said that.

I would want it to be a key in [project], yes. On reflection, I think my wording was a bit strong when I talked about “defaulting to” the project name. But if we say that if Import-Name is omitted, then tools may assume it’s the project name, that seems reasonable. We should probably note explicitly that “accurate, but not exhaustive” doesn’t apply in that case, though (i.e., you can assume the project name, but that might not be correct).
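
To pin down the shape under discussion, here’s a hypothetical pyproject.toml key parsed with the stdlib (the key name, spelling, and list-vs-string shape are all still undecided):

import tomllib  # Python 3.11+

doc = tomllib.loads("""
[project]
name = "PySide6-Addons"
version = "6.7.0"
import-names = ["PySide6"]
""")
project = doc["project"]
print(project.get("import-names", [project["name"]]))  # fall back to the name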

My point wasn’t that it should be required or not. What I was saying was that in spite of the comments that “the backend can calculate this” and “it’s derivable from RECORD”, I don’t want there to be an expectation that backends will do this. It’s a minor point, though - backends can do (or not do) whatever they prefer.

1 Like