Determining top-level import package names programmatically

kknechtel · February 2, 2024, 8:33pm

As far as I can tell, core metadata doesn’t store that kind of information about top-level package names, and setup.py doesn’t need to be present even in an sdist; so I really can’t imagine how this would work programmatically. I’m also not trying to be exhaustive here (or take up anyone’s resources like that).

CAM-Gerlach · February 3, 2024, 8:00am

For wheels you could check the top-level directory names (filtering out .data and .dist-info), either from the ZIP, via the RECORD, etc. Setuptools also includes a top_level.txt file in the .dist-info that is simply a newline-delimited list of top-level import package names.
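A rough sketch of the ZIP-listing approach, in case it helps. The function name and the demo wheel contents are made up for illustration; it only looks at the first path component of each archive entry and does not try to resolve anything under .data.

```python
import io
import zipfile
from pathlib import PurePosixPath


def top_level_names_from_wheel(wheel):
    """Return candidate top-level import names from a wheel's file listing.

    `wheel` is a path or binary file object accepted by zipfile.ZipFile.
    """
    names = set()
    with zipfile.ZipFile(wheel) as zf:
        for entry in zf.namelist():
            parts = PurePosixPath(entry).parts
            if not parts:
                continue
            first = parts[0]
            # Skip the metadata and data directories.
            if first.endswith((".dist-info", ".data")):
                continue
            if len(parts) > 1:
                names.add(first)  # a package directory
            elif first.endswith(".py"):
                names.add(first[:-3])  # a single-file module
            elif first.endswith((".so", ".pyd")):
                # Extension module, e.g. foo.cpython-311-x86_64-linux-gnu.so
                names.add(first.split(".", 1)[0])
            # Anything else at the top level (e.g. .pth files) is ignored.
    return names


def make_demo_wheel():
    """Build a tiny in-memory archive that mimics a wheel layout (made-up data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("demo/__init__.py", "")
        zf.writestr("_demo.py", "")
        zf.writestr("hook.pth", "")
        zf.writestr("Demo_Dist-1.0.dist-info/RECORD", "")
        zf.writestr("Demo_Dist-1.0.data/scripts/tool", "")
    buf.seek(0)
    return buf


print(sorted(top_level_names_from_wheel(make_demo_wheel())))
```

For a real wheel you would pass its path instead of the in-memory demo archive.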

For sdists, Setuptools-produced sdists include a similar top_level.txt in the .egg-info directory; that wouldn’t work for non-Setuptools sdists, but I’d wager only a tiny fraction of modern projects built with non-Setuptools build systems aren’t uploading wheels, particularly those that are at least reasonably popular, and you could always just build a wheel with build and then do the above.

It has been proposed (which also includes some discussion of how one can extract that currently, including in a scenario such as this).

Not sure how this is relevant, sorry? Whether or not they happen to use the discouraged dynamic build script feature of the Setuptools backend, any valid sdists must still be buildable, and will either include a pyproject.toml or will fall back to the legacy Setuptools builder, and you can just use build as normal to build a wheel and check via the methods specified above.

kknechtel · February 3, 2024, 8:22am

I was thinking setup.py would often contain that information as e.g. a package_dir argument to setup; but of course it could equally well be reflected in pyproject.toml via [tool.setuptools.package-dir] if one is using Setuptools.

CAM-Gerlach · February 3, 2024, 6:25pm

Yeah, I considered suggesting attempting to parse backend-specific config options in pyproject.toml, but concluded this would only potentially help in a very few cases relative to the effort. AFAIK, many/most packages don’t have the top-level names enumerated explicitly and instead just use their backend’s auto-discovery, and this is only necessary to begin with for projects that have non-matching names, don’t publish wheels and don’t use Setuptools (if they do use Setuptools, and thus setup.py/setup.cfg is relevant, you can just do import_names = Path(f"{dist_name}.egg-info/top_level.txt").read_text().split() which is much simpler and more reliable).
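If you would rather not unpack the sdist first, something like the following should work on the tarball directly. This is only a sketch: the function name and the demo archive are invented here, and it assumes the sdist was produced by Setuptools so that an .egg-info/top_level.txt member exists.

```python
import io
import tarfile


def top_level_from_sdist(fileobj):
    """Return names from a Setuptools top_level.txt inside an sdist, or None."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as tf:
        for member in tf:
            if member.isfile() and member.name.endswith(".egg-info/top_level.txt"):
                text = tf.extractfile(member).read().decode("utf-8")
                # split() drops blank lines and the trailing newline.
                return text.split()
    return None  # no top_level.txt: probably not a Setuptools sdist


def make_demo_sdist():
    """Build a tiny in-memory .tar.gz mimicking a Setuptools sdist (made-up data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        data = b"demo\n_demo\n"
        info = tarfile.TarInfo("Demo-Dist-1.0/Demo_Dist.egg-info/top_level.txt")
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


print(top_level_from_sdist(make_demo_sdist()))
```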

kknechtel · February 3, 2024, 8:33pm

Oh, wait, there is that metadata. (I think you mean .dist-info rather than .egg-info, nowadays?) It does appear that other build backends don’t generate that top_level.txt file. Maybe it would be beneficial to mandate it? (Or else, what motivates Setuptools to create it if it isn’t necessary?)

CAM-Gerlach · February 3, 2024, 10:24pm

Yeah; as I mention in my reply above:

Sorry I was unclear; your comment I was originally responding to was referring to Setuptools setup.py dynamic build scripts in sdists, in which the Setuptools-specific location for metadata is under the .egg-info directory. As mentioned, top_level.txt is indeed under .dist-info for wheels and in installed projects, just not backend-specific sdists.
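For installed projects, a minimal sketch using importlib.metadata could look like this. The helper name is hypothetical; it prefers top_level.txt when the backend wrote one and otherwise falls back to the first path component of each recorded file, which is a heuristic rather than anything standardized.

```python
from importlib import metadata


def import_names(dist):
    """Best-effort top-level import names for an importlib.metadata Distribution."""
    text = dist.read_text("top_level.txt")  # None when the file is absent
    if text:
        return sorted(set(text.split()))
    names = set()
    for path in dist.files or []:  # files is None if no RECORD-like file exists
        parts = path.parts
        if not parts:
            continue
        first = parts[0]
        # Skip metadata dirs, bytecode caches, and scripts recorded as ../../bin/x.
        if first in ("..", "__pycache__"):
            continue
        if first.endswith((".dist-info", ".egg-info", ".data")):
            continue
        if len(parts) > 1:
            names.add(first)
        elif first.endswith(".py"):
            names.add(first[:-3])
    return sorted(names)
```

Usage would be along the lines of import_names(metadata.distribution(dist_name)), which raises PackageNotFoundError if the distribution is not installed.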

Right, though as mentioned for wheels you can determine the top-level import names from the RECORD as well as from the ZIP contents with a bit of light filtering (for .pth files and .dist-info directories, and possibly looking at .data to resolve any purelib/platlib complexities), and my comment here was specifically in the context of checking a setup.py in a sdist, which is only relevant for Setuptools ofc.
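To make the "light filtering" concrete, here is a sketch that works from the text of a wheel's RECORD file. The function name is made up, and the .data handling is an assumption about what is usually wanted: entries under purelib or platlib are treated as if they were at the archive root, and everything else under .data is ignored.

```python
import csv
import io
from pathlib import PurePosixPath


def top_level_from_record(record_text):
    """Return candidate top-level import names from the text of a RECORD file."""
    names = set()
    for row in csv.reader(io.StringIO(record_text)):
        if not row:
            continue
        parts = PurePosixPath(row[0]).parts
        if not parts:
            continue
        if (
            len(parts) > 2
            and parts[0].endswith(".data")
            and parts[1] in ("purelib", "platlib")
        ):
            # These end up in site-packages, so strip the two leading components.
            parts = parts[2:]
        elif parts[0].endswith((".data", ".dist-info")):
            continue  # scripts, headers, metadata
        if len(parts) > 1:
            names.add(parts[0])
        elif parts[0].endswith(".py"):
            names.add(parts[0][:-3])
        # Other top-level files, such as .pth files, are skipped.
    return names
```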

As mentioned, there’s a recent discussion about that

It’s a holdover from Setuptools’ legacy Egg binary distribution format:

This data is used by pkg_resources at runtime to issue a warning if an egg is added to sys.path when its contained packages may have already been imported.

(It was also once used to detect conflicts with non-egg packages at installation time, but in more recent versions, setuptools installs eggs in such a way that they always override non-egg packages, thus preventing a problem from arising.)

cameron · February 4, 2024, 9:35pm

Any list of well-known distribution/import name mismatches?

Shouldn’t this be something that someone with access to PyPI’s db could
answer with a query?

[… things which require inspecting (and therefore fetching) wheels and sdists …]

That sounds cumbersome.

Karl Knechtel:

As far as I can tell, core metadata doesn’t store that kind of information about top-level package names,

It has been
proposed
(which also includes some discussion of how one can extract that
currently, including in a scenario such as this).

I’m increasingly of the opinion that this should be queryable
(“queriable”?) for the following reason: malice.

I’m about half way through the discussion cited above and it’s all
“would this be useful?” and “where might we put this?” and “what should
be in it?” and not even an allusion to malicious packages until
here,
which is pretty short.

We’ve got an existing problem with typosquatting on project names. What
about innocuous PyPI project names which install their innocuous trite
package and also something malicious under a well known name (or close
typo)?

Having this at the top level in a queriable form lets us:

- show what a project installs, import-wise
- allow various sweeps of projects for conflicts and/or malice
- have the installers (wheel unpackers etc) validate what’s being installed against what is supposed to be being installed, and reject installs not matching their spec
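As a sketch of the last point: given the file listing of an archive and a set of declared import names, an installer could flag anything undeclared. Note the declared set is hypothetical, since no such standardized metadata field existed at the time of this discussion, and the function name is invented. A real check would also need to look inside .data/purelib and treat .pth files as suspicious, which this does not do.

```python
from pathlib import PurePosixPath


def unexpected_top_level(archive_paths, declared):
    """Return sorted top-level names found in archive_paths but not in declared."""
    found = set()
    for entry in archive_paths:
        parts = PurePosixPath(entry).parts
        if not parts or parts[0].endswith((".dist-info", ".data")):
            continue
        name = parts[0]
        if len(parts) == 1:
            if not name.endswith(".py"):
                continue  # not an importable top-level module
            name = name[:-3]
        found.add(name)
    return sorted(found - set(declared))


# Made-up listing: a project that declares only "trite" but also ships "requests".
paths = [
    "trite/__init__.py",
    "requests/__init__.py",
    "trite-1.0.dist-info/METADATA",
]
print(unexpected_top_level(paths, {"trite"}))
```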

CAM-Gerlach · February 5, 2024, 12:13am

I’m not sure I see how this presents that much of an increased threat, since in order for this secondary threat to be realized, people have to first install the package, which can already execute arbitrary code at install time (by only providing an sdist, or not only generating wheels with specific tags), and at Python invocation time via various other mechanisms. And if people or tools are actually investigating the package’s code to determine whether it is malicious, they will see any import packages it provides, regardless of name.

Just to be clear it still makes good sense to me to expose this information via standardized metadata, I’m just not sure about security as the strongest justification for it.

cameron · February 5, 2024, 12:26am

For sdists, I suppose so. But suppose someone uses pip install --only-binary thinking to avoid this (yes, I guess the maliciousness can still lurk in the installed code awaiting run time)?

Is it possible to make a malicious package X whose wheel installs X and also Y, where Y is a well known popular name like eg requests?

I think at least having some surety that the install process won’t install some name I didn’t expect (meaning a name not mentioned up front in the metadata) is a security measure.

jamestwebber · February 5, 2024, 2:15am

This seems like a pretty convoluted method. First you have to convince people to install your innocuous package–if it were easy to make a popular package we’d all be doing it.

Is it possible to install the malicious requests such that pip (or another tool) would believe you’ve installed it? Because otherwise you’re liable to be shadowed if they ever install the real one… And if they don’t they probably never execute your malicious code.
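On an existing environment, one way to spot this kind of overlap is to look for import names that more than one installed distribution claims. A small sketch, with a made-up helper name; it relies on importlib.metadata.packages_distributions(), which needs Python 3.10 or later, and namespace packages will legitimately show up in the output.

```python
from importlib import metadata


def shared_import_names(pkg_to_dists):
    """Return import names that are provided by more than one distribution."""
    return {
        name: sorted(set(dists))
        for name, dists in pkg_to_dists.items()
        if len(set(dists)) > 1
    }


if __name__ == "__main__":
    # Maps each top-level import name to the distributions that provide it.
    mapping = metadata.packages_distributions()
    for name, dists in sorted(shared_import_names(mapping).items()):
        print(name, "<-", ", ".join(dists))
```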
