Why isn't source distribution metadata trustworthy? Can we make it so?

pganssle · January 23, 2020, 5:15pm

This is because the input isn’t reliably deterministic. Consider the extreme example from Dustin’s blog post on this:

from setuptools import setup
import random

setup(
  name="paradox",
  version="0.0.1",
  description="A non-deterministic package",
  install_requires=[random.choice(["Dep1", "Dep2"])]
)

The much more common scenario is one where the dependencies are generated based on the platform that’s building from sdist, and this use case has been replaced with environment markers (that most people don’t know about):

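import sys
from setuptools import setup

install_requires = ["Dep1"]
# Decided when setup.py runs, so the result depends on the building interpreter
if sys.version_info < (3, 7):
  install_requires.append("importlib-metadata")

setup(...,
    install_requires=install_requires
)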
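
For comparison, a minimal sketch of the same condition expressed as an environment marker. The requirement strings are taken from the example above; the condition moves into the string itself, so the list is identical no matter which interpreter builds the sdist:

```python
# Static alternative to the sys.version_info check: the condition travels
# inside the requirement string as an environment marker and is evaluated
# by the installer on the target machine, not by setup.py at build time.
install_requires = [
    "Dep1",
    'importlib-metadata; python_version < "3.7"',
]

# In setup.py this list would be passed through unchanged:
#     setup(..., install_requires=install_requires)
```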

By the time it gets to setuptools, it’s just a list, and we don’t know if it was generated dynamically or not. If the dependencies are specified in setup.cfg, we know they are reliable and there’s an open issue to fix this. As others in the thread have mentioned, we can almost certainly parse setup.py with an AST and in many basic cases determine whether the dependencies are deterministic or not.
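
As a rough illustration of the AST idea, here is a sketch that only accepts `install_requires` when it is written as a literal list of strings directly inside the `setup()` call. The function name `static_install_requires` is made up for this example, and the check is deliberately conservative: anything it cannot prove to be a literal is reported as unknown.

```python
import ast


def static_install_requires(source):
    """Return install_requires if it is a literal in the setup() call.

    Returns None when the value is computed (a variable, a function call,
    etc.), meaning "could not prove this is deterministic".
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match both setup(...) and setuptools.setup(...)
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "setup":
            continue
        for keyword in node.keywords:
            if keyword.arg != "install_requires":
                continue
            try:
                value = ast.literal_eval(keyword.value)
            except (ValueError, TypeError):
                return None  # not a pure literal
            if isinstance(value, (list, tuple)) and all(isinstance(v, str) for v in value):
                return list(value)
            return None
    return None
```

With this sketch, the `random.choice` example and the `install_requires=install_requires` example both come back as `None`, while a `setup()` call with the list written inline comes back as the list.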

Most of the options for “banning dynamic metadata” are not great and have the potential to break stuff that would probably already just work in most scenarios, but if we decided the cost was worth paying, I’m curious to know if we would be stymied because there are legitimate use cases that we won’t be able to support in deterministic metadata implementations in a realistic time frame.

I’m also curious to know if this is just install_requires or if there are other places where the metadata is being set “dynamically”. The one use case I know of / have for that is that dateutil does a search-and-replace in README.rst during the build, because PyPI doesn’t support .. doctest::. It’s still deterministic, but it would be difficult to detect that it’s deterministic through heuristics.
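
To make the README case concrete, a sketch of that kind of build-time rewrite. This is not dateutil's actual code: the helper name and the replacement directive are assumptions chosen for illustration.

```python
def pypi_compatible_readme(readme_text):
    """Swap the unsupported directive for one a plain renderer accepts.

    Assumption: ".. code-block:: python" is a stand-in replacement; the
    real project may use a different directive.
    """
    return readme_text.replace(".. doctest::", ".. code-block:: python")
```

The output depends only on the input text, so the result is deterministic, but a tool looking at setup.py only sees a function call producing the long description.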

brettcannon · January 23, 2020, 7:50pm

To tack on to the nomenclature confusion, when I hear static vs dynamic my brain keeps trying to put the version issue into there thanks to e.g. setuptools_scm calculating the version “dynamically” when setup() runs (same goes for people who use open() to paste in their long description).

But I think the key thing that’s being asked is static versus “dynamically environment-dependent” (to differentiate from the “statically environment-dependent” that markers support).

And the only other things I can think of along these lines are file inclusion and maybe entry points (and this is a guess; I have no real-world examples to back this up).

pf_moore · January 23, 2020, 8:02pm

Well, in the case of requests, it actually is deterministic (in the sense of “does not depend on any external factor”), but I take your point that it’s not possible to verify that if the data is generated via setup.py.

I’ve certainly been guilty of using the terms “static” and “dynamic” sloppily. For me, the key point is “if a metadata value is specified in the sdist, and I build a wheel from that sdist, can I be sure that the metadata value in the wheel will be the same as the one from the sdist?” I don’t have a good term for that property, to be honest.

I did some experiments to verify what’s going on here, building a sdist and a wheel for requests. It looks like setuptools simply doesn’t include all the metadata in the sdist. I assume based on what you’re saying, that this is actually a deliberate decision by setuptools - if it can’t be sure the data is going to be the same as the wheel, it omits it? Although I’m not clear in that case why you feel comfortable to include the Requires-Python metadata, which can surely differ between the sdist and the wheel for exactly the same reason?

For pip’s use case, which is what triggered @chrahunt’s original post here, it seems like we need three things:

1. The implementation of this feature request that you mentioned above.
2. Some way for pip to know whether the lack of Requires-Dist (and Requires-Python, and maybe others) in the sdist metadata means “there are no dependencies” or “you need to call the build backend to get this data”. At the moment, both of these are signalled by the metadata not being present in the sdist.
3. An assurance that any metadata values that are present in the sdist will be the same in the wheel built from that sdist. That assurance could (at least as far as I’m concerned) simply be in the form of a statement that “consumers are allowed to assume that if a metadata item is in the sdist, then it will be the same in the wheel”, making projects that violate this rule unsupported. Then the problem boils down to how the user and the build tool agree what can be included in the sdist.

Also, other tools that generate sdists need to follow the same rules, so they need to be written up as an interop standard - but that’s a bit of bureaucracy that can be done once we have a consensus.

jwodder · January 23, 2020, 8:24pm

setuptools actually stores sdists’ Requires-Dist metadata in $PROJECT.egg-info/requires.txt. I suspect this is due to some historical reason; Requires-Dist was only added to the metadata standard by PEP 345 (corresponding to Python 2.5), which (I believe) postdates setuptools’ support for install_requires. Between those two points in time, setuptools couldn’t store requirements in PKG-INFO as it wasn’t supported there, so it used its own metadata files, and they apparently never bothered to change it afterwards. On the other hand, support for Requires-Python, if I remember correctly, was added after the corresponding PEP came out.
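
For anyone wanting to look at that file, a small sketch of reading it. It assumes the layout setuptools writes as far as I know it: unconditional requirements first, then bracketed headers such as `[extra]` or `[:marker]` introducing conditional groups. The function name is made up for this example.

```python
def parse_requires_txt(text):
    """Group the lines of an egg-info requires.txt by section header.

    Unconditional requirements are stored under the empty-string key.
    Assumption: headers look like "[extra]", "[:marker]" or "[extra:marker]".
    """
    sections = {}
    current = ""
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, [])
            continue
        sections.setdefault(current, []).append(line)
    return sections
```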

pganssle · January 23, 2020, 8:54pm

TBH it’s probably true of everything in the metadata file, it’s just that I’ve never heard of anyone setting platform-dependent values for anything other than requirements, so from a practical point of view it’s just something we don’t have to worry about.

I think @jwodder is likely correct as to why Requires-Dist is treated differently, though that may be just a stroke of good fortune since it would be fairly common for the Requires-Dist information in an sdist to be inaccurate for a given platform.

I think this is one of the options for banning this “dynamic” metadata (I’ll keep using this term until we come up with something better, I guess), but it’s not really going to prevent people from continuing to generate “broken” metadata in this way. People will open tickets in pip or whatever project saying, “Such and such project has the wrong dependencies according to X command”, and then you’ll close the ticket with, “X should be doing the right thing”, and maybe X will hear about it and complain, “How the hell was I supposed to know this?” I doubt it’ll move the needle on the status quo.

I think we can come up with a transition plan to move people away from “bad metadata” and on to “good metadata”, but I think maybe it’ll take a decent number of developer-hours and might have to encompass more than just the Requires-Dist part. Maybe we can say, “OK, we’ll drop support for the legacy system even before we get our act together and start moving people away from it, since the things we’re dropping support for are all new features blocked on this anyway”, but I think there are more than a few things out there in packaging especially where the old way is deprecated and the new way is not ready yet. It doesn’t help our reputation to add another one of those things.

chrahunt · January 24, 2020, 12:51am

I think the required PKG-INFO statement in PEP 517 may be sufficient:

A .tar.gz source distribution (sdist) contains a single top-level directory called {name}-{version} (e.g. foo-1.0), containing the source files of the package. This directory must also contain the pyproject.toml from the build directory, and a PKG-INFO file containing metadata in the format described in PEP 345. Although historically zip files have also been used as sdists, this hook should produce a gzipped tarball. This is already the more common format for sdists, and having a consistent format makes for simpler tooling.
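
A sketch of checking an archive against the layout quoted above. The function names are invented for this example; the layout check works on a list of member names so it can be exercised without a real archive, and `sdist_member_names` shows one way to get that list with the standard `tarfile` module.

```python
import tarfile


def sdist_member_names(path):
    """List the member paths of a gzipped tarball."""
    with tarfile.open(path, "r:gz") as archive:
        return archive.getnames()


def sdist_layout_problems(member_names, name, version):
    """Return human-readable problems; an empty list means the layout matches."""
    root = f"{name}-{version}"
    problems = []
    # Everything must live under the single {name}-{version} directory
    top_level = {member.split("/", 1)[0] for member in member_names if member}
    if top_level != {root}:
        problems.append(f"expected a single top-level directory {root!r}, found {sorted(top_level)}")
    # The quoted text requires both of these files in that directory
    for required in ("PKG-INFO", "pyproject.toml"):
        if f"{root}/{required}" not in member_names:
            problems.append(f"missing {root}/{required}")
    return problems
```

Note this only confirms that PKG-INFO exists; it says nothing about whether the values in it will match the wheel.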

pradyunsg · January 24, 2020, 12:56pm

non-deterministic is what I’ve used for things like that.

Overall, I think adding a field to indicate the same is an approach that makes a lot of sense.
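
To illustrate what such a field could look like from a consumer's side, here is a sketch. The field name `Non-Deterministic` is purely hypothetical and invented for this example; nothing in this thread defines it. The point is only that an explicit marker lets "no dependencies" and "ask the build backend" be told apart.

```python
from email.parser import HeaderParser

# Hypothetical field name, made up for this sketch.
MARKER_FIELD = "Non-Deterministic"


def requires_dist_from_sdist(pkg_info_text):
    """Return the requirements if PKG-INFO can be trusted for them.

    Returns a list (possibly empty, meaning "no dependencies") when
    Requires-Dist is not flagged, or None when the consumer has to call
    the build backend to find out.
    """
    message = HeaderParser().parsestr(pkg_info_text)
    flagged = {value.strip().lower() for value in message.get_all(MARKER_FIELD, [])}
    if "requires-dist" in flagged:
        return None
    return message.get_all("Requires-Dist", [])
```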

brettcannon · January 24, 2020, 6:31pm

Works for me as that can encompass externally-influenced-at-build-time.

Probably not as long as build tools support executable code for gathering metadata which can’t really be controlled for.

Could pip do a comparison after a wheel build and raise a warning stating that the metadata differs and you should contact the maintainers to make the metadata consistent?
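
A sketch of what that comparison might look like, using only the standard library to parse the two metadata files. This is not how pip works today; the function name and the default field list are assumptions for illustration. Fields absent from the sdist are skipped, since in that case the sdist made no claim.

```python
from email.parser import HeaderParser


def metadata_mismatches(sdist_pkg_info, wheel_metadata,
                        fields=("Requires-Dist", "Requires-Python")):
    """Map each differing field to (sdist values, wheel values)."""
    sdist = HeaderParser().parsestr(sdist_pkg_info)
    wheel = HeaderParser().parsestr(wheel_metadata)
    mismatches = {}
    for field in fields:
        sdist_values = sdist.get_all(field)
        if sdist_values is None:
            continue  # nothing promised by the sdist for this field
        wheel_values = wheel.get_all(field) or []
        if sorted(sdist_values) != sorted(wheel_values):
            mismatches[field] = (sorted(sdist_values), sorted(wheel_values))
    return mismatches
```

A tool could turn a non-empty result into the warning described above.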

I agree that having the end-to-end solution in place before flags are raised by tools to say something is out of compliance is a good thing.

uranusjr · July 8, 2020, 8:29am

I drafted a PEP to standardise the file name of the sdist, which carries the most useful information (distribution name and version), in Draft PEP: File name of a Source Distribution. This information is highly unlikely to change during the build process, since existing tools (pip) already enforce this consistency, and any package not following it should already not be working today.
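
As a rough sketch of what a consumer can read from the file name alone, assuming the `{name}-{version}.tar.gz` shape and that the version itself contains no hyphen (both are assumptions of this example, not quotes from the draft):

```python
def split_sdist_filename(filename):
    """Split "{name}-{version}.tar.gz" into (name, version)."""
    suffix = ".tar.gz"
    if not filename.endswith(suffix):
        raise ValueError(f"not a .tar.gz sdist: {filename!r}")
    stem = filename[: -len(suffix)]
    # The name may contain hyphens, so split on the last one only
    name, separator, version = stem.rpartition("-")
    if not separator or not name or not version:
        raise ValueError(f"cannot find name and version in {filename!r}")
    return name, version
```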