Supporting sdists and source trees in PEP 665

brettcannon · November 11, 2021, 6:54pm

There’s no need to worry about that as people already have their own lock file format.

There’s not much we can do about that.

Any future PEP may change the requirements of PEP 665, so that door is always open.

pf_moore · November 12, 2021, 9:18am

One question I have over this approach is whether we have any feel for what proportion of potential PEP 665 users will find an implementation that only handles wheels sufficient for them. There’s a non-trivial number of pure-python packages that only ship sdists, and those packages wouldn’t be usable in PEP 665 as it stands. (Those packages typically don’t suffer from any of the issues being discussed here, but will end up being just as unsupported as a complex ML package with awkward build dependencies).

I don’t want to end up approving a PEP that is too limited to be actually useful…

(Having said this, I do support the current approach - as an installer maintainer, I’m very aware of the amount of additional complexity that sdist support would introduce, so keeping it independent and optional seems like the right choice).

sbidoul · November 12, 2021, 11:12am

This is more or less what I had in mind.

Are you open to exploring that approach further? I think the main thing to investigate next is how to represent sdist and VCS references - and possibly local directory references. I’d start from PEP 610 for inspiration.

brettcannon · November 12, 2021, 6:39pm

Yep, I’m open to it. A couple questions I would want to see answered are:

1. How does an sdist specify what the top of its build-time dependencies are? My current assumption is a build-requires key for the file entry.
2. How does one identify an sdist from a wheel?
3. Are sdists that don’t follow PEP 517 (and thus don’t specify their build dependencies) even allowed? If the answer is “no”, then the answer to question 2 (assuming my suggestion to question 1) becomes, “if build-requires is defined, it’s an sdist”.
4. How does one identify a source tree and are they treated any differently than sdists beyond how they are downloaded?
5. What do you do about runtime requirements for sdists? Do the sdists also have to follow core metadata 2.2 and not have Requires-Dist set to be dynamic so their runtime requirements MUST be covered by the lock file as well? Do they have to support PEP 621?
6. What do you do about runtime requirements for source trees since you won’t have the possibility of a PKG-INFO following core metadata 2.2? If you say, “the locker will have to build it”, then you already have your wheel, so why can’t you use the wheel instead of the source tree? Does the source tree have to use PEP 621?

I am also assuming that the PEP will say lockers and installers MAY support sdists, but it wouldn’t be a SHOULD/MUST recommendation. I would also be very tempted to say installers MUST make sdists an opt-in feature due to security concerns.
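To make the build-requires suggestion and the opt-in idea concrete, here is a minimal Python sketch. The entry layout (`url`, `hash`, `build-requires`), the example URLs, and the function names are all hypothetical; nothing here is defined by PEP 665.

```python
# Hypothetical lock file entries; the field names are assumptions made for
# illustration and are not taken from PEP 665.
WHEEL_ENTRY = {
    "url": "https://example.com/spam-1.0-py3-none-any.whl",
    "hash": "sha256=<elided>",
}
SDIST_ENTRY = {
    "url": "https://example.com/spam-1.0.tar.gz",
    "hash": "sha256=<elided>",
    "build-requires": ["setuptools", "wheel"],
}


def is_sdist(entry: dict) -> bool:
    """Apply the rule 'if build-requires is defined, it's an sdist'."""
    return "build-requires" in entry


def installable(entry: dict, allow_sdists: bool = False) -> bool:
    """Wheels always qualify; sdists only when the user has opted in."""
    return allow_sdists or not is_sdist(entry)
```

Under this sketch an installer that never opts in simply skips every sdist entry, which matches the MAY-support framing above.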

Ironically those packages are the easiest to build a wheel for locally and to use, as well as to end up with a reproducible build to get the same hash on any platform. In those instances you could have an out-of-band step to build and cache your wheels and then have your installer do its thing with the lock file while checking your local wheel cache, all while still having a secure hash.

brettcannon · November 12, 2021, 7:08pm

According to https://pythonwheels.com/, 342/360 (95%) of the top packages on PyPI have at least one wheel. A quick glance at the outliers suggests three of them may be difficult to compile (pyspark, thrift, and grpc-google-iam-v1), but the rest appear to be pure Python. For those that are still maintained we can probably work with the maintainers to help them build the wheel. For the dead/done projects, setting up a mirror of just pure Python wheels using the simple API should be doable.

Honestly, I would love it if this became the impetus to get a build service for PyPI going.

sbidoul · November 13, 2021, 11:20am

A build-requires key is what I’m assuming too. It should specify all the build dependencies though, not only the “top”. So the installer can create a build environment with all dependencies pre-installed and error out if the build requires additional dependencies that were not locked in build-requires.
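As a rough illustration of the "error out" behaviour, here is a Python sketch of the check an installer could run before building. The function names and the `locked` mapping layout are hypothetical, and the comparison only looks at project names (normalized as in PEP 503), not at version specifiers.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def unlocked_build_deps(requested: list, locked: dict) -> list:
    """Return the requested build dependencies missing from build-requires.

    `requested` holds bare project names reported by the build backend;
    `locked` maps project names to pinned versions (a hypothetical layout).
    """
    locked_names = {normalize(name) for name in locked}
    return sorted(
        name for name in requested if normalize(name) not in locked_names
    )


def check_build_env(requested: list, locked: dict) -> None:
    """Error out when the build asks for something that was not locked."""
    missing = unlocked_build_deps(requested, locked)
    if missing:
        raise RuntimeError(
            f"build dependencies not locked: {', '.join(missing)}"
        )
```

For example, with `setuptools` and `wheel` locked, a backend that additionally asks for `cython` would make `check_build_env` raise instead of silently fetching an unpinned package.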

I’m not sure I understand this question. I don’t think anything in particular has to be changed in PEP 665 for that. In the other thread I was simply suggesting adding marker and tag fields to the package entries to let installers select applicable entries based on well-defined fields instead of the file name (which, as per other comments there, might not be available if the URL is optional). That should be sufficient.

I see no reason to disallow them. Actually, in a way, all python projects follow PEP 517: “If the pyproject.toml file is absent, or the build-backend key is missing, the source tree is not using this specification, and tools should revert to the legacy behaviour of running setup.py (either directly, or by implicitly invoking the setuptools.build_meta:__legacy__ backend).”
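The fallback rule quoted above can be sketched in a few lines of Python. The function name is hypothetical, the input is assumed to be an already-parsed pyproject.toml (or None when the file is absent), and the default `["setuptools", "wheel"]` requirement list is an assumption of this sketch rather than something stated in this thread.

```python
from typing import Optional

LEGACY_BACKEND = "setuptools.build_meta:__legacy__"
# Assumed default build requirements for projects without a build-system table.
LEGACY_REQUIRES = ["setuptools", "wheel"]


def resolve_build_system(pyproject: Optional[dict]) -> dict:
    """Return the backend and static build requirements for a source tree.

    `pyproject` is the parsed pyproject.toml, or None if the file is absent.
    """
    build_system = (pyproject or {}).get("build-system", {})
    return {
        # A missing build-backend key means the legacy setup.py behaviour.
        "build-backend": build_system.get("build-backend", LEGACY_BACKEND),
        "requires": list(build_system.get("requires", LEGACY_REQUIRES)),
    }
```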

I’m thinking of reusing the direct_url.json data structure from PEP 610 (with possible adaptations for TOML ergonomics). This covers local directories - editable or not (relative paths to be sorted out) - and VCS references (for which the lock file must require an immutable hash in the commit_id field).
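For illustration, here is what such references could look like as Python dictionaries, using the PEP 610 keys (`url`, `vcs_info`, `archive_info`, `dir_info`), plus a check that a reference is pinned to immutable content. The URLs, the commit id, and the function name are made up, and requiring a full 40-character Git hash is an assumption of this sketch.

```python
import re

# A full 40-character hexadecimal commit id. Assumption: abbreviated hashes,
# branch names, and tags are rejected because they are not immutable.
FULL_GIT_SHA = re.compile(r"[0-9a-f]{40}")

VCS_REF = {
    "url": "https://example.com/team/project.git",
    "vcs_info": {
        "vcs": "git",
        "requested_revision": "my-fix-branch",
        "commit_id": "0123456789abcdef0123456789abcdef01234567",
    },
}
DIR_REF = {"url": "file:///home/user/project", "dir_info": {"editable": True}}


def has_immutable_pin(reference: dict) -> bool:
    """Decide whether a PEP 610 style reference pins immutable content."""
    if "vcs_info" in reference:
        commit_id = reference["vcs_info"].get("commit_id", "")
        return FULL_GIT_SHA.fullmatch(commit_id) is not None
    if "archive_info" in reference:
        return bool(reference["archive_info"].get("hash"))
    # Local directories carry no hash; how to treat them is left open here.
    return False
```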

The locker has to run the metadata preparation steps. This means resolving the backend build-system.requires and dynamic build requirements (get_requires_for_build_wheel). The result of that goes to the build-requires section of the package._name_._version_ entry in the lock file. The locker has to run prepare_metadata_for_build_wheel and run the normal resolution process to add these to the runtime requirements in the lock file as it would do for a wheel.
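A minimal Python sketch of how a locker might combine those results follows. Everything here is a stand-in: `static_requires` represents build-system.requires, the two callables represent the outcome of get_requires_for_build_wheel and of reading Requires-Dist from the metadata produced by prepare_metadata_for_build_wheel, and the returned layout is hypothetical. Actually invoking the backend and pinning versions through resolution are left out.

```python
from typing import Callable, Iterable


def lock_source_entry(
    static_requires: Iterable[str],
    get_requires_for_build_wheel: Callable[[], Iterable[str]],
    read_requires_dist: Callable[[], Iterable[str]],
) -> dict:
    """Assemble the unresolved requirement lists for one source package."""
    build_requires = set(static_requires) | set(get_requires_for_build_wheel())
    return {
        # Recorded under build-requires in the package entry.
        "build-requires": sorted(build_requires),
        # Fed into the normal resolution process, as for a wheel.
        "requires": sorted(set(read_requires_dist())),
    }
```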

So I don’t think we need to place any additional constraints on source trees and distributions, beyond what is allowed today by pip or build. PKG-INFO 2.2 and PEP 621 can benefit the locker for better performance but they are not mandatory.

Not building but preparing metadata, yes. This is, normally, cheaper than building a wheel.

Agreed

Agreed, for installers that are concerned about running code during the installation steps. But as long as the source artifacts have a hash or an immutable VCS commit hash, build dependencies are locked in the same way as runtime dependencies, so from that angle the security should be on a par with that of runtime requirements.

Let me elaborate on the use case for VCS references. Assume you have made an upstream PR to a project you depend on. To use it in your application, you can add a VCS reference to your branch in your top level requirements. The lock file will preserve the commit hash. Other team members who have read-only access to the VCS repo can readily use it. Now if you want to lock it as a wheel, you need to come up with a private version number, make an additional patch to use that version number, build the wheel, publish it to a private index - that you have to have and maintain, etc. This is clearly much more overhead and a heavier process.

fungi · November 13, 2021, 1:46pm

[…]

Not building but preparing metadata, yes. This is, normally,
cheaper than building a wheel.
[…]

Not entirely true. There are plenty of situations I run into where,
for whatever reason, pip is unable to build newer versions of some
dependency from sdist, and keeps trying them in reverse order until
it finds one for which a wheel can be built successfully. Metadata
alone is insufficient to determine which versions of a dependency
will be viable on a given platform, at least in practice, because
there are countless factors (e.g. external linked library versions,
compiler versions, system settings, and so on) that may influence
which actual version of a built wheel you end up with.

uranusjr · November 13, 2021, 2:15pm

Wheel metadata is designed to express compatibility without requiring the entire wheel to be built. The reason that pip sometimes needs to build entire wheels is that some packages do not construct their build system efficiently enough to achieve that, but instead do things the other way around and make metadata generation depend on binaries. This is unfortunately a people problem, and the only fix is to encourage the projects to optimise their build system. No amount of standards and tooling from us can fix it.

fungi · November 13, 2021, 4:07pm

I think it’s fine to say that such support is out of scope for
locker base requirements, but it’s not a good idea to pretend the
problem doesn’t actually exist. Working on projects with transitive
dependency sets numbering in the hundreds of different packages, I
see it with great frequency. It’s in some ways an emergent behavior
in the implementation of pip’s dep solver, in that it will silently
try older and older versions of an sdist if it can’t manage to build
a wheel locally, so the “lock files” that projects I work in are creating
today do take it into account.

uranusjr · November 13, 2021, 8:23pm

I don’t understand your point, to be honest. Who pretended the problem doesn’t actually exist? The point I am trying to make is we can’t solve that problem with standards. What do you suggest we do? Surely you’re not suggesting we should go fix all those problematic build systems in the world for free.

fungi · November 13, 2021, 10:45pm

My point is that we should not assume lockfile generators,
particularly the early generations of them (for example, the ones I
use today in production… pip install && pip freeze > my.lock) will
be able to operate entirely on the basis of existing wheel metadata
published to PyPI or naively extracted from an sdist in some way
short of building a wheel. There are simply far too many packages
still published only as sdists, and in many cases haven’t seen
updates in a decade (yes, that’s a problem to be solved as well, no
doubt, the recent deprecations in SetupTools are shaking out a bunch
of them). In these cases, “just” generating metadata while trying
not to fully engage the package’s build process doesn’t provide an
accurate picture.

We can certainly have a standard which blindly assumes wheels build
everywhere, but that’s not our present reality. I’m fine with that,
I’ll just ignore its existence for another decade and continue using
pip freeze until the dependencies my projects have catch up with the
times, I guess.

stewartmiles · November 15, 2021, 6:47pm

Apologies for the slow response, I can see this has generated quite an interesting thread of conversation though.

This sounds reasonable, though it doesn’t play nice with pip build isolation turned off and in-tree builds turned on. So having the ability to resolve the entire graph - including build deps - would still be desirable.

Yeah I get it. But as you can see from this thread it’s clear there is the development use case where folks have packages installed from source as editable installs that isn’t being covered at the moment. The workflow for the current proposal would require all packages to be built as wheels and uploaded to a repository server (e.g. a private repo), then deployment occurs from that repository server. This has knock-on effects for continuous integration and developer workflow complexity. If you want to push the current proposal as is, that seems ok, but I do worry that lockers aren’t going to implement it unless it supports the use cases covered by their current lock file formats.

steve.dower · November 15, 2021, 8:23pm

FWIW, PEP 665 seems totally complete to me as a “desired end-state description” of the environment.

In that context, sdists don’t matter. They may need to be built to produce that desired end-state, but we already assume that a given name+version wheel is the same as the build output for the matching sdist, and so we can describe the end state with the name+version and leave it to the installer to do a build if necessary.

Editable packages are less clear, but I’d still be inclined to scope them out. They are usually intended for development, and desired state configurations are usually intended for deployment (either multi-deployments or reproducible deployments). They’re already a special case for live-linking (rather than copying) files that are not in an environment to make them look/act as if they are - that seems

Saying that these cases aren’t part of PEP 665 isn’t saying that they’re not important. It’s just saying that they’re not part of this solution, and if your problem is “I need reproducible builds” or “I need to set up a dev environment” then you need a different solution.

pf_moore · November 15, 2021, 9:20pm

Be careful what you assume here. At this point, it’s not certain that pip will support turning off build isolation while installing a lockfile (which will be a new pip sync command, not pip install). This is because it’s not clear how installing a lockfile into a non-empty environment should behave (in terms of conflicts between what’s installed and what the lockfile specifies - remember that installing a lockfile doesn’t require a full resolve). So it’s quite plausible that we’ll simply require build isolation for pip sync where sdists are involved.

stewartmiles · November 15, 2021, 9:52pm

Yeah, I get it. It’s just potentially far slower when installing source distributions with build isolation as the same build packages can be installed over and again for each source package rather than just once in the venv. So while I agree it would be easier to force build isolation on, the speed difference to bootstrap an environment - in my current experience using a project with about 15 source packages that share build deps it can be an order of magnitude slower - may again lead to users using different solutions.
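A back-of-the-envelope Python sketch of why that happens: it counts how many build dependency installations are needed with and without build isolation. The function name and the example package set are hypothetical, and the count ignores download caching and actual install times, so it only illustrates the shape of the problem.

```python
def build_dep_installs(packages: dict, isolated: bool) -> int:
    """Count build dependency installations needed to build every package.

    `packages` maps each source package name to the set of its build deps.
    """
    if isolated:
        # Each package gets a fresh environment, so shared deps are reinstalled.
        return sum(len(deps) for deps in packages.values())
    # One shared environment: each distinct build dependency is installed once.
    return len(set().union(*packages.values()))


# 15 source packages that all share the same three build dependencies.
EXAMPLE = {f"pkg{i}": {"setuptools", "wheel", "cython"} for i in range(15)}
```

With `EXAMPLE`, the isolated count is 45 installations against 3 for a single shared environment.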

stewartmiles · November 15, 2021, 10:09pm

I really disagree with this statement. A developer ideally needs an environment that is as close as possible to being the same between development and deployment. If my development environment has a set of packages A while I’m iterating on a new package or set of packages, and then I deploy with a lock file that points at packages B, it’s a lovely chance for things to go wrong in unexpected ways. It’s possible for a developer to maintain two views onto the same set of locked packages, one for development and another for deployment, but then we’ve got another set of configuration that can diverge. The workflow we have in place at our company is to use editable packages + pipenv, with a lock file to describe the development environment. We then take the same lock file, build the editable packages as wheels, inject references to them into the lock file, and sync that for deployment. This allows us to have one configuration file that describes everything we need for development and deployment. Furthermore, since we’re using a graph, it’s possible to sync subsets of the tree depending upon the deployment we need to use, e.g. when sync’ing packages in an OCI container build.

As many folks have mentioned on this thread, PEP 665 looks like it moves in the right direction but still leaves a big gap for development scenarios, which will likely result in lockers still doing their own thing to support those use cases.
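The "inject references" step can be sketched roughly as follows in Python. The lock structure (`packages`, `editable`, `path`, `url`, `hash`) and the function name are hypothetical and do not reflect pipenv's actual file format; the sketch only shows the idea of swapping editable entries for built wheels while leaving the development lock untouched.

```python
import copy


def inject_wheels(lock: dict, built_wheels: dict) -> dict:
    """Return a copy of `lock` with editable entries replaced by wheels.

    `built_wheels` maps a package name to the entry describing its wheel.
    """
    result = copy.deepcopy(lock)
    for name, entry in result["packages"].items():
        if entry.get("editable"):
            if name not in built_wheels:
                raise KeyError(f"no wheel was built for editable package {name!r}")
            # Replacing the value of an existing key is safe while iterating.
            result["packages"][name] = dict(built_wheels[name])
    return result
```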

pf_moore · November 15, 2021, 10:11pm

I’m fine with that. One thing I’d love to see is more choice in installers for PEP 665 lockfiles. One reason I want to keep the requirements for installers as minimal as possible is precisely so that people don’t have to reinvent the whole of pip to write a lockfile installer.

A lockfile installer that uses a single build environment for everything would be cool. Managing that environment might be complicated, but by using a lockfile as input, the installer can focus on that problem without getting sidetracked by needing to implement a resolver, for example.

steve.dower · November 15, 2021, 10:23pm

Almost by definition, if you’ve got a package in editable mode, it isn’t locked. So I think to make use of a PEP 665 with no support for editable packages, you’d switch your workflow to have a deployment set of packages that your developers can reproducibly install, and then replace the package you want to be editing as a second step*. There’s no need for it to be integrated into the lockfile.

(* Note that because this is outside the specification, tooling can make this easier for you however it likes. e.g. tools might take additional configuration to override locked packages with your editable one, or you may install the editable package first and the tools handle it by assuming that whatever is there meets the requirements of the lockfile already. But the spec gets to remain silent and leave it to tooling and workflows.)

stewartmiles · November 15, 2021, 10:29pm

If a package is editable it’s conceptually locked w.r.t. the source tree. The developer has full control over the lock file (i.e. the venv config) and any sources in the source tree, hence can trust editable packages they’re maintaining. It’s kinda like saying that building from source pinned to a commit isn’t locked when it’s clearly reproducible if your build tools are deterministic and are pinned to a specific revision.

I feel like I’ve made my point on this. If you’re providing a solution to sync a set of packages into a venv without a developer workflow in mind (e.g. pulling a web server package that just requires some config / data files to use it), PEP 665 seems fine.

steve.dower · November 15, 2021, 10:57pm

ACK. Going to comment on some of your comments anyway, for the benefit of readers rather than to extend an argument. But I expect I’ll have made my point after this post, too.

Hard disagree. “Editable” literally means it’s only on your local machine. Editable modules are not transferable between devices (including unrelated contexts on a single device), which is the primary (sole?) reason for packages, and certainly the only reason you’d want to lock anything.

True (with some mental/linguistic gymnastics, but true). However, since we get to define the point at which we lock, we can easily say that the lock (name+version+hash) is calculated at the wheel, without impacting reproducibility. All that really changes is that the steps for installers increase from “find a wheel file matching name+version+hash and extract it” to “… or an sdist that we can build and verify that its result is a wheel file matching name+version+hash and extract it” (since we have already decided that wheels are the standard installable element, and so sources go via wheels to be installed).
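Here is a small Python sketch of that extended installer step. The function names, the `algorithm=hexdigest` hash format, and the two zero-argument callables standing in for "download the wheel" and "build the sdist" are all hypothetical. Note that the fallback only succeeds if the local build reproduces the locked wheel byte for byte.

```python
import hashlib


def matches_lock(wheel_bytes: bytes, locked_hash: str) -> bool:
    """Check wheel content against a locked 'algorithm=hexdigest' string."""
    algorithm, _, expected = locked_hash.partition("=")
    digest = hashlib.new(algorithm, wheel_bytes).hexdigest()
    return digest == expected


def obtain_wheel(download_wheel, build_from_sdist, locked_hash: str) -> bytes:
    """Try the published wheel first, then fall back to a local sdist build.

    Both callables take no arguments and return wheel bytes, or None when
    no artifact is available from that source.
    """
    for source in (download_wheel, build_from_sdist):
        wheel = source()
        if wheel is not None and matches_lock(wheel, locked_hash):
            return wheel
    raise LookupError("no wheel matching the locked hash")
```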

Again, it leaves installers with plenty of latitude to innovate and handle sources and editables however their users like. If we put them in the initial PEP, that innovation cannot happen, because it’s been dictated ahead of time. I also expect installers to have a range of options for weakening any reproducibility guarantees, such as ignoring hashes for built sdists or preferring already installed packages by name without verifying version or hash.

We shouldn’t have to dictate all of these behaviours ahead of time, provided we also don’t unreasonably limit tools ahead of time. (Virtually all of my arguments on this PEP have been about loosening these kinds of limits.)