Supporting sdists and source trees in PEP 665

sbidoul · November 5, 2021, 10:23am

This is definitely in scope for my own pip-deepfreeze.

However an important use case for it is support for VCS requirements.

I understand there are security implications with supporting sdists and VCS references, although in my understanding the implications come from the build systems rather than the source distribution itself. Since there would be hashes for the sdist and an immutable commit hash for VCS, the source would be vetted. But the end result could vary depending on build options, environmental conditions, and build backend dependencies that may or may not be pinned.

Nevertheless I think that supporting such use cases is important, as is the question of pinning build dependencies.

Let me also take the opportunity to thank you for your efforts on this topic!

uranusjr · November 5, 2021, 11:13am

One way I wish to tackle source trees (both sdist and VCS) is via reproducible builds. The reason PEP 665 does not cover these is because they can generate incompatible binaries, but the deeper problem is that we don’t currently have a way to identify when two source builds actually generated the same result because the wheel’s hash depends on a lot of variables including many that don’t (normally) affect reproducibility. If we can somehow figure out how to do reproducible builds, the lock file can simply pin the sdist to the expected build artifact’s hash, and reject the source tree if the built artifact does not match.
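
As an illustration of the check described above, here is a minimal sketch that compares a locally built wheel against a hash recorded in the lock file. It assumes the lock file stores a hex SHA-256 digest for the expected wheel; the function names are hypothetical and nothing here is defined by PEP 665.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_built_wheel(wheel_path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the built wheel does not match the locked hash.

    Assumption: the build is byte-for-byte reproducible, which is exactly
    the property the post says is not generally available today.
    """
    actual = sha256_of_file(wheel_path)
    if actual != expected_sha256:
        raise ValueError(
            f"{wheel_path.name}: built wheel hash {actual} "
            f"does not match locked hash {expected_sha256}"
        )
```

An installer following this idea would build the pinned sdist, call verify_built_wheel on the result, and refuse to install on a mismatch.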

sbidoul · November 5, 2021, 11:32am

Reproducible builds would indeed be great, but the road to achieving them might be long and tortuous. An intermediate step could be to record the source artifact hash (or VCS commit hash), pin build dependencies, and let the user take conscious responsibility for making sure their build conditions are safe. That would be useful for many users.

stewartmiles · November 5, 2021, 11:08pm

Hey Brett,

As I mentioned on Interested in supporting PEP 665? · Issue #4825 · pypa/pipenv · GitHub it would be great to add support for build time requirements specified by pyproject.toml to support editable package installation:

Build-time requirement for editable installs in pyproject.toml should also be locked by the package locker / installer.
Build-time requirements need to be installed before editable package installs when using pip’s PIP_NO_BUILD_ISOLATION=0 / --no-build-isolation option or when a build-time requirement is itself another editable package installation in the same virtual environment.

Also, if I’m following the spec correctly, package._name_._version_.url will not support relative paths (i.e. the file:// scheme only accepts absolute paths AFAIK). It would be neat to support editable paths that are relative to the lock file. For example, one could have a virtual environment with:

lock.toml
my_editable_package/...
my_other_editable_package/...

where lock.toml references ./my_editable_package and ./my_other_editable_package as editable installs using a filename field and an editable flag, similar to the Pipfile.lock format. Of course, editable packages don’t need hashes etc., since they’re installed from a trusted location in the developer’s source repository.

sbidoul · November 5, 2021, 11:19pm

Can’t the locker also pin dynamic build dependencies (i.e. what get_requires_for_build_wheel returns) ? Is there something else ?

I don’t follow this. If the source tree is pinned can’t we assume it will have the same static and dynamic build requirements in a subsequent build (at least when building on the same platform) ? In which case the build requirements will be available in the pinned build dependencies and the installer can blindly pre-install them in the build environment, trusting that the build requirements have been correctly resolved by the locker.

brettcannon · November 5, 2021, 11:19pm

The tricky bit for this is having to delineate what is a build dependency and what is a runtime dependency. Lockers would need to keep them separated so installers know what they need to resolve against. It would also require builders to understand this PEP in order to use the build dependencies when they do their work.

Other than convenience of having more things in a single file, is there any reason you couldn’t have a separate lock file for your build dependencies? E.g. you could have a requirements.pylock.toml and a build-requirements.pylock.toml for the two separate purposes.

The PEP was specifically tweaked from its first draft to support relative file paths (hence why it only mentions file: and not file:// specifically). Or am I misunderstanding how file: works?
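
To make the file: versus file:// distinction concrete, here is a minimal sketch of how an installer might resolve such a URL against the lock file’s directory. The function name is hypothetical, POSIX paths are assumed, and PEP 665 does not prescribe this logic.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def resolve_file_url(url: str, lock_file: str) -> PurePosixPath:
    """Resolve a file: URL, treating relative paths as relative to the lock file.

    "file:./pkg" has no authority part, so its path stays relative;
    "file:///abs/pkg" has an empty authority and an absolute path.
    """
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError(f"not a file: URL: {url!r}")
    if parts.netloc not in ("", "localhost"):
        raise ValueError(f"remote hosts are not supported: {url!r}")
    path = PurePosixPath(unquote(parts.path))
    if path.is_absolute():
        return path
    # Relative path: anchor it at the directory containing the lock file.
    return PurePosixPath(lock_file).parent / path


print(resolve_file_url("file:./my_editable_package", "/repo/lock.toml"))
# /repo/my_editable_package
```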

brettcannon · November 5, 2021, 11:27pm

Not that I know of, but I don’t think PEP 517 requires that list be consistent for the same platform or even between subsequent calls.

I think that would have to be a choice of ours to say that’s an assumption being made as I don’t think anything currently restricts/requires this. If you have a build requirement that depends on your locale for some i18n reason then it could vary even on the same platform (e.g. needing some specific .po files).

As I mentioned to @stewartmiles , one possible solution is to have tools create separate lock files for sdists for their build requirements and use those lock files as inputs to build tools. Then @uranusjr gets reproducible builds as the hash value of the expected wheel that got built could be in the bigger, overall lock file.

stewartmiles · November 9, 2021, 12:07am

Yeah, build dependencies need to be resolved using the same graph as the rest of the virtual environment. One package’s build dependency (e.g. A) can potentially be a runtime / install dependency of another package, and really we want package A pinned to the same version across each context. While it’s possible for a locker to generate a single graph of build and install packages and then save out two lock files, one for installed packages and another for build-time packages, I can imagine this being pretty easy to get out of sync.

brettcannon · November 9, 2021, 12:44am

I don’t know if I agree with that actually. PEP 518 and 517 make no statement or claims that I can remember that this must be true. I’m not saying it isn’t desirable in some cases, but I don’t think it’s always necessary, which shifts how critical it is that this be in v1 of this file format. For instance, a project’s build tools may want a different version of ‘packaging’ than my project does (my guess is you’re thinking about some extension module like numpy that has a C API that you want to build against and thus also use at runtime).

Sure, if your locker isn’t doing this appropriately or you use multiple, manual steps to generate your lock files I can see things getting out of step with each other. But this is true for any situation where you don’t have a single lock file that covers all potential scenarios and we already decided that we are okay not supporting that.

Could you provide a concrete proposal so we have something to directly talk about? You would probably have to deal with:

How to denote an entry is an “unsafe” sdist?
How do you specify a build requirement for a specific sdist?
How do you deal with conflicting build requirements compared to runtime requirements and how does that affect the installer’s work?
Are we okay with installers having to support sdist building on top of what we are asking them to already do?

mdrissi · November 9, 2021, 3:15am

The ability to have relative editable paths is fairly key for me due to my current development working in a monorepo with multiple python packages that should be installed together, but we keep as separate packages for api boundaries. The inability to have editable/file installs and hashes in same file very recently ended up making me drop hashes as the lack of editable installs had often led to incorrect developer environments in my team.

For my use case I would be happy enough if only local/editable installs were permitted to be unhashed. The flexibility to mark any package as opting out of hashes would be nice flexibility, but I don’t have any use case for needing external packages unhashed. editable cannot be installed when requiring hashes · Issue #4995 · pypa/pip · GitHub is a long standing related issue of inability of pip to mix the two. When my options are secure but bad developer environment or insecure, the latter wins.

As for the solution of two files (one lock file and one requirements file with editable installs), the main challenge is: how do I make the lock? My dependencies come from my list of editable installs. My requirements.in looks like,

-e file:.
-e file:foo
-e file:libs/bar
-e file:libs/baz
-e file:foo2

If lock files can’t have editable installs then constructing lock file becomes messy to do for a monorepo and requires building my own small tool that unions requirements which feels like a tool that’ll slowly grow to deal with more packaging issues.

A pip resolve option that takes an input with relative paths but excludes the relative paths from its output would be one workaround, as it’d allow making a valid lock file for my external dependencies. A broader version of it would be if resolution could include packages that we want to exclude from the resolution output (pip-compile calls these unsafe packages, like pip/setuptools).

brettcannon · November 9, 2021, 7:19pm

But if the code is changing that much and it’s internal code then I don’t see how a lock file benefits you? My understanding of what you’re saying is you really just want a way to list things to install which requirements files already cover as well as PEP 621. But this PEP is not meant to be a general solution for listing anything you may want to install, but to install specific versions of things in a deterministic, secure fashion.

I would prefer not to support that. If you need to install things from outside of the lock file then that’s fine, but I would rather make it so that whatever is listed in the lock file is considered secure from at least a data integrity POV and not water it down to become just a generic list of things to install.

sbidoul · November 9, 2021, 10:20pm

Is it not a reasonable assumption to make ?

When, for each source package in the lock file, the locker adds a list of pinned build dependencies, the installer can pre-install these when creating the isolated build environment for each, and then run the build with a kind of --no-deps option that will refuse to build if a required build dependency is absent.
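
A minimal sketch of the “refuse to build if a required build dependency is absent” check could look like the following. It is a simplification: the backend hook is called in-process (real frontends run it in the isolated build environment), only project names are compared (version specifiers, extras and markers are ignored), and all function names are hypothetical.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement: str) -> str:
    """Extract the normalized project name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    return normalize(match.group(1))


def missing_build_requirements(static_requires, backend, pinned):
    """Return build requirements that the lock file's pinned set does not cover.

    static_requires: the build-system "requires" list from pyproject.toml.
    backend: a PEP 517 backend object; get_requires_for_build_wheel is optional.
    pinned: mapping of project name to pinned version for this source package.
    """
    dynamic = []
    hook = getattr(backend, "get_requires_for_build_wheel", None)
    if hook is not None:
        dynamic = hook(None)  # config_settings=None
    required = {requirement_name(r) for r in [*static_requires, *dynamic]}
    available = {normalize(name) for name in pinned}
    return sorted(required - available)
```

An installer would abort the build when this returns a non-empty list instead of falling back to resolving the missing requirement itself.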

Is it reasonable to say that the build dependencies must be wheels, to avoid bootstrapping issues ?

I’m not quite sure I understand how that would work. For instance where would those separate lock files be stored ? Would project authors have to create a separate lock file for each of their dependencies that is not available as a wheel ?

If the list of pinned build dependencies is scoped to each source distribution in the lock file, and the installer builds sdists in an isolated environment, there is no conflict.

IMO, it could be fair to say that supporting sdist building is optional for installers.

mdrissi · November 9, 2021, 10:50pm

But if the code is changing that much and it’s internal code then I don’t see how a lock file benefits you? My understanding of what you’re saying is you really just want a way to list things to install which requirements files already cover as well as PEP 621. But this PEP is not meant to be a general solution for listing anything you may want to install, but to install specific versions of things in a deterministic, secure fashion.

The lock file benefits me in keeping external dependencies locked. Each package in the monorepo has its own list of dependencies. I need some way to produce a pinned/hashed requirements file that I can then install to keep external dependencies reproducible. If I have 3 internal packages today, foo1, foo2, and foo3, there’s no direct way to produce a lock file of all of their dependencies without also including foo1/foo2/foo3 in the lock file. Installing things outside the lock file isn’t an option since I lack any direct way to construct the lock file. Making a toy example, let’s say I have these packages with these dependencies,

foo1 → X, Y, Z
foo2 → X, A
foo3 → Z
X → A, B
Y → None
Z → None
A → B

I would like to make a lock file of the external dependencies of foo1/foo2/foo3. Something like,

X==V1 hash
Y==V2 hash
Z==V3 hash
A==V4 hash
B==V5 hash

I don’t see any way for any of the current tools to make that lock right now. I can make a lock that also includes foo1/foo2/foo3 with pip compile but then that file is unusable by pip due to the mix of editables/locks. So no editables in lock file is fine as long as there is a way to produce a lock file where resolver uses editables.
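
The graph part of that request can be sketched with the toy example above: walk the dependency graph starting from the internal packages, then drop the internal packages from the result. This is only an illustration; a real locker also has to choose versions and hashes, and the names external_closure, REQUIRES and INTERNAL are hypothetical.

```python
def external_closure(roots, requires, internal):
    """Return every package reachable from roots, minus the internal ones."""
    seen = set()
    stack = list(roots)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        # Packages without an entry (like B) are treated as leaves.
        stack.extend(requires.get(name, ()))
    return sorted(seen - set(internal))


# The toy dependency graph from the post.
REQUIRES = {
    "foo1": ["X", "Y", "Z"],
    "foo2": ["X", "A"],
    "foo3": ["Z"],
    "X": ["A", "B"],
    "Y": [],
    "Z": [],
    "A": ["B"],
}
INTERNAL = ["foo1", "foo2", "foo3"]

print(external_closure(INTERNAL, REQUIRES, INTERNAL))
# ['A', 'B', 'X', 'Y', 'Z']
```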

It is possible to work around this by making a lock file with both hashes and editables, then making a script that removes the editables, then installing that, and then installing the editables afterwards. That’s a fairly messy solution and also means I can’t directly use normal pre-commit hooks like pip compile, as the lock file they generate needs post-processing. The other workaround is building my own tool that concatenates setup.cfg/setup.py requirements and applies pip-compile to just that. Both of these workarounds boil down to making my own packaging mini-tool, and I think most people who end up wanting editable + hashes will just give up and drop hashes.

edit: Part of the reason this issue is specific to monorepos with multiple packages is that for repository with exactly one package you can tell pip compile setup.py/pyproject.toml and it will produce a lock file of the dependencies of that package without including package itself. If you have multiple packages and you need a unified consistent environment then pip compile supports referring to each package including in a relative manner, but it’s not possible to do pip compile foo1/setup.py foo2/setup.py foo3/setup.py. So there’s a weird inconsistency here in that it’s easy to produce a lock file of dependencies for a package but harder to produce a lock of dependencies for multiple packages.

brettcannon · November 9, 2021, 11:33pm

… all correct from what’s in my head.

I think that’s fair. I will update the PEP.

Maybe? I really don’t know. Some people do crazy things in their setup.py files.

That’s between you, the tool generating that lock file, and the tool doing the building. I got enough push-back from suggesting where lock files should go in terms of directories that I am not about to make that mistake again.

Probably. As of right now the PEP has no concept of scoping per-project, and that would be required to have separate build dependencies per sdist/source tree that don’t conflict with other projects or runtime dependencies. Maybe this is that one case where it makes sense to let a lock file point to another lock file to specify the build dependencies for an sdist (although I know @stewartmiles expressed concern about files getting out of sync with each other, but maybe if there were file paths linking them that isn’t as much of a concern?).

How would that look in the file? If people want this sort of thing supported in v1 then I think we need a proposal of what it’s supposed to look like from someone pushing for this. I have an idea, but it’s ugly and so I would rather see what others propose.

Otherwise I would rather get v1 landed and make sdist/source tree support a v1.1 thing for someone else to propose/push after seeing what solutions the community comes up with.

It would have to be optional IMO, else the simple installer and security story gets diluted too much. I will fully admit this PEP helps push the “wheels are good” story as much as possible and letting in sdists waters that down.

I would also say anything that goes in about locking build requirements would also have to be optional for lockers.

I don’t think that’s inherent to monorepos if you treat the individual packages in the repo as just that; separate, individual packages (and I used to work at Google, so I have lived the monorepo life and realize it can have its own issues when you view your entire code base as a series of snapshots instead of as individual units of stuff you pull together as needed at different points in the repo’s history).

From the way I’m reading it, the issue you’re having is you’re trying to treat the monorepo as a single thing to lock against, but still developing sections of it as independent units. That just doesn’t fit with the worldview this PEP is presenting. You may need to have your own tooling to make this all work by regenerating your lock files as dependencies change at whatever project granularity you have in your monorepo.

To be honest, this PEP might simply not be a solution for you (if I’m understanding what you are specifically asking for appropriately).

mdrissi · November 9, 2021, 11:55pm

You are correct. We have multiple packages that define a single application and want one lock for all packages together. We could merge the packages into a single one, but the packages do have a notion of separate public/private interfaces. We don’t want package A depending on package B’s private interface, and separate packaging is mainly used for API structuring and not for separation of deployment. We also want to allow outside teams to be able to depend on individual packages even though most of the developers that develop on those packages directly will need to cross boundaries frequently. Some of these requirements are competing and we have debated just merging the packages into one.

I do wish there was a nicer way to handle this, but if answer is add support to tooling for my need that’s fair. Before I discovered this, I was working on adding a change upstream to pip-compile here to give me better control over resolution vs lock file (a way to exclude internal packages).

uranusjr · November 10, 2021, 3:56am

Are there things we need to change (instead of add) in PEP 665 to make supporting building from source possible? I’d much prefer we move all conversation about strict additions (in other words, changes without causing incompatibility concerns) to an entirely separate PEP and thread. It took us six years and 500+ messages to come up with PEP 660 (also noting that universal consensus was not reached even after all that), and I hope we can use the lock file’s equivalent of PEP 517 first, instead of having to wait for everything to be ready in one shot.

stewartmiles · November 10, 2021, 6:24pm

I’m primarily thinking about the case where all source packages are editable installs in the virtual environment. For example, assuming I have a service that is called via grpc and serializes data with protocol buffers and flatbuffers I’ll have the following rough set of packages in my code repository:

my_code_generator
    install_requires: [grpcio-tools, flatbuffers] # grpcio-tools for protocol buffer
    build_requires: [setuptools, packaging, cmake] # cmake to build flatc from flatbuffers
my_wire_format
    install_requires: [protobuf, types-protobuf, flatbuffers] # runtime for generated serialization code
    build_requires: [my_code_generator]
my_service
    install_requires: [my_wire_format, grpcio] # grpcio runtime to call the service with the wire format
    build_requires: [my_code_generator]

Here I need to make sure that grpcio-tools (which depends upon protobuf) matches the same protobuf package as used in the runtime, a version skew can generate code that doesn’t work with the expected runtime. In this case all packages (my_*) are editable installs, i.e they’re stored in my source repository and installed in the virtual environment with -e path_to_package.

So given the situation above we need a graph that encompasses both runtime (install) and build-time dependencies. At the moment PEP 665 has package._name_._version_.requires; this would add package._name_._version_.build_requires, where each requirement points to a package in the rest of the graph. This enables a locker to traverse the graph starting from an editable install (source package), find all build dependencies / requirements, and install them before installing the editable install, while also validating the safety of each build dependency using the same hash comparison mechanism used for any package.
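
The traversal described above can be sketched with the standard library’s graphlib. The build_requires key is the proposed (not yet existing) field; the only dependency modelled for grpcio-tools is the protobuf edge mentioned in this post, and treating runtime requirements as things to install first is an assumption of this sketch.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory view of the lock file's package graph.
PACKAGES = {
    "my_code_generator": {
        "requires": ["grpcio-tools", "flatbuffers"],
        "build_requires": ["setuptools", "packaging", "cmake"],
    },
    "my_wire_format": {
        "requires": ["protobuf", "types-protobuf", "flatbuffers"],
        "build_requires": ["my_code_generator"],
    },
    "my_service": {
        "requires": ["my_wire_format", "grpcio"],
        "build_requires": ["my_code_generator"],
    },
    "grpcio-tools": {"requires": ["protobuf"], "build_requires": []},
}


def install_order(packages):
    """Return an order in which every package comes after its dependencies.

    Both runtime and build-time edges are added to one graph, so a build
    dependency is always installed before the package that needs it.
    static_order() raises graphlib.CycleError if the graph has a cycle.
    """
    sorter = TopologicalSorter()
    for name, entry in packages.items():
        sorter.add(
            name,
            *entry.get("requires", ()),
            *entry.get("build_requires", ()),
        )
    return list(sorter.static_order())


for name in install_order(PACKAGES):
    print(name)
```

Because protobuf appears once in the graph, the code generator’s toolchain and the runtime necessarily share the same pinned protobuf entry.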

In response to your questions:

How to denote an entry is an “unsafe” sdist?
I’m not sure where the term “unsafe” came from here. Isn’t any source distribution that runs setup.py potentially unsafe since setup.py can do anything the user can do?

How do you specify a build requirement for a specific sdist?
As I mention above, package._name_._version_.build_requires.

How do you deal with conflicting build requirements compared to runtime requirements and how does that affect the installer’s work?
You don’t. This is the same problem as conflicting install dependencies: if they conflict, you propagate the error back to the user, who needs to select a different package version. If a source distribution is being installed that requires a specific build package then the user can always build a wheel and use that to break the dependency conflict.

Are we okay with installers having to support sdist building on top of what we are asking them to already do?
pipenv already deals with this, I don’t follow how this makes things more complex.

brettcannon · November 10, 2021, 11:26pm

Yes, hence the “unsafe” label. The PEP currently only allows wheels which do not execute any code during installation. So bringing in sdists breaks this safety promise.

Instead of going with the assumption that everything must match, could you have essentially an array per sdist of its build dependencies? Like [[package.__name__.__version__.sdist-build-requirements.__name__.__version__]]? That would then duplicate what the PEP already specifies for the package.__name__.__version__ table, but scoped to a specific project’s sdist. It’s a bit verbose, but it does mean you have a single lock file that encompasses everything; the locker can thus make sure e.g. grpcio-tools is consistent, while still allowing sdists to have build requirements that differ from your whole lock file’s runtime requirements.
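
To show what such nesting could mean for a locker, here is a minimal sketch that mirrors the suggested tables as nested dicts and reports where an sdist’s build pin differs from the lock file’s top-level pin. The layout, the filename values and the version numbers are all placeholders made up for illustration; none of this is part of PEP 665.

```python
# Hypothetical parsed lock file: package -> version -> list of file entries,
# with a per-entry "sdist-build-requirements" table of the same shape.
LOCK = {
    "package": {
        "packaging": {"2.0": [{"filename": "packaging-2.0-py3-none-any.whl"}]},
        "protobuf": {"1.0": [{"filename": "protobuf-1.0-py3-none-any.whl"}]},
        "my-wire-format": {
            "1.0": [
                {
                    "filename": "my-wire-format-1.0.tar.gz",
                    "sdist-build-requirements": {
                        "packaging": {"1.0": [{"filename": "packaging-1.0-py3-none-any.whl"}]},
                        "protobuf": {"1.0": [{"filename": "protobuf-1.0-py3-none-any.whl"}]},
                    },
                }
            ]
        },
    }
}


def differing_build_pins(lock):
    """List (sdist, version, dependency, build_version) tuples whose build
    pin is not one of the versions pinned at the top level of the lock file."""
    packages = lock["package"]
    differences = []
    for name, versions in packages.items():
        for version, entries in versions.items():
            for entry in entries:
                build_reqs = entry.get("sdist-build-requirements", {})
                for dep, dep_versions in build_reqs.items():
                    top_level = set(packages.get(dep, {}))
                    for dep_version in dep_versions:
                        if top_level and dep_version not in top_level:
                            differences.append((name, version, dep, dep_version))
    return differences


print(differing_build_pins(LOCK))
# [('my-wire-format', '1.0', 'packaging', '1.0')]
```

A locker could treat such a difference as acceptable for a pure build tool like packaging, and as an error for something like protobuf where build and runtime versions need to agree.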

If you look at PEP 665 – A file format to list Python dependencies for reproducibility of an application | peps.python.org you will see we have purposefully defined the semantics of an installer to not be nearly as complex as what pipenv has to implement. What’s specified in the PEP could be implemented with packaging, installer, and the graph algorithm outlined in the PEP, and nothing else. That’s why sdist support would have to be optional.

abravalheri · November 11, 2021, 5:26pm

@mdrissi is not alone here.

If we have a look at the pip-compile absolute links when given relative path · Issue #204 · jazzband/pip-tools · GitHub discussion we can see that this kind of usage is fairly popular in the community, and not an isolated practice.

In general people want to keep the dependencies locked, but work in the source code. One of the reasons for that is to avoid the “but it works on my machine” problem, between developers in the same team.

The risk of not handling it in the PEP (or a following one) is eventually someone coming up with their own lock file format, or re-using the same lock file, but omitting/ignoring the hash in some circumstances.

The requirement of having a mandatory hash is fine, for now, while editable installs are not covered.
However it would be nice to keep an open door for a future PEP that standardises the inclusion of editable sources in the lock file and relaxes this requirement.

abravalheri · November 11, 2021, 5:41pm

Regarding source trees (not the editable ones), is it fair to assume that, since reproducible builds are not widely supported yet, it would be required to specify an algorithm to compute a “directory” hash of the source directory itself?
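
One possible shape for such an algorithm, as a minimal sketch: hash every regular file’s relative path and contents in a fixed order. This is only an assumption about what a “directory” hash could be; it ignores permissions, empty directories and VCS metadata, follows symlinks to files, and is not specified by any PEP.

```python
import hashlib
from pathlib import Path


def hash_directory(root: Path) -> str:
    """Return a hex SHA-256 digest over the files below root.

    Files are visited in sorted order of their POSIX-style relative path so
    the result does not depend on filesystem enumeration order.
    """
    digest = hashlib.sha256()
    files = sorted(
        (p for p in root.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(root).as_posix(),
    )
    for path in files:
        rel = path.relative_to(root).as_posix().encode("utf-8")
        data = path.read_bytes()
        # Length-prefix each field so file boundaries are unambiguous.
        digest.update(len(rel).to_bytes(8, "big"))
        digest.update(rel)
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

Two checkouts with identical file names and contents produce the same digest, while any changed, added, removed or renamed file changes it.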