Supporting sdists and source trees in PEP 665

I don’t know if I agree with that, actually. PEP 518 and 517 make no statement or claim that I can remember that this must be true. I’m not saying it isn’t desirable in some cases, but I don’t think it’s always necessary, which shifts how critical it is that this be in v1 of this file format. For instance, a project’s build tools may want a different version of ‘packaging’ than my project does (my guess is you’re thinking about some extension module like numpy that has a C API that you want to build against and thus also use at runtime).

Sure, if your locker isn’t doing this appropriately or you use multiple, manual steps to generate your lock files I can see things getting out of step with each other. But this is true for any situation where you don’t have a single lock file that covers all potential scenarios and we already decided that we are okay not supporting that.

Could you provide a concrete proposal so we have something to directly talk about? You would probably have to deal with:

  1. How do you denote that an entry is an “unsafe” sdist?
  2. How do you specify a build requirement for a specific sdist?
  3. How do you deal with conflicting build requirements compared to runtime requirements and how does that affect the installer’s work?
  4. Are we okay with installers having to support sdist building on top of what we are asking them to already do?
2 Likes

The ability to have relative editable paths is fairly key for me, due to my current development work in a monorepo with multiple Python packages that should be installed together but that we keep as separate packages for API boundaries. The inability to have editable/file installs and hashes in the same file very recently ended up making me drop hashes, as the lack of editable installs had often led to incorrect developer environments in my team.

For my use case I would be happy enough if only local/editable installs were permitted to be unhashed. The flexibility to mark any package as opting out of hashes would be nice, but I don’t have any use case for needing external packages unhashed. editable cannot be installed when requiring hashes · Issue #4995 · pypa/pip · GitHub is a long-standing related issue about pip’s inability to mix the two. When my options are a secure but broken developer environment or an insecure one, the latter wins.

As for the solution of two files, one lock file and one requirements file with editable installs, the main challenge is: how do I make the lock file? My dependencies come from my list of editable installs. My requirements.in looks like:

-e file:.
-e file:foo
-e file:libs/bar
-e file:libs/baz
-e file:foo2

If lock files can’t have editable installs then constructing a lock file becomes messy for a monorepo and requires building my own small tool that unions requirements, which feels like a tool that’ll slowly grow to deal with more packaging issues.

A pip resolve option that takes an input with relative paths but excludes those paths from the output would be one workaround, as it’d allow making a valid lock file for my external dependencies. A broader version of this would be resolution that can include packages we want excluded from the resolution output (pip-compile calls these “unsafe” packages, like pip/setuptools).

1 Like

But if the code is changing that much and it’s internal code then I don’t see how a lock file benefits you? My understanding of what you’re saying is you really just want a way to list things to install, which requirements files (as well as PEP 621) already cover. But this PEP is not meant to be a general solution for listing anything you may want to install, but to install specific versions of things in a deterministic, secure fashion.

I would prefer not to support that. If you need to install things from outside of the lock file then that’s fine, but I would rather make it so that whatever is listed in the lock file is considered secure from at least a data integrity POV and not water it down to become just a generic list of things to install.

1 Like

Is it not a reasonable assumption to make?

When, for each source package in the lock file, the locker adds a list of pinned build dependencies, the installer can pre-install these when creating the isolated build environment for each, and then run the build with a kind of --no-deps option that will refuse to build if a required build dependency is absent.

Is it reasonable to say that the build dependencies must be wheels, to avoid bootstrapping issues?

I’m not quite sure I understand how that would work. For instance, where would those separate lock files be stored? Would project authors have to create a separate lock file for each of their dependencies that is not available as a wheel?

If the list of pinned build dependencies is scoped to each source distribution in the lock file, and the installer builds sdists in an isolated environment, there is no conflict.

IMO, it could be fair to say that supporting sdist building is optional for installers.

But if the code is changing that much and it’s internal code then I don’t see how a lock file benefits you? My understanding of what you’re saying is you really just want a way to list things to install which requirements files already cover as well as PEP 621. But this PEP is not meant to be a general solution for listing anything you may want to install, but to install specific versions of things in a deterministic, secure fashion.

The lock file benefits me in keeping external dependencies locked. Each package in the monorepo has its own list of dependencies. I need some way to produce a pinned/hashed requirements file that I can then install to keep external dependencies reproducible. If I have 3 internal packages today, foo1, foo2, and foo3, there’s no direct way to produce a lock file of all of their dependencies without also including foo1/foo2/foo3 in the lock file. Installing things outside the lock file isn’t an option since I lack any direct way to construct the lock file. As a toy example, let’s say I have these packages with these dependencies:

foo1 → X, Y, Z
foo2 → X, A
foo3 → Z
X → A, B
Y → None
Z → None
A → B

I would like to make a lock file of the external packages of foo1/foo2/foo3. Something like:

X==V1 hash
Y==V2 hash
Z==V3 hash
A==V4 hash
B==V5 hash

I don’t see any way for any of the current tools to make that lock right now. I can make a lock that also includes foo1/foo2/foo3 with pip-compile, but then that file is unusable by pip due to the mix of editable installs and hashes. So no editables in the lock file is fine, as long as there is a way to produce a lock file where the resolver uses editables.

It is possible to work around this by making a lock file with both hashes and editables, then making a script that removes the editables, installing that, and then installing the editables afterwards. That’s a fairly messy solution and also means I can’t directly use normal pre-commit hooks like pip-compile, as the lock file they generate needs post-processing. The other workaround is building my own tool that concatenates setup.cfg/setup.py requirements and applies pip-compile to just that. Both workarounds boil down to making my own packaging mini-tool, and I think most people who end up wanting editables + hashes will just give up and drop hashes.

edit: Part of the reason this issue is specific to monorepos with multiple packages is that for a repository with exactly one package you can point pip-compile at setup.py/pyproject.toml and it will produce a lock file of the dependencies of that package without including the package itself. If you have multiple packages and you need a unified consistent environment, then pip-compile supports referring to each package, including in a relative manner, but it’s not possible to do pip-compile foo1/setup.py foo2/setup.py foo3/setup.py. So there’s a weird inconsistency here in that it’s easy to produce a lock file of dependencies for one package but harder to produce a lock of dependencies for multiple packages.

2 Likes

… all correct from what’s in my head. :wink:

I think that’s fair. I will update the PEP.

Maybe? :person_shrugging: I really don’t know. Some people do crazy things in their setup.py files.

That’s between you, the tool generating that lock file, and the tool doing the building. I got enough push-back from suggesting where lock files should go in terms of directories that I am not about to make that mistake again :smile:.

Probably. As of right now the PEP has no concept of scoping per-project, and that would be required to have separate build dependencies per sdist/source tree that don’t conflict with other projects or runtime dependencies. Maybe this is that one case where it makes sense to let a lock file point to another lock file to specify the build dependencies for an sdist (although I know @stewartmiles expressed concern about files getting out of sync with each other, but maybe if there were file paths linking them that isn’t as much of a concern?).

How would that look in the file? If people want this sort of thing supported in v1 then I think we need a proposal of what it’s supposed to look like from someone pushing for this. I have an idea, but it’s ugly and so I would rather see what others propose.

Otherwise I would rather get v1 landed and make sdist/source tree support a v1.1 thing for someone else to propose/push after seeing what solutions the community comes up with.

It would have to be optional IMO, else the simple installer and security story gets diluted too much. I will fully admit this PEP helps push the “wheels are good” story as much as possible and letting in sdists waters that down.

I would also say anything that goes in about locking build requirements would also have to be optional for lockers.

I don’t think that’s inherent to monorepos if you treat the individual packages in the repo as just that: separate, individual packages (and I used to work at Google, so I have lived the monorepo life and realize it can have its own issues when you view your entire code base as a series of snapshots instead of as individual units of stuff you pull together as needed at different points in the repo’s history).

From the way I’m reading it, the issue you’re having is you’re trying to treat the monorepo as a single thing to lock against, but still developing sections of it as independent units. That just doesn’t fit with the worldview this PEP is presenting. You may need to have your own tooling to make this all work by regenerating your lock files as dependencies change at whatever project granularity you have in your monorepo.

To be honest, this PEP might simply not be a solution for you (if I’m understanding what you are specifically asking for appropriately).

2 Likes

You are correct. We have multiple packages that define a single application and want one lock for all packages together. We could merge the packages into a single one, but the packages do have a notion of separate public/private interfaces. We don’t want package A depending on package B’s private interface, and separate packaging is mainly used for API structuring and not for separation of deployment. We also want to allow outside teams to depend on individual packages, even though most of the developers who work on those packages directly will need to cross boundaries frequently. Some of these requirements are competing and we have debated just merging the packages into one.

I do wish there was a nicer way to handle this, but if the answer is to add support to tooling for my need, that’s fair. Before I discovered this, I was working on adding a change upstream to pip-compile here to give me better control over resolution vs the lock file (a way to exclude internal packages).

Are there things we need to change (instead of add) in PEP 665 to make supporting building from source possible? I’d much prefer we move all conversation about strict additions (in other words, changes that don’t cause incompatibility concerns) to an entirely separate PEP and thread. It took us six years and 500+ messages to come up with PEP 660 (also noting that universal consensus was not reached even after all that), and I hope we could use the lock file’s equivalent of PEP 517 first, instead of having to wait for everything to be ready in one shot.

4 Likes

I’m primarily thinking about the case where all source packages are editable installs in the virtual environment. For example, assuming I have a service that is called via grpc and serializes data with protocol buffers and flatbuffers, I’ll have the following rough set of packages in my code repository:

  • my_code_generator
    install_requires: [grpcio-tools, flatbuffers] # grpcio-tools for protocol buffer
    build_requires: [setuptools, packaging, cmake] # cmake to build flatc from flatbuffers
  • my_wire_format
    install_requires: [protobuf, types-protobuf, flatbuffers] # runtime for generated serialization code
    build_requires: [my_code_generator]
  • my_service
    install_requires: [my_wire_format, grpcio] # grpcio runtime to call the service with the wire format
    build_requires: [my_code_generator]

Here I need to make sure that grpcio-tools (which depends upon protobuf) matches the same protobuf package as used at runtime, since a version skew can generate code that doesn’t work with the expected runtime. In this case all of the my_* packages are editable installs, i.e. they’re stored in my source repository and installed in the virtual environment with -e path_to_package.

So given the situation above we need a graph that encompasses both runtime (install) and build-time dependencies. At the moment PEP 665 has package._name_._version_.requires; this would add package._name_._version_.build_requires, where each requirement points to a package in the rest of the graph. This enables a locker to traverse the graph starting from an editable install (source package), find all build dependencies / requirements, and install them before installing the editable install, while also validating the safety of each build dependency using the same hash comparison mechanism used for any package.
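As a concrete (purely illustrative) sketch of this proposal, using the example packages above — note that build_requires is the hypothetical new key, not something PEP 665 currently defines, and the names/versions are made up:

```toml
# Hypothetical extension of the PEP 665 package table: build_requires
# sits alongside the existing requires key, and each entry resolves to
# another locked package so hashes are verified the same way.
[[package.my-code-generator."1.0"]]
requires = ["grpcio-tools", "flatbuffers"]
build_requires = ["setuptools", "packaging", "cmake"]

[[package.my-wire-format."1.0"]]
requires = ["protobuf", "types-protobuf", "flatbuffers"]
build_requires = ["my-code-generator"]
```

With a single graph like this, the locker can see that grpcio-tools (a build dependency) and protobuf (a runtime dependency) must resolve consistently.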

In response to your questions:

  1. How do you denote that an entry is an “unsafe” sdist?
    I’m not sure where the term “unsafe” came from here. Isn’t any source distribution that runs setup.py potentially unsafe since setup.py can do anything the user can do?
  2. How do you specify a build requirement for a specific sdist?
    As I mention above, package._name_._version_.build_requires.
  3. How do you deal with conflicting build requirements compared to runtime requirements and how does that affect the installer’s work?
    You don’t. This is the same problem as conflicting install dependencies: if they conflict you propagate the error back to the user, who needs to select a different package version. If a source distribution is being installed that requires a specific build package, then the user can always build a wheel and use that to break the dependency conflict.
  4. Are we okay with installers having to support sdist building on top of what we are asking them to already do?
    pipenv already deals with this, I don’t follow how this makes things more complex.
1 Like

Yes, hence the “unsafe” label. The PEP currently only allows wheels which do not execute any code during installation. So bringing in sdists breaks this safety promise.

Instead of going with the assumption that everything must match, could you have essentially an array per sdist of its build dependencies? Like [[package.__name__.__version__.sdist-build-requirements.__name__.__version__]]? That would then duplicate what the PEP already specifies for the package.__name__.__version__ table, but scoped to a specific project’s sdist. It’s a bit verbose, but it does mean you have a single lock file that encompasses everything, the locker can thus make sure e.g. grpcio-tools is consistent, while still allowing sdists to have differing build requirements compared to your whole lock file’s runtime requirements.
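For concreteness, the nesting suggested here might look something like the following — the sdist-build-requirements table name is only the suggestion above, and the filename and hash are placeholders, not real values:

```toml
# Illustrative sketch only; none of these keys or values are in PEP 665.
[[package.my-wire-format."1.0"]]
# ...existing keys for this project's files, hashes, requires, etc....

# A nested table mirroring the shape of the top-level package table,
# but scoped to the build environment of this one project's sdist.
[[package.my-wire-format."1.0".sdist-build-requirements.grpcio-tools."1.41.0"]]
filename = "grpcio_tools-1.41.0-cp39-cp39-manylinux2014_x86_64.whl"  # illustrative
hashes = { sha256 = "..." }  # placeholder
```

The duplication is the cost; the benefit is that one locker run can keep the per-sdist build environments and the runtime environment mutually consistent.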

If you look at PEP 665 – A file format to list Python dependencies for reproducibility of an application | peps.python.org you will see we have purposefully defined the semantics of an installer to not be nearly as complex as what pipenv has to implement. What’s specified in the PEP could be implemented with packaging, install, and the graph algorithm outlined in the PEP, and nothing else. That’s why sdist support would have to be optional.

@mdrissi is not alone here.

If we look at the pip-compile absolute links when given relative path · Issue #204 · jazzband/pip-tools · GitHub discussion, we can see that this kind of usage is fairly popular in the community, and not an isolated practice.

In general people want to keep the dependencies locked, but work in the source code. One of the reasons for that is to avoid the “but it works on my machine” problem, between developers in the same team.

The risk of not handling it in the PEP (or a following one) is eventually someone coming up with their own lock file format, or re-using the same lock file, but omitting/ignoring the hash in some circumstances.

The requirement of having a mandatory hash is fine, for now, while editable installs are not covered.
However it would be nice to keep an open door for a future PEP that standardises the inclusion of editable sources in the lock file and relaxes this requirement.

Regarding source trees (not the editable ones), is it fair to assume that, since reproducible builds are not widely supported yet, it would be necessary to specify an algorithm to compute a “directory” hash of the source directory itself?
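One plausible construction of such a directory hash, purely as an illustration of what an algorithm like this would have to pin down (the function name and the path/content framing scheme are my own, not anything specified anywhere):

```python
import hashlib
from pathlib import Path


def directory_hash(root: str) -> str:
    """Hypothetical "directory hash": feed every file's relative path and
    contents into SHA-256 in sorted order, so the digest is stable across
    platforms as long as file names and contents are identical."""
    h = hashlib.sha256()
    root_path = Path(root)
    # Sorting gives a deterministic traversal order regardless of OS.
    for path in sorted(p for p in root_path.rglob("*") if p.is_file()):
        rel = path.relative_to(root_path).as_posix()
        h.update(rel.encode("utf-8"))
        h.update(b"\0")  # separator so path/content boundaries are unambiguous
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()
```

A real specification would also have to decide what to do about file permissions, symlinks, and files that should be excluded (e.g. VCS metadata), which this sketch ignores.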

There’s no need to worry about that as people already have their own lock file format :wink:.

There’s not much we can do about that.

Any future PEP may change the requirements of PEP 665, so that door is always open.

1 Like

One question I have over this approach is whether we have any feel for what proportion of potential PEP 665 users will find an implementation that only handles wheels sufficient for them. There’s a non-trivial number of pure-python packages that only ship sdists, and those packages wouldn’t be usable in PEP 665 as it stands. (Those packages typically don’t suffer from any of the issues being discussed here, but will end up being just as unsupported as a complex ML package with awkward build dependencies).

I don’t want to end up approving a PEP that is too limited to be actually useful…

(Having said this, I do support the current approach - as an installer maintainer, I’m very aware of the amount of additional complexity that sdist support would introduce, so keeping it independent and optional seems like the right choice).

This is more or less what I had in mind.

Are you open to exploring that approach further? I think the main thing to investigate next is how to represent sdist and VCS references - and possibly local directory references. I’d start from PEP 610 for inspiration.

Yep, I’m open to it. A couple questions I would want to see answered are:

  1. How does an sdist specify what the top of its build-time dependencies are? My current assumption is a build-requires key for the file entry.
  2. How does one identify an sdist from a wheel?
  3. Are sdists that don’t follow PEP 517 (and thus don’t specify their build dependencies) even allowed? If the answer is “no”, then the answer to question 2 (assuming my suggestion for question 1) becomes, “if build-requires is defined, it’s an sdist”.
  4. How does one identify a source tree and are they treated any differently than sdist beyond how they are downloaded?
  5. What do you do about runtime requirements for sdists? Do the sdists also have to follow core metadata 2.2 and not have Requires-Dist set to be dynamic so their runtime requirements MUST be covered by the lock file as well? Do they have to support PEP 621?
  6. What do you do about runtime requirements for source trees since you won’t have the possibility of a PKG-INFO following core metadata 2.2? If you say, “the locker will have to build it”, then you already have your wheel, so why can’t you use the wheel instead of the source tree? Does the source tree have to use PEP 621?

I am also assuming that the PEP will say lockers and installers MAY support sdists, but it wouldn’t be a SHOULD/MUST recommendation. I would also be very tempted to say installers MUST make sdists an opt-in feature due to security concerns.

Ironically those packages are the easiest to build a wheel for locally and to use, as well as to end up with a reproducible build that gets the same hash on any platform. In those instances you could have an out-of-band step to build and cache your wheels and then have your installer do its thing with the lock file while checking your local wheel cache, all while still having a secure hash.

2 Likes

According to https://pythonwheels.com/, 342/360 (95%) of the top packages on PyPI have at least one wheel. A quick glance at the outliers suggests three of them may be difficult to compile (pyspark, thrift, and grpc-google-iam-v1), but the rest appear to be pure Python. For those that are still maintained we can probably work with the maintainers to help them build a wheel. For the dead/done projects, setting up a mirror of just pure Python wheels using the simple API should be doable.

Honestly, I’d love it if this became the impetus to get a build service for PyPI going.

2 Likes

A build-requires key is what I’m assuming too. It should specify all the build dependencies though, not only the “top”. So the installer can create a build environment with all dependencies pre-installed and error out if the build requires additional dependencies that were not locked in build-requires.

I’m not sure I understand this question. I don’t think anything in particular has to be changed in PEP 665 for that. In the other thread I was simply suggesting to add marker and tag fields to the package entries to let the installer select applicable entries based on well-defined fields instead of the file name (which, as per other comments, might not be available if the URL is optional). That should be sufficient.

I see no reason to disallow them. Actually, in a way, all Python projects follow PEP 517: “If the pyproject.toml file is absent, or the build-backend key is missing, the source tree is not using this specification, and tools should revert to the legacy behaviour of running setup.py (either directly, or by implicitly invoking the setuptools.build_meta:__legacy__ backend).”

I’m thinking of reusing the direct_url.json data structure from PEP 610 (with possible adaptations for TOML ergonomics). This covers local directories - editable or not (relative paths to be sorted out) - and VCS references (for which the lock file must require an immutable hash in the commit_id field).
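Translated into the lock file's TOML, a VCS entry reusing PEP 610's fields might look like this — the key placement is my own adaptation and the package name, URL, and commit hash are all placeholder values:

```toml
# Illustrative adaptation of PEP 610's direct_url.json fields
# (url, vcs_info.vcs, vcs_info.requested_revision, vcs_info.commit_id)
# into a hypothetical lock file entry.
[[package.my-dependency."1.2.0"]]
url = "https://github.com/example/my-dependency.git"

[package.my-dependency."1.2.0".vcs_info]
vcs = "git"
requested_revision = "my-feature-branch"
# The lock file would require an immutable commit hash here, playing
# the role the artifact hash plays for wheels and sdists.
commit_id = "0123456789abcdef0123456789abcdef01234567"
```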

The locker has to run the metadata preparation steps. This means resolving the backend build-system.requires and dynamic build requirements (get_requires_for_build_wheel). The result of that goes to the build-requires section of the package._name_._version_ entry in the lock file. The locker then has to run prepare_metadata_for_build_wheel and run the normal resolution process on the resulting metadata to add the runtime requirements to the lock file, as it would do for a wheel.

So I don’t think we need to place any additional constraints on source trees and distributions, beyond what is allowed today by pip or build. PKG-INFO 2.2 and PEP 621 can benefit the locker for better performance but they are not mandatory.

Not building but preparing metadata, yes. This is, normally, cheaper than building a wheel.

Agreed

Agreed, for installers that are concerned about running code during the installation steps. But as long as the source artifacts have a hash or an immutable VCS commit hash, build dependencies are locked in the same way as runtime dependencies, so from that angle the security should be on a par with runtime requirements.

Let me elaborate on the use case for VCS references. Assume you have made an upstream PR to a project you depend on. To use it in your application, you can add a VCS reference to your branch in your top level requirements. The lock file will preserve the commit hash. Other team members who have read-only access to the VCS repo can readily use it. Now if you want to lock it as a wheel, you need to come up with a private version number, make an additional patch to use that version number, build the wheel, publish it to a private index - which you have to have and maintain, etc. This is clearly much more overhead and a heavier process.

[…]

Not building but preparing metadata, yes. This is, normally, cheaper than building a wheel.
[…]

Not entirely true. There are plenty of situations I run into where, for whatever reason, pip is unable to build newer versions of some dependency from sdist, and keeps trying them in reverse order until it finds one from which a wheel can be successfully built. Metadata alone is insufficient to determine which versions of a dependency will be viable on a given platform, at least in practice, because there are countless factors (e.g. external linked library versions, compiler versions, system settings, and so on) that may influence which actual version of a built wheel you end up with.

Wheel metadata is designed to express compatibility without requiring the entire wheel to be built. The reason that pip sometimes needs to build entire wheels is that some packages do not structure their build system to achieve that, but instead do things the other way around and make metadata generation depend on binaries. This is unfortunately a people problem, and the only fix is to encourage those projects to optimise their build systems. No amount of standards and tooling from us can fix it.

3 Likes