PEP 665: Specifying Installation Requirements for Python Projects

I had been trying to stay out of this, but a few people have recently come and asked me about it, which has caused me to start diving into it more.

After reading the PEP and the discussion here, I see a few problems, so I'm going to call them out.

Issues

1. PEP 665 does not standardize “Lock” Files

The body of the PEP regularly refers to lock files, but I don't actually think this PEP standardizes a lock file. It standardizes a format that, if the emitter is careful, could represent a lock file, but it could just as easily represent something else entirely. As @pf_moore mentioned, it's perfectly legal within this PEP to simply copy the entirety of PyPI into the packages data, and that certainly doesn't fit anybody's definition of what a "lock" file is. I also think this adds to the confusion around whether or not an installer consuming one of these files needs to have a dependency solver.

I think you’re getting a weird amalgam of features and mixed messaging because the terminology doesn’t match what you’re actually specifying, and some of the features don’t make sense in that context.

If we take a step back, I think what we actually want here is a successor to requirements.txt, which can actually slot in and be usable for a number of use cases, one of which is as a lock file format, but it’s also a much more general format overall.

My suggestion would be to rewrite this to remove most or all of the references to lock files (except as an example of where you might use this) and call it a file that specifies the creation/dependencies of a Python environment. You might use this in a Python project, or you might use it completely independently of one, but ultimately what you're describing is how to create an environment that you need. That environment might use wide-open version specifiers, or it might use locked-down specifiers, but either way you're defining what needs to be installed into an environment.

2. Do not require no-network installation support

I think it's a fine goal to support no-network installations that minimize the risk of a changing repository causing breakage. However, I think it is a mistake to mandate this. There are a lot of cases where you don't care about that or don't want it, but you still want to create an environment (this also feeds back into the first item).

Thus I would make the listing of items in the package field optional, and also add a field for listing the sources that specific packages can be searched for in.

If you do that, you could perhaps specify that if a particular dependency is in the package array you need to use the data there, but I would make that strongly recommended rather than required, to leave room for some potentially interesting edge cases.
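To make that concrete, here is a rough sketch of the kind of thing I mean; the sources table and the from key are purely illustrative names I'm inventing for this post, not anything the PEP defines:

```toml
# Sketch only: "sources" and "from" are hypothetical names for illustration.
[metadata]
needs = ["requests>=2.25"]

# Repositories that packages may be resolved from when no file list is given.
[[sources]]
name = "company-internal"
url = "https://pypi.example.com/simple/"

# No entry in the package array is required for requests; an installer with
# network access can search the listed source instead.
[package.requests]
from = ["company-internal"]
```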

3. No-network installation support should have untouched data to resolve with

This issue could also roughly be titled "package.<name>.needs is the wrong abstraction."

I think a fair amount of confusion comes from the treatment of this field. It's been suggested that it should be unmodified from the package itself; it's also been suggested that the tools that emit this file should munge this data to turn it into == dependencies (or maybe something in between?).

I think that to support the full range of use cases you're going to, at a minimum, want to reproduce the exact dependency information from the artifact itself. You might also want to specify additional constraints generated by the tool that emits this file, but I personally think those should live in their own section somewhere. Maybe under metadata, though I think this is a separate concern from metadata.needs, which to me represents the human intent. So maybe metadata.constraints, intended to further constrain any resolution that occurs without causing anything to be installed otherwise.
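To illustrate the split I'm suggesting (the constraints key here is my invention, not something the PEP defines):

```toml
# Sketch only: metadata.constraints is a suggestion, not part of the PEP.
[metadata]
# The human intent, reproduced untouched.
needs = ["requests>=2.20", "flask"]

# Generated by the locking tool. These only constrain a resolution that is
# already happening; they never cause anything to be installed on their own.
constraints = [
    "requests==2.26.0",
    "flask==2.0.1",
    "click==8.0.1",
    "jinja2==3.0.1",
]
```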

I think this also helps with a use case the PEP currently fails on. It suggests implementing locking largely by limiting the files listed in packages, but that makes a fairly large assumption that the URLs from the repository are static, and that is not something any of our PEPs require of a repository (and I am aware of internal PyPIs at some companies whose index pages use file URLs with temporary authentication tokens embedded in them). Moving locking to a "constraints" field divorces locking from no-network installs, which means situations like that continue to work.

4. Ignoring the inputs / human-writeable side means you're not actually solving for Dependabot et al

As @njs mentioned, tools like Dependabot don't just care about the outputs of a lock, but also about the inputs. I think you should either remove them from the justification OR say that this file is also human-writeable and define a naming convention for the locking process.

For instance, you could imagine the specification saying that the package field is not intended for human writing, but that the rest of the fields are. Then you could have a workflow where you have rtd.INPUTPLACEHOLDER.toml, which is human-editable and generally will only contain metadata.needs (but maybe also metadata.marker, metadata.tags, or metadata.constraints), and we define that *.INPUTPLACEHOLDER.toml is the input to something like *.OUTPUTPLACEHOLDER.toml.

The biggest problem I see with this, which probably makes it a non-starter IMO, is that focusing only on the output means you can simplify the features you have to support, but if you also focus on the input, you have to start defining how a particular input gets compiled into an output. You then either lose the ability to get differentiation in tools, with all features having to be "baked" into the spec (e.g. if I want to include multiple files to mix and match things, the spec would have to support that), or you end up in a place where using the wrong tool produces broken output, possibly subtly so. Imagine the OpenStack case where I have some default constraints that are externally managed, so I want some form of an include: if we let that feature be tool-specific and you used the wrong tool, you'd just get no user-supplied constraints.

The other problem is that if you don't deal with inputs, you basically can't modify this file at all in any sort of automatic or agnostic fashion; you can only consume the information it presents. For instance, say VS Code grows support for this hypothetical file and we don't define the inputs. A reasonable feature VS Code might want to add is the ability to bump the version of something (either in constraints or in needs), but without modifying the input to this file, it's likely that the next time someone runs the tool that originally emitted it, that tool, which doesn't know about the change, ends up blowing away whatever change VS Code (or Dependabot, or whatever) made.

So we're going to have to decide on a trade-off between all the problems that come along with dictating the input to this file OR stating that anything that needs to modify this file in a tool-agnostic way is simply not supported, and that you can only read it.

Whatever trade-off is decided on, the PEP should be updated to remove the justifications that don't actually work (e.g. if it's decided to only allow reading, then the mention of VS Code generating lock files needs to be removed).

5. Needlessly ties the implementation to pyproject.toml and a particular directory structure

There are lots of reasons why you might want to create a Python environment (locked or not), and this PEP assumes (at least in the Rationale) that the input is going to be a pyproject.toml, which feels wholly wrong to me. Environments might coexist with a Python project, but they might also have nothing to do with one (e.g. a static blog made with Hugo, which is written in Go, using Fabric to upload to a remote server: I want a Python environment, but I do not have a Python project).

I suggest removing any references to pyproject.toml, and I would go one step further and suggest that we should discourage using pyproject.toml as the input file completely. It creates the same kind of confusion that people have with setup.py and requirements.txt. Defining an environment and defining a project are two different tasks, and should not share an input/source format.

This would also mean that the pyproject-lock.d directory needs to change, and honestly I would just get rid of this concept completely. It feels completely unneeded, and largely like specifying something for the sake of specifying it (plus its relation to pyproject.toml, which as I said is wrong IMO). To enable discovery I would just define an extension, which is the most common way to handle file discovery, and to enable out-of-the-box syntax highlighting I would make it a two-part extension, like .lock.toml or .env.toml or something.

6. Level of abstraction for package table is wrong

This PEP makes the assumption that different artifacts of the same version will have the same metadata. That is an invalid assumption with Python's metadata as it stands today. This data needs to be broken down per file or it is fundamentally incompatible with the vast bulk of software out there. As @dustin mentioned, PyPI made this mistake, and fixing it has been a to-do list item for a long time. PyPI mostly gets away with it because the data that is wrong isn't being used anywhere "important" (it's the JSON API and the web UI, neither of which get consumed by installers), but I suspect it would be a much larger problem if we were feeding it into the resolution algorithm.

You can argue that those people are "doing it wrong", but the fact of the matter is that it's a pretty simple structural change to fix (and AFAICT most installers already treat this data on a per-file basis anyway, so they'd just end up synthesizing it), and it would reduce a lot of potential frustration.

In the PEP you mention:

Luckily, specifying different dependencies in this way is very rare and frowned upon and so it was deemed not worth supporting.

However, you don't provide any data to back up that claim. I would guess that it is not rare, given that this was traditionally the correct way of doing it, and in most cases it never stopped working for people. Most people, in my experience, don't go around updating their packaging until something breaks, so I suspect there are a lot of projects out there doing just that, simply because it used to be the way to do it and it never broke for them. Just picking names at random from the top 100 downloaded projects on PyPI, it took me 5 tries to find one that does it (psutil, which uses a conditional on whether it's running under PyPy to add additional extras, which themselves use marker syntax). I'm sure if I spent more than 5 minutes looking I would find more.

Overall, it seems like a bad hill to die on to me. I tend to view the entire ecosystem as having a limited “breakage budget”, and this doesn’t seem like something worth spending against that budget for.

7. Some data for resolution is missing

The list of files isn't required to contain python-requires, but it needs to; it's a layer of data that has to be considered during resolution. This feeds back into item 6 above (and on PyPI this data is properly file-specific).
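Roughly what I have in mind looks something like this; the layout and key names are illustrative only, not the PEP's actual schema. The point is simply that the dependency and Python-requirement data hangs off each individual file:

```toml
# Illustrative sketch: per-file metadata, since artifacts of the same version
# can legitimately differ in both their dependencies and their Python support.
[[package.foo.files]]
filename = "foo-1.0-cp39-cp39-manylinux_2_17_x86_64.whl"
needs = ["cffi>=1.14"]
requires-python = ">=3.6"

[[package.foo.files]]
filename = "foo-1.0.tar.gz"
needs = ["cffi>=1.14; platform_python_implementation != 'PyPy'"]
requires-python = ">=3.6"
```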

8. Hashes only supports one kind of hash

This is somewhat nitpicky, but it would be really nice if hashes were a table instead of two individual keys. That would make possible future migrations to new hashes much easier, since we could just include a new key in the table alongside the old one.
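Something along these lines, keyed by algorithm (a sketch of the shape, not the PEP's current spelling; the hash values are truncated placeholders):

```toml
# Sketch: hashes as a table keyed by algorithm name rather than a fixed
# algorithm/value pair of keys.
[package.foo.files.hashes]
sha256 = "90beca..."   # placeholder value
# A future migration can add a new algorithm alongside the old one without
# changing the structure of the file:
blake2b = "a1f3c2..."  # placeholder value
```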

9. Items installed through this should not be direct URLs unless they actually were direct URLs

This PEP currently says that anything installed from this file should be marked as a direct URL, but that feels wrong to me. Just because you've precomputed some parts of the resolution doesn't mean that files which originally came from a particular repository have now become direct URLs.

In my opinion, only things which were originally specified as a direct URL, should be marked as a direct URL.

10. More explicitly state that it’s ok for installers to support a subset of features available here

The PEP alludes to this by saying:

Installers MUST error out if they encounter something they are unable to handle (e.g. lack of environment marker support).

But I think it would be better to explicitly call out that installers are free to support a limited subset of the features here, to enable installers that can enforce certain constraints (e.g. an installer that does no "real" resolution, and any time it traverses the dependency tree it just blindly accepts whatever is listed in constraints, or errors if an item isn't explicitly listed in constraints or if constraints contains a non-exact pin).

11. Versioning is too strict and/or an integer doesn’t contain enough information

A monotonically increasing integer for the version means that either every change has to be considered backwards incompatible, or every change has to be considered backwards compatible, or you only rev the version on backwards-incompatible changes and do nothing for backwards-compatible ones.

My experience with packaging suggests that backwards-compatible changes are far more common, but backwards-incompatible changes are not unheard of. Thus it is extremely useful to have a signal for both. For instance, adding a new key to the file is a very likely future update: if we don't solve item 8 above now, we may well need to add support for what I suggested there later. We could do that in a backwards-compatible way easily, but the versioning scheme in use here doesn't afford that capability unless we just add the key and ignore the version.

Generally I would recommend not exactly semver, but a two-part version, major.minor only. Tooling should error out if it gets a major version it does not understand, but only generate a warning if it gets a minor version it does not understand.
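As a sketch of what that would look like (the PEP currently uses a bare integer):

```toml
# Sketch of a two-part version. An unrecognized major part means the tool
# must error out; an unrecognized minor part should only produce a warning,
# so backwards-compatible additions (like the hashes table from item 8)
# don't force a breaking bump on every consumer.
version = "1.1"
```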

12. Needlessly breaking out the triplet on wheels?

Under the "code" field, is there a reason for breaking out the platform triplet from the filename into dedicated keys? That seems like inviting bugs where the broken-out values and the filename don't agree, since you're putting the same data in two places. It doesn't even save the installer from implementing the code to extract that information from the wheel filename, since those tags are optional and the PEP mandates that the installer MUST be able to fall back to extracting it from the filename itself… so it seems like it actually just complicates reading these files?
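Concretely, the duplication looks roughly like this (key names approximate), with nothing guaranteeing that the two copies of the tag data agree:

```toml
# Approximate sketch: the interpreter/ABI/platform triple is already encoded
# in the wheel filename, so dedicated keys are a second copy that can drift.
filename = "foo-1.0-cp39-cp39-manylinux_2_17_x86_64.whl"
interpreter = "cp39"
abi = "cp39"
platform = "manylinux_2_17_x86_64"
```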

13. Bikesheds

I'm not a fan of the new "needs" terminology; like others have said, we already have the "requires" terminology, and changing it seems like churn for no reason.

I'm also not a fan of "code"; it should probably be file or artifact or distribution. "Code" is ambiguous in that some could take it to mean the repository the code lives in (e.g. why would you use "code" for a compiled C extension?), and not all artifacts contain any code at all.

Summary

Overall I think there is something here that could be a viable replacement for at least part of requirements.txt, but as it stands it feels like it's sitting in a really weird place. It is trying to be a lock file, but then some tools implement "lock files" that aren't actually lock files (and I have serious doubts that those tools are actually producing correct multi-platform lock files, but that's neither here nor there), so additional features kept getting added, and you've ended up with a weird Frankenstein that is neither a traditional lock file NOR a particularly good replacement for requirements.txt.

I think if you make the changes I outlined above, particularly 1-5, you'll end up with a much more consistent and flexible result that can be used both for generated lock files AND for other, more interesting use cases.

I also think this flexibility more accurately reflects the intent of “common basis, but not a ceiling for functionality”, as breaking apart some of these intermingled features so you can mix and match them affords tooling a lot more ability to create interesting new combinations of features.

Sorry for the wall of text!
