PEP 665: Specifying Installation Requirements for Python Projects

Nice PEP, thank you for working on this.

It would be really useful to include a bit of context about assumed or recommended usage patterns of lock files. The only thing I see right now is in the backwards compatibility section, which says lock files can but don’t have to be checked into version control. Related questions:

  • Can/should lock files be included in an sdist and/or a wheel?
    • If a lock file is included, does it do nothing, does it override pyproject.toml, or can it be opted into via an installer?
  • Are lock files only for standalone/top-level projects (applications, dev environments, etc.) or also for libraries that are dependencies for other end user facing functionality?

If there is a good reference with more extended discussion, that would be good to link to as well. My impression after reading this PEP is that I should somehow know exactly what a lock file is for and when to use it because I have already used it elsewhere.

Are usage patterns of lock files in all those languages the same, and things work smoothly - or are there pain points that this PEP has taken into account?

This is not in the PEP (and should be), but my understanding when working on the PEP is:

  • The lock file is only for standalone project environments and should not be included in an sdist or wheel.
  • If it is in a wheel, it has no effect (the wheel’s dependency information is described solely by METADATA).
  • If it is in an sdist, the behaviour technically depends on the sdist’s build backend. We strongly encourage the backend to ignore the file.

This is the approach taken by all lock file usages across languages, from my understanding. Other than this, languages and their underlying packaging stacks differ too much for most pain points to be meaningful for Python, so we did not really address many of the pain points those other communities face (because most of them don’t make sense in Python), but tried more to identify and fix where their solutions are incompatible with Python packaging. Off the top of my head:

  • Many lock file designs pre-define dependency “groups”, but it’s clear that many Python users expect to have a lot more freedom in how many environments they can use and combine for a project (-r another file from a requirements.txt), hence the multi-file directory approach.
  • Python packaging provides some rather unique functionalities to specify dependencies (PEP 508 direct URL, pip’s ability to install a relative path, swap out the main index entirely, etc.), so there is quite a bit of extra information encoded in the file to support those.
  • Python users expect to be able to only copy the lock file somewhere (not the file that produced the lock file) and have it “just work”. Most languages expect you to either copy the entire project directory or at least the original user input (package.json, Cargo.toml, etc.). This requirement to make the file work entirely on its own is also somewhat special.

Yes, it’s technically redundant, but it is very useful when upgrading and removing packages in a lock file. The idea is that the locker can use it to find whether a package entry becomes dangling; say b only has a in its required-by, and a gets removed: then I can safely remove the entry b and look for b in other entries’ required-by. Without the field, the locker would need to reconstruct the entire tree top-down, which is more expensive for Python packaging than for other languages.
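
A minimal sketch of that pruning idea, assuming the lock file has already been parsed into a plain dict and that needs and required-by hold bare package names. The function name and the data layout are hypothetical, and the sketch assumes none of the affected packages is also requested directly by the user:

```python
def remove_package(packages, name):
    """Remove `name` and every entry left dangling by its removal.

    `packages` maps a package name to a dict with "needs" and "required-by"
    lists of bare package names (a simplification, not the PEP's exact tables).
    """
    removed = []
    pending = [name]
    while pending:
        current = pending.pop()
        entry = packages.pop(current, None)
        if entry is None:
            continue  # already removed earlier in the walk
        removed.append(current)
        for dep in entry["needs"]:
            dep_entry = packages.get(dep)
            if dep_entry is None:
                continue
            # Drop the back-reference to the package that was just removed.
            dep_entry["required-by"] = [
                parent for parent in dep_entry["required-by"] if parent != current
            ]
            # Nothing else requires it any more, so it is dangling.
            if not dep_entry["required-by"]:
                pending.append(dep)
    return removed
```

Only the entries reachable from the removed package are visited; the rest of the graph is never touched.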

Brett’s our de-facto arbiter for naming issues :stuck_out_tongue:

I may be missing something, but it seems this is achievable with the current url and needs fields, right? It’s definitely an interesting idea we didn’t think of.

I agree that there’s a bit of a weird asymmetry between needs and required-by. needed-by is more intuitively the other side of a needs relationship.

For background, perhaps it would be useful to mention that since lock files describe “the environment”, you cannot meaningfully combine different lock files (e.g. of different projects) together. If you need that, fall back to (unpinned) dependencies (and perhaps build a new lock file out of that).
So:

  • While lock files can be useful for library developers (e.g. for test setups or doc builds), they’re useless for users of the library.
  • Applications can (and should) use lock files for deployment. However, providing “traditional” (unpinned) dependencies is useful as well – both for building lock files and to enable installing into other environments (with the understanding that those environments need their own integration testing).
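
To illustrate why lock files of different projects cannot simply be combined, here is a small sketch that merges two sets of pins and reports the clashes. The project names and pinned versions are made up for illustration, and a real lock file carries far more than a name-to-version mapping:

```python
def merge_pins(*lock_pins):
    """Merge name -> pinned-version mappings, collecting any conflicts."""
    merged = {}
    conflicts = {}
    for pins in lock_pins:
        for name, version in pins.items():
            previous = merged.setdefault(name, version)
            if previous != version:
                # Two lock files pin the same package to different versions.
                conflicts.setdefault(name, {previous}).add(version)
    return merged, conflicts


# Hypothetical pins taken from two independently locked projects.
project_a = {"numpy": "1.21.1", "attrs": "21.2.0"}
project_b = {"numpy": "1.20.3", "packaging": "21.0"}

merged, conflicts = merge_pins(project_a, project_b)
print(conflicts)
```

Any non-empty conflicts mapping means the only way forward is to re-resolve from the unpinned requirements.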

I think that needs and required-by need to have the same verb: either needs and needed-by, or (my preference) requires and required-by. I prefer requires over needs because it’s how the Python community at large already refers to requirements. They are “requirements” and are typically listed in either requirements.txt or install_requires / extras_require (in setup.py). I think that changing the nomenclature to needs is rather pointless at best, and at worst it could introduce confusion for new Python programmers when they are trying to figure out how this relates to requirements.txt and install_requires in existing projects.

After reading this pep, I am also not sure how it handles the situation where one package is both a top level dependency, and also a dependency for another top level dependency. For example, imagine a project that uses flask and jinja2 directly. flask also has a top level dependency on jinja2. So, if I understand the pep correctly, we’d end up with a file that contained these sections (omitting non relevant sections to my question):

[metadata]
needs = ["flask", "jinja2"]

[[package.flask]]
needs = ["jinja2>=3.0", ...]

[[package.jinja2]]
required-by = ["flask"]

I can’t see why this would be a problem, but in light of Tzu-ping Chung’s comment:

I think it would be beneficial to explicitly state the hierarchy of dependencies appearing as both top level and sub dependencies. Otherwise an overeager implementation of this particular feature by any lockers could potentially result in them losing some top level dependencies in rare situations.

One last thing is I don’t understand why this should support having multiple different package.<name>.code tables for a single package, like the example does for mousebender. In such a situation, how does an installer pick which code to use? The pep seems to implicitly state that it prefers wheel types over any others when it states that installers may choose to refuse to install all types other than wheel, but at the same time the pep doesn’t actually say that wheel type code is preferred if multiple are found for the same package.

Additionally, since the motivation of this pep is to have a way to specify reproducible builds, it would stand to reason that if a package has multiple code blocks, they should all contain the exact same source code? If this is true, then why would you need multiple code blocks? If this is not true, then the build is not reproducible if installers are allowed to pick any of the code blocks to install the package from. One installer could produce a different build than another, which appears to violate the motivation behind this pep. I think it would be good to either require a single package.<name>.code block per package, or to specify a mechanism to indicate inside the file which code block is preferred. Potentially something along the lines of:

[[package.mousebender]]
version = "2.0.0"
needs = ["attrs>=19.3", "packaging>=20.3"]
preferred_code_type = "sdist"

Or, changing the specification for package.<name>.code so that the tables are not stored in lexicographic order, but instead stored in preferred install order. So installers would just have to use the first code block they found, and if for some reason the installer refuses to use that one, then it can continue to the next one.
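
A rough sketch of that "first acceptable entry wins" rule, assuming the code tables for one package have been parsed into a list of dicts in file order. The function name, the URLs and the accepted_types parameter are hypothetical and not part of the PEP:

```python
def pick_code(code_entries, accepted_types=("wheel", "sdist", "source tree")):
    """Return the first entry whose type the installer accepts, else None."""
    for entry in code_entries:
        if entry["type"] in accepted_types:
            return entry
    return None


# Hypothetical entries, listed in the locker's preferred install order.
mousebender_code = [
    {"type": "sdist", "url": "https://example.invalid/mousebender-2.0.0.tar.gz"},
    {"type": "wheel", "url": "https://example.invalid/mousebender-2.0.0-py3-none-any.whl"},
]

# An installer that accepts everything takes the sdist; a wheel-only
# installer skips it and falls through to the wheel.
first_choice = pick_code(mousebender_code)
wheel_only = pick_code(mousebender_code, accepted_types=("wheel",))
```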

Thanks for pointing this out. My comment was only meant to illustrate the idea behind required-by, but I failed to communicate that it’s only the basic idea, not the complete logic to do that. In general, you need not just required-by itself, but also the sections referenced in it (and the top-level dependency specification, as you pointed out). The idea is to limit what needs to be traversed when a locker only intends to modify a part of the graph, which is a common operation for tools like Dependabot (to upgrade a dependency away from a vulnerable version), and not currently handled well by tools like pip and its derivatives.

This reminds me though; in my old proposal back in 2019, I used an empty string to designate the top level dependencies. This never caught on back then, but I’d definitely not object if we bring it back :slightly_smiling_face:

[[package.jinja2]]
required-by = ["", "flask"]  # Jinja2 is specified by the user, and is also depended on by Flask.
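
As a sketch of how a locker could derive such required-by lists from the needs entries, with the empty string standing for the user's own top-level request: the helper names are hypothetical, and the name extraction is a deliberate simplification rather than a full PEP 508 parser.

```python
import re


def requirement_name(requirement):
    """Extract the leading project name from a requirement string (simplified)."""
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", requirement)
    if match is None:
        raise ValueError(f"cannot find a project name in {requirement!r}")
    return match.group(0).lower()


def build_required_by(top_level_needs, package_needs):
    """Invert `needs` into `required-by`, using "" for top-level requests."""
    required_by = {name: [] for name in package_needs}
    for requirement in top_level_needs:
        required_by.setdefault(requirement_name(requirement), []).append("")
    for parent, needs in package_needs.items():
        for requirement in needs:
            required_by.setdefault(requirement_name(requirement), []).append(parent)
    return required_by


result = build_required_by(
    ["flask", "jinja2"],
    {"flask": ["jinja2>=3.0"], "jinja2": []},
)
```

With the marker in place, an entry whose required-by still contains "" is never dangling, no matter which other packages are removed.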

In a perfect world, reproducibility means running the same code, with the same dependencies, on the same Python interpreter, in the same runtime environment. And that’s definitely a popular and valuable definition (also why containers are so popular nowadays). In reality though, projects tend to come with some level of assumptions and want to bend the rules. The runtime environment does not need to be exactly the same, as long as we don’t configure the environment in a way that significantly impacts the behaviour (a popular definition for web services). Maybe the application can run on multiple operating systems (and/or architectures) and they don’t all need to install the same wheel, as long as the dependency’s maintainers promise all those wheels behave the same, so we can develop on macOS and deploy to Ubuntu. In extreme situations, even a dependency’s version doesn’t need to be exactly the same, as long as only one unique resolution is possible for each platform, and the different resolved versions don’t affect the end application’s behaviour. These are all feature requests that came up during the development of existing tools and of PEP 665’s design. You can argue some of those are bad practice (I don’t think PEP 665 handles that final example without workarounds, for example), but they are popular enough that we are convinced they need to be possible; otherwise the format won’t be able to catch on and we’ll be back to square one.

Thanks for getting this written up! I know it’s been a big task. Just a few questions/comments:

  • Could you add a short summary putting this file in context given today’s de facto state of the world? e.g. “this is essentially the intermediate data generated by pip’s resolver in between parsing a requirements.txt/set of requirements, and just before it starts downloading/installing packages”. I think that will help place it correctly, in that this replaces neither of those steps, but actually separates them in a way that other tools can participate in either half of the process without having to do both.
  • Why require a directory? That seems to presume more about project structure than we should or should need to? (I guess it’s to make it easier for tools to be able to automatically build from a repo, but since they can’t pick the right lock file automatically anyway there’s nothing gained from the location being fixed.)
  • What should consumers (installers) written today do if the version field is not 1? We need to tell them now to either accept, accept-and-warn, accept-and-warn-on-unknowns, or fail fast. Even if the next version is mostly compatible, there’s a chance it may not be. (This would actually be the reason to bring back SemVer, so that we can make compatible-for-parsing changes without breaking existing tooling, but still have the ability to break it if needed.)
  • Where file:// is mentioned, could it be file: instead and allow relative paths? Main application would be to create a bundle of wheels and lock files where the whole bundle is deployed/downloaded and then the right lock file is selected on install. (The variables would also work, but “relative to the lock file location” is also valid. This scenario also gets weird given the arbitrary directory name requirement I mentioned above.)
  • type="source tree" is the space really necessary? Would type="sources" suffice?
  • Any consideration for having a requested_version field in package references (alternatively, “constraint”)? This is along the lines of what Conda does, which would allow using a lock file to store an environment spec that can be updated later without needing to find the original source again. The needs fields are essentially this, but seem unnecessarily indirect compared to just putting the requested constraint on the package table.
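
For the file: suggestion, a sketch of what "relative to the lock file location" could mean in practice. This is only an illustration of the proposal, not behaviour defined by the PEP; the function name and the bundle layout are hypothetical.

```python
from pathlib import PurePosixPath


def resolve_file_reference(lock_file, reference):
    """Resolve a `file:` reference against the directory holding the lock file."""
    prefix = "file:"
    if not reference.startswith(prefix):
        raise ValueError(f"not a file: reference: {reference!r}")
    path = PurePosixPath(reference[len(prefix):])
    if path.is_absolute():
        return path  # absolute paths are used as-is
    # Relative paths are anchored at the lock file's own directory.
    return PurePosixPath(lock_file).parent / path


# Hypothetical bundle: lock files and wheels shipped side by side.
resolved = resolve_file_reference("bundle/prod.toml", "file:wheels/pkg-1.0-py3-none-any.whl")
```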

I’m going to do my best to answer everyone’s question which hasn’t been answered yet. If I missed your question then please let me know.

PEP 665: clarifications based on feedback · python/peps@ae53120 · GitHub should contain all the changes I mention below. PEP 665 – A file format to list Python dependencies for reproducibility of an application | peps.python.org will have the fastest available, rendered version of the updates. I have also updated the copy above.

Works for me! And I appreciate that the biggest ask so far has been about a name. :sweat_smile:

Done.

Would this eliminate the top-level needs key?

What if my lock file for development is different than production? What if I have different production lock files? What if I want a Read the Docs lock file, testing lock file, dev lock file, and prod lock file and they are all subtly different?

We could try to cram all of this into a single file and come up with some way to deal with conflicts (e.g. separate sections for each “grouping”), but we chose separate files, as that more visibly separates things and keeps the individual files at least a little smaller than they might otherwise be when trying to navigate them.
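
A small sketch of how a tool might enumerate such a directory and let the user pick a lock file by name. The *.toml pattern and the function name are assumptions made for illustration; the directory to scan is passed in rather than hard-coded:

```python
from pathlib import Path


def find_lock_files(lock_dir):
    """Map each lock file's stem (e.g. "dev", "prod") to its path."""
    return {path.stem: path for path in sorted(Path(lock_dir).glob("*.toml"))}
```

An installer would still need to be told which name to use, since nothing in the directory says which file is the right one for a given situation.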

Error out IMO.
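
On the question of what an installer should do with a version it does not know, failing fast could look roughly like this. The sketch assumes the parsed lock file is a dict with a top-level version key; the exception and function names are hypothetical.

```python
SUPPORTED_VERSION = 1  # assumption: the only version this installer understands


class UnsupportedLockFileVersion(Exception):
    """Raised when a lock file declares a version the installer cannot handle."""


def check_version(lock_data):
    """Refuse to continue unless the lock file declares a supported version."""
    version = lock_data.get("version")
    if version != SUPPORTED_VERSION:
        raise UnsupportedLockFileVersion(
            f"lock file version {version!r} is not supported "
            f"(expected {SUPPORTED_VERSION})"
        )
```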

Sure (that was my mistake in writing it that way to begin with).

It’s not necessary, but “source tree” is an official, defined term which matches what this is meant for.

How is this different than having that specified in needs?

The PEP says the lock file is for “Python projects”. Is this meant to cover both applications and project development environments? This is unclear for me from the PEP as the former isn’t really mentioned. I think the motivation should start with use cases.

Talking about applications, I’d imagine a tool such as pipx would want to install applications from a lock file. Would it make sense then for applications to eventually include the lock files in the sdist or wheel?

It can, but all reasons we have both needs and required-by apply here as well, so no.

I believe it is. My personal view is that a Python library run in a local environment—development or otherwise—is an application (also a Python application is a library unless you vendor all of your dependencies including the interpreter, but that’s not relevant here), so the distinction is minimal. But I understand this application/library categorisation is very useful as a concept and important to many, so I agree the PEP should describe the use cases better.

That’s an interesting idea and definitely makes sense, but IMO there are many details to work out. One problem is that a package release is basically immutable, and including a lock file in it means the dependency graph is frozen in time, along with all the security vulnerabilities and bugs discovered afterwards. This is not an issue with applications, since by definition the project maintainers have control over the deployments as well and can upgrade the actual installations when needed, but this is not possible with versioned library releases. This feature is worth its own entire discussion and another PEP.

I don’t understand several aspects about this:

  • Why mix project installation requirements with lock files? They serve different roles: reproducibility for the latter and specifying (ideally) timeless - i.e. unlocked - dependencies to build/run for the former
  • Lockfiles are usually for environments, not individual projects - in particular, lockfiles of individual projects cannot be combined trivially (as mentioned above already); it’s clear that dependencies need to be specified per project, but how lock files should then be used at scale becomes very difficult (i.e. one user installing one library more than another means their environments might be completely different)
  • The name of the PEP says “installation requirements”, but it seems mostly about locking. In particular, allowing type="source tree" opens up a Pandora’s box of ABI concerns (and lack of reproducibility) that seem unaddressed.

I’m quite surprised that conda is not discussed as prior art here at all (not least considering the previous discussions on this topic: 1, 2) - it has successfully come up with a sufficient set of scaffolding to build the entire ecosystem in an ABI-compatible way across all platforms & arches, with or without GPUs, with package variants (e.g. OpenBLAS vs. MKL), etc. etc.

In particular, one of the crucial elements is the different kinds of requirements that are distinguished in conda. It’s very different for a package to runtime-depend on numpy or to be using the C-API, where then the version used at build-time affects the version usable at runtime, etc. etc.

The following is a laudable goal:

It would then be very unfortunate to reinvent something in a way that then is incompatible with an approach that reached a much higher degree of functionality already.

I’m sure the conda(-forge) people would still be interested in this format discussion (ideally with the ability to eventually “speak the same language”), so tagging some people from the two previous threads (as well as a smattering of conda(-forge) people whose discuss-handle I found): @pzwang @teoliphant @msarahan @dhirschfeld @scopatz @jezdez @jakirkham @ocefpaf @kkraus14 @minrk

I honestly don’t understand most of the points you tried to make, and suspect we are using the same terms to describe very different ideas. So this is my attempt to describe the terminology used in the PEP, and explain why I feel you are not thinking the same things when using the same terms.

For most of Python packaging (from what I understand), a “project” merely means a bunch of source files grouped logically together and used in one collective logical context. When a project’s code is invoked in an environment, it needs some run-time dependencies, and a lock is used to describe those dependencies in a way that things external to the local environment do not affect how the description would be interpreted. In this context, per-project and user-specified requirements are naturally a part of the lock, since they describe intent (why a dependency is needed).

Using the above definition, if the goal is reproducibility, a project’s runtime environment is naturally coupled to the project itself, since every environment created to run the project should be alike (if not identical), and the lock is describing that abstract likeness. The PEP also does not mention anything about combining lock files from different projects (as you said, it can’t be done easily and should not be done generally), so I’m not sure what to make of the rest of the paragraph.

As I mentioned in a comment above, a useful lock format is by no means one-size-fits-all, since there are a lot of practical definitions of reproducibility. Yes, allowing things to build from source (not just source tree, but sdist as well) opens the door to things that technically break the strict definition of reproducibility, but from what I can tell (based on feedback from authors of existing tooling), most people don’t want that strict definition, and are willing and need to bend the idea for practical reasons. Since the PEP does not define what can create a lock file (but only an interoperable format between a locker and an installer), you are free to create and use a tool that guarantees the strictest reproducibility definition and only outputs such lock files if that’s your goal; and it would be usable for any installer consuming the lock file.

IMO this is out of the scope of a lock file format. As I mentioned, the format does not intend to force complete reproducibility. The PEP also does not invent any of the reproducibility features, because Python packaging already has ways to enforce those (wheel tags and environment markers), and the lock file format only needs to support them. Therefore, additional reproducibility features should not block the creation of a lock file format, since those features can and should be added to those existing mechanisms—and when they are, they automatically become a part of the lock file.

It is also weird to me that you feel Conda should be discussed as prior art for a lock file format, because Conda (somewhat famously) does not have an equivalent to what other communities call a lock file (package-lock.json, Cargo.lock, etc.). The closest thing it has is environment.yml, which is a list of user intents, and addresses nothing about how those intents should be interpreted and the reproducibility issues that come with the interpretation. So at this point I’m completely at a loss and don’t know how to continue, since we are most definitely not on the same page.

It is also weird to me that you feel Conda should be discussed as a prior art for a lock file format, because Conda (somewhat famously) does not have an equivalent to what other communities call a lock file ( package-lock.json , Cargo.lock , etc.).

to this point, there have been efforts to create lock files for conda

so clearly that community is not seeing conda envs as a lock file

just wanted to echo this point since I think it got lost in the discussion.

I get that needs could be intuitive for beginners, but won’t a lot of ‘advanced beginners’ (like yours truly) be more familiar with requires?

Indeed, a conda environment is more than a lock file, in the sense that lock files are merely reproducible snapshots of an environment. As such, locking can be achieved trivially in conda for your current platform as conda env export -f my_env.lock and restored (anywhere, assuming the same OS/arch) as conda env create -f my_env.lock.

Where conda-lock comes in is that one might want to generate lockfiles for more platforms than the current one. That’s actually also a relevant question about the PEP: how does it deal with cases where requirements differ by platform?

As such, locking can be achieved trivially in conda for your current platform as conda env export -f my_env.lock and restored (anywhere, assuming the same OS/arch) as conda env create -f my_env.lock

In my experience it’s not so trivial but your point is taken

Thanks for your reply. Let’s try to take a step back. I agree that reproducibility is usually not that important, but since it was one of the two key points in the motivation, I picked it up. I propose to shelve that aspect for the time being (in the context of this discussion). :upside_down_face:

The much more important thing is that - from my understanding of the term - lock files only make sense for environments (and that can overlap with the needs of a single project, e.g. the environment that people use to be on the same page when co-developing) - but perhaps I’m not getting an important aspect here.

Assuming we understand lock files similarly, it’d be fine if the goal of this PEP is just focussed on describing all the transitive dependencies necessary to install or work on a given library, but then it should IMO not use the words “installation requirements”, because that is a much broader concept in my view - people want to co-install packages (following the “installation requirements”) that need to share common dependencies (e.g. numpy), and then it becomes an environment question again, because different people will install different sets of packages.

This leads me to the second point. My mental yardstick is not a python-only project, but something that needs to be compiled (a very common case). And in such cases, there are then a whole lot of other “dependencies” (in the sense of factors affecting the build) that come into play. As a sidenote, I think it would be worth sharpening the language around installation & runtime requirements, since these do not necessarily overlap once the project includes non-python code.

So IMO that’s a great goal, but not achievable for projects that aren’t just pure python without diving into some very tricky questions about being explicit enough so that “things external to the local environment do not affect how the description would be interpreted”. This is what I meant with considering conda as prior art, because it has solved exactly that question (and not with reproducibility as the primary focus).

I think this might be a crossed wire on the grapevine somewhere. It’s trivial in conda to create and use lock files. After doing e.g. conda create -n my_env python=3.9 numpy (and activating the env), the output of conda env export -f my_env.lock is (here for windows):

name: my_env
channels:
  - conda-forge
  - defaults
dependencies:
  - ca-certificates=2021.5.30=h5b45459_0
  - certifi=2021.5.30=py39hcbf5309_0
  - intel-openmp=2021.3.0=h57928b3_3372
  - libblas=3.9.0=10_mkl
  - libcblas=3.9.0=10_mkl
  - liblapack=3.9.0=10_mkl
  - mkl=2021.3.0=hb70f87d_564
  - numpy=1.21.1=py39h6635163_0
  - openssl=1.1.1k=h8ffe710_0
  - pip=21.2.2=pyhd8ed1ab_0
  - python=3.9.6=h7840368_1_cpython
  - python_abi=3.9=2_cp39
  - setuptools=49.6.0=py39hcbf5309_3
  - sqlite=3.36.0=h8ffe710_0
  - tbb=2021.3.0=h2d74725_0
  - tzdata=2021a=he74cb21_1
  - ucrt=10.0.20348.0=h57928b3_0
  - vc=14.2=hb210afc_5
  - vs2015_runtime=14.29.30037=h902a5da_5
  - wheel=0.36.2=pyhd3deb0d_0
  - wincertstore=0.2=py39hcbf5309_1006
prefix: C:\Users\[...]\.conda\envs\my_env

This specifies all artefacts in the environment down to the version, build number & build hash, which means recreating an environment from this lockfile will (generally**) be bit-for-bit equivalent (again, on the same platform) to the point in time where the snapshot was taken.

** except in exceptional circumstances; happy to go into detail if desired.
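
To make the three-part pins in that export explicit, here is a tiny sketch that splits one dependency line into its parts. The function name is hypothetical, and it only handles the name=version=build shape shown in the listing above:

```python
def parse_conda_pin(spec):
    """Split a `name=version=build` line from an exported environment."""
    name, version, build = spec.split("=")
    return {"name": name, "version": version, "build": build}


pin = parse_conda_pin("numpy=1.21.1=py39h6635163_0")
```

The build string is what ties the pin to one specific artefact rather than just to a version.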

Comment: while what you write about is a real pain point for projects with complex dependencies @h-vetinari, I don’t think it’s helpful to discuss it in the context of this PEP. Nothing in this PEP changes that one way or another. The scope and assumptions of this PEP are: use PyPI and wheels, and standardize lock files for use cases that mostly already work today.

The answer for “I depend on this native library that’s not on PyPI” already was “just bundle it in, or write in your project’s docs how to install it separately”, and that remains unchanged here.

Yeah, I can see how things would work with only wheels, but then type="source tree" should not be part of the scope of the PEP.

I disagree.

The type is clearly specified as “something to build a wheel from” and it uses an already-established-and-standardised meaning for “source trees”. Same for sdists.
