PEP 751: lock files (again)

Given that the UI for this in pyproject.toml is

[project.optional-dependencies]
docs = ["B<1.0"]
test = ["B>1.0"]

with no way of specifying “negative” optional dependencies, this isn’t going to fit well with how extras are typically specified.

As @DanCardin says, this is probably rare. And it’s basically about UI, not about what the file format will allow. So it’s a digression from the question at hand here, and I’ll drop it at this point.

2 Likes

I hadn’t, since I missed that detail until you pointed it out here.

Making sure I understand the scenario, the situation would arise with dependency declarations like:

# In the main dependencies
B; "os_name == \"nt\""

# In the optional dependencies
B; "sys_platform == \"win32\""

My initial inclination is to treat that the same as the conflicting dependencies case: if the dependency declarations need different environment markers for the same package, you need to put the dependency sets in different lock files when using the standardised format.

Adding complexity to the lock file format to directly support such a case feels like something that could readily wait for version 2, when specific examples can be given as to why neither “make the environment markers consistent” nor “emit multiple standardised lockfiles when the inconsistency can’t be avoided” are sufficient workarounds.

In principle yes, but it is more likely that this comes from transitive dependencies like

# In the main dependencies
B

# In the optional dependencies
C

# dependencies of B
D; os_name == "nt"

# dependencies of C
D; sys_platform == "win32"

Making the markers consistent is not possible if the inconsistent markers come from different transitive dependencies, which are developed by different parties - or at least I don’t think it is easy to convince them that sys_platform is better than os_name or whatever. Further, this is just an example; the markers could be completely unrelated, like python_version < "3.10" and sys_platform == "win32".

That is not sufficient if you want to be able to install both groups, because then you have to create a single lock file covering both. By the way, different markers are not conflicting. If you have something like

# group/filter 1
B; os_name == "nt"

# group/filter 2
B; sys_platform == "win32"

this only means that you have to install B if you install

  • only group/filter 1 and os_name == "nt"
  • only group/filter 2 and sys_platform == "win32"
  • both groups/filters and (os_name == "nt" or sys_platform == "win32")
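
That combined condition collapses to a single disjunction. A minimal sketch, assuming hypothetical group names "group1"/"group2" and a helper invented purely for illustration:

# Hypothetical helper making the boolean logic above concrete; the group
# names stand in for whatever filter names the locker actually uses.
def needs_b(active_groups, os_name, sys_platform):
    from_group_1 = "group1" in active_groups and os_name == "nt"
    from_group_2 = "group2" in active_groups and sys_platform == "win32"
    return from_group_1 or from_group_2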

I think this complexity is required for Poetry but as already mentioned we could put this information in the [packages.tool] section as a workaround.

Oh, I see, that makes a lot more sense as a plausible scenario. Given that, I think the Poetry approach makes a lot of sense: allow markers (and file locks) to vary based on the active filters.

I’m less sure what a good spelling of that would look like, especially in the file lock case.

There are three scenarios I see:

  • unconditional in base deps, conditional in one or more optional deps
  • missing in base deps, conditional in one or more optional deps
  • conditional in both base deps and optional deps

If we give the default deps a reserved filter name (e.g. “default”), then those three possibilities collapse to just the third case, so it feels like that should be part of any solution.

If I understand correctly, uv’s lockfile and resolution semantics are similar to Poetry’s:

  • A lockfile is created for a project, which must have a pyproject.toml (or a workspace: a tree of projects with dependencies between them).
  • We lock the project with all of its extras and development dependencies (they’re not “allowed” to conflict). (By “all of its extras”, I mean the extras of the root pyproject.toml – not all extras of all packages.)
  • We generally attempt to minimize the number of versions of a given package that are included in the lockfile (but multiple versions can be included).

There are a bunch of differences in the schemas, but the semantics are similar.

The lockfile is designed such that determining the set of packages to install can be done via a graph traversal at install-time (no “resolver”): you start from the package root, add its dependencies (filter by extra, filter by normalized marker, etc.), then continue until you’ve traversed the graph.
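
A rough sketch of that kind of install-time traversal (the graph layout, field names, and packages shown are illustrative, not uv’s actual lockfile schema):

from packaging.markers import Marker

# Toy encoding of the kind of graph a lockfile captures:
# package -> list of (dependency, guarding extra, guarding marker) edges.
GRAPH = {
    "root": [("a", None, None), ("b", "tests", None), ("c", None, "os_name == 'nt'")],
    "a": [], "b": [], "c": [],
}

def collect_install_set(graph, root, active_extras, env):
    to_install, stack = set(), [root]
    while stack:
        package = stack.pop()
        if package in to_install:
            continue
        to_install.add(package)
        for dep, extra, marker in graph[package]:
            # Skip edges guarded by an inactive extra or a non-matching marker.
            if extra is not None and extra not in active_extras:
                continue
            if marker is not None and not Marker(marker).evaluate(env):
                continue
            stack.append(dep)
    return to_install

# e.g. collect_install_set(GRAPH, "root", {"tests"}, {"os_name": "posix"})
# returns {"root", "a", "b"}: "c" is filtered out by its marker.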

We don’t support generating this kind of lockfile from (e.g.) a requirements.txt file, though we do support generating a “universal” requirements.txt output file from a requirements.txt input… This is effectively the lockfile described above, but converted to a requirements.txt by propagating markers along the edges of the graph, combining and normalizing them, and then writing out a flat list of fully-qualified packages.

4 Likes

I’ve now had some time to read the PEP, and I have a couple of high-level points I’d like to make.

First of all, the PEP is looking really good. A lot of work went into this, and it shows - thanks for sticking with it @brettcannon!

I don’t know if it’s just me, but I find it very hard to understand the structure of the lockfile from the field-by-field descriptions given. Would it be possible to add a “File Structure” section that gives an overview of the structure of the file? Having a simple example of both a file lock and a package lock lockfile (simplified to the point of only installing a single file/package, for brevity) actually in the PEP, rather than being linked, would also help.

Without an overview of the structure, I found it hard to follow the individual field descriptions, as I kept losing context.

In the “How to teach this” section, I think it would be worth including some comments on how to introduce users to the idea of having two types of locking - file locking and package locking. Tools will obviously have their own documentation, but if we introduce the concepts of file and package locking in the standards, tool documentation will need to change to be explicit about what they offer - and I think the PEP should make that clear.

10 Likes

I don’t have much to say because I think the PEP overall looks reasonable (and I’m constrained on time) but FYI this:

is not going to be an issue for Hatch because in that world there is the concept of environments which may have dependencies on their own separate from the project and each environment will have its own lock file.

Continuing the installation filtering subthread, I explored how far adding a new install_filter marker to the set of environment markers defined in Dependency specifiers - Python Packaging User Guide would get us in handling the “markers vary by installation filter” scenario. As with the current definition of extra, install_filter would be defined by context, and treated as an unknown marker in any other situation (such as when defining package metadata rather than processing a lock file).

The example below builds on the implementation sketch I posted previously:

However, the packages.filters array suggested in that initial post is omitted (as it becomes redundant given the ability to reference installation filter names in the marker field).

Based on the example below, I don’t think we need to define a default installation filter (but we may still want to reserve the name, and be explicit that omitting the always-installed default dependencies from the [[optional-filters]] table is intentional).

For example, suppose test requirements are defined as an optional dependency group, which the locking tool translates to a group-tests installation filter (even if the PEP doesn’t standardise the translation of extras and dependency groups to installation filter names, the example below still works, since the installers would be treating the filter names as opaque strings either way - any chosen prefixing conventions would be purely for human consumption, as even a combined locking+installation tool should be using the tool tables to communicate information between the two operations, rather than relying solely on specific filter naming conventions).

In the package locking case, the new marker would be used directly in the marker field on package entries:

[[packages]]
name = "optional-dependency-example"
marker = "install_filter == 'group-tests'"

[[packages]]
name = "additional-marker-clause-example"
marker = "(os_name == 'nt') or (install_filter == 'group-tests' and sys_platform == 'win32')"

For the file-locking case, it would theoretically be possible to go the same way, and just treat each combination of base target + optional filter as an independent file lock. However, that approach would potentially result in a combinatorial explosion of file lock names in the individual locks arrays for common packages, such as: [win64, win64-with-group-tests, linux64, linux64-with-group-tests, macARM, macARM-with-group-tests]. It would be preferable if the base packages didn’t need to also be labelled with the lock names for the optional groups (that is, the fact that win64-with-group-tests includes win64 should be conveyed more concisely in the lockfile’s structure than by also listing win64-with-group-tests every time win64 is mentioned).

One way I see to do that is to add a notion of “optional derived file locks”: non-exclusive optional locks that are nested under a base lock and implicitly include everything that is part of the base lock. Something like:

[[file-locks]]
name = "win64"
marker-values = ["os_name == 'nt'", "sys_platform == 'win32'"]
# Including both marker variations in the file lock eliminates the need to
# care about how different packages choose to specify their "on Windows"
# extras. Platforms where the two are somehow inconsistent would be
# excluded, but that's likely to be OK given they shouldn't exist.

[[file-locks.optional]]
name = "win64-with-group-tests"
marker-values = ["install_filter == 'group-tests'"]

Note: while more capable locking tools like Poetry and uv should be able to figure out the required structure of a package lock file from pyproject.toml, the same isn’t true for full file locks. For those, the target environment descriptions need to be provided, since one of the key points of a file lock is that it doesn’t need to handle arbitrary environments, just a set of known targets. Instead of inferring the targets, the locking tools would just need to be able to complain when a dependency declaration references a marker clause that the given file locks don’t cover (e.g. only os_name is specified in the file locks, but a dependency references sys_platform, or vice-versa). However, working out the optional locks to be derived from each given base lock would be back in the domain of the locking tools that support the definition of installation filters.

After the main (exclusive) file-lock has been identified, the environment markers for the optional derived locks would be evaluated, and if they’re all true, the derived lock’s name would be added to the set of lock names to check for when scanning the list of packages to be installed.
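
A minimal sketch of that installer step, assuming a hypothetical evaluate_marker(clause, env, active_filters) helper (hypothetical because the real marker machinery would first need to learn about install_filter):

def applicable_lock_names(base_lock, active_filters, env, evaluate_marker):
    # Start from the already-selected exclusive file lock, then add any
    # derived locks whose marker clauses all evaluate to true.
    names = {base_lock["name"]}
    for derived in base_lock.get("optional", []):
        if all(evaluate_marker(clause, env, active_filters)
               for clause in derived["marker-values"]):
            names.add(derived["name"])
    return names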

Such a structure would also be useful to more concisely handle cases where (for example) some dependencies are only needed on older versions of Python:

[[file-locks]]
name = "win64"
marker-values = ["os_name == 'nt'", "sys_platform == 'win32'"]

[[file-locks.optional]]
name = "win64-on-older-python"
marker-values = ["python_version < '3.12'"]

Rather than allowing for nesting of optional locks, cases where dependencies were only needed when multiple conditions were true would instead be represented as a third optional lock combining both marker conditions, such as:

[[file-locks]]
name = "win64"
marker-values = ["os_name == 'nt'", "sys_platform == 'win32'"]

[[file-locks.optional]]
name = "win64-on-older-python"
marker-values = ["python_version < '3.12'"]

[[file-locks.optional]]
name = "win64-with-group-tests"
marker-values = ["install_filter == 'group-tests'"]

[[file-locks.optional]]
name = "win64-on-older-python-with-group-tests"
marker-values = ["python_version < '3.12'", "install_filter == 'group-tests'"]

With this approach, the only optional locks that would need to be defined in the lock file would be those where they actually added additional packages.

(Conflicting requirements that have no compatible solution would still need to be separated out into distinct lock files, as I still don’t see a compelling reason for us to consider relaxing that requirement)

FWIW, when I’ve been coming up with examples in this thread, I’ve been using the heading navigation side bar in the “File Format” section of the PEP as my structural reference.

That said, I still agree with the idea of including a pair of short single-file-only examples that omit most of the optional fields, but still give a sense of what a package lock and file lock will look like (certifi would probably be a good candidate for that - pure Python with no runtime dependencies, so it’s pretty much the most trivial possible real world scenario, but also one where the potential impact of a successful supply chain attack is clearly significant)
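
A rough sketch of what such a minimal file lock for certifi might look like, reusing field names from examples elsewhere in this thread (the version, file name, lock name, and hash are placeholders rather than real locked values):

version = "1.0"
hash-algorithm = "sha256"

[[file-locks]]
name = "py3-none-any"

[[packages]]
name = "certifi"
version = "X.Y.Z"

[[packages.files]]
name = "certifi-X.Y.Z-py3-none-any.whl"
lock = ["py3-none-any"]
hash = "<sha256 of the wheel>"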

2 Likes

In my opinion, the difference between a per-file lock and a package lock is not clear - or at least it is not worth having users learn a new concept for it. There is also confusion about how to generate [[file-locks]] correctly, as @charliermarsh mentioned. In fact, all existing lock files are actually package locks - even for pip-tools, which generates hashes for all installation artifacts when --generate-hashes is given.

From my understanding, [[file-locks]] only provides the installer with a chance to pre-check: it can match marker-values and wheel-tags to find the corresponding environment name before installation, and then match individual packages.files entries using that name afterwards. That is different from the package lock strategy, where the installer always iterates over each [[packages]] and [[packages.files]] entry to evaluate packages.marker or the wheel tags encoded in the file names. There is some performance benefit, but not much. We can unite them into the same format, can’t we? After all, from the perspective of an installer, there must not be more than one matching packages.files entry for a given package in the current working environment, and this holds for both per-file locks and package locks. The pseudocode to get the installation set is as follows:

def get_installation_set(lock):
    installation_set = {}
    for package in lock.packages:
        # Skip entries whose marker doesn't match the current environment.
        if not package.matches_env():
            continue
        # Pick the single best file (e.g. most specific wheel tag) for this environment.
        file = choose_best_file(package.files)
        if file is None:
            continue
        if package.name in installation_set:
            raise Exception(f"Multiple package files found for package {package.name}")
        installation_set[package.name] = file
    return installation_set

In this case, there is no need for [[file-locks]] to exist and the formats can be unified.

Some other comments on the proposal:

[packages.vcs]

  • Must be specified if [[packages.files]] is not (although may be specified simultaneously with [[packages.files]]).

Should we also consider packages.directory?

I am against allowing the dependencies input to be altered at installation time, because it may require a resolution step at installation time. The resolution result should reflect the top-level dependencies input, and extras should be evaluated and stripped at lock time.

1 Like

Probably, but since that future PEP hasn’t been written I don’t know if your assumption holds true (e.g., will it be dynamic or already recorded in a file somewhere what GPU is supported?). :wink:

If others feel the distinction is important I can add it (I assume it’s just a boolean).

I actually had the same thought this weekend. :grin: If the extras are recorded in the lock file then the possibilities are finite, and so you can craft a marker that ties a package version to specific extra(s) and can’t accidentally apply in other scenarios (thank you, and and or). The key bit is the PEP would need to specify how to handle this, since the way markers work it’s only a == or != option, so you have to iterate through all the extras to see if any of them are true. I’m not sure if PEP 751 would complicate this such that a new marker name would be necessary, or whether it could be made to work with extra.
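
To make that iteration concrete, a small sketch using the existing packaging library (the marker string and extras are made up for illustration):

from packaging.markers import Marker

marker = Marker("extra == 'tests' and sys_platform == 'win32'")
active_extras = {"docs", "tests"}

# Since extra is single-valued per evaluation, try each active extra in turn.
applies = any(
    marker.evaluate({"extra": extra, "sys_platform": "win32"})
    for extra in active_extras
)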

I also don’t know how the install_filter that @ncoghlan is proposing differs from using extra. Is it just a more generic marker name for the same thing?

No, other than “don’t worry about it”. What sort of specification are you after? I don’t think other packaging PEPs say what to do if a key is missing, just what to do if it exists.

That’s just a mistake/oversight on my part. :sweat_smile:

The reason I didn’t go with a graph is it makes auditing harder. For instance, how do you know when a new package in your diff will apply w/o walking the graph yourself? The linearity of the file format is on purpose to avoid that and to make auditing a diff not require other context from the file.

Yes, I can add a simple example inline to the file.

:+1:

I asked above about how this differs from using the extra marker?

Maybe. I don’t know what the common case of the number of entries in such a case would be (e.g., would most people lock for dev and production, the major platforms, all platforms supported by PyPI, etc.).

I believe so.

I think it could be unified, but the question is whether that hurts auditing in the per-file locking case? Admittedly I haven’t thought deeply about it since this whole discussion started w/ per-file locking and I added per-package locking later when I think I came up with a reasonable approach.

Let me think about it and if people like the idea of unifying and I think it could make sense then I can look at doing that.

Yep, that’s an oversight on my part.

3 Likes

Not quite. “extra” refers specifically to extra names defined in the metadata of a published package that can be referenced in dependency declarations, while “install_filter” would refer to optional filters defined in a lock file, which may not correspond to published extra names. Initially, they’d most likely be the union of published extras and unpublished dependency groups.

That said, while I would expect to see extra-* and group-* as the two main sources of filter names, I don’t see a compelling reason to restrict them to covering just extras and dependency groups (instead prescribing only their behaviour, not their semantics). (For example, even in the absence of a standard for autodetecting GPU compatibility, a locker designed for that task may be able to generate opt-in “CUDA” and “ROCm” filters for different versions of binary dependencies)

For cases that can be correctly expressed as a package lock, that format makes the most sense to use. Since existing tools only do package locking, they’re not going to have their own immediate use cases for file locks.

The benefit I see to file locks is being able to express details that environment markers don’t cover, and enforce a stricter “in the face of ambiguity refuse the temptation to guess” behaviour on installers:

  • the lock file generation enumerates a defined set of environments
  • attempts to install into unknown or ambiguous environments will fail rather than resolving the closest match
  • installers will implicitly make the same assumptions as the locker did, since they’re only using environments markers to identify which named lock to use, they’re not comparing them directly to conditional dependency declarations

I think the optional filter concept makes that even more powerful, since it gives a clean way to separately express the dev/staging/prod distinction.

Edit to add this follow-up to a previous subthread of the discussion:

I finally got the use of pluralised nouns in the expanded array of tables syntax to click in my brain today by changing how I mentally parse a heading like [[packages]] when reading a TOML file: “append a new entry to the packages array”.

I had previously parsed [[item]] as “define a new item” (similar to the way [item] means “define item”), which is why the singular form felt more natural.

References to subfields like [[packages.files]] still read a bit oddly, but I make sense of them by reading them as equivalent Python code like packages[-1].files with the index being implied rather than explicit.
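
For anyone else who found the syntax confusing, the “append a new entry” reading matches what a TOML parser actually produces (a small illustration using the standard-library tomllib):

import tomllib

doc = tomllib.loads("""
[[packages]]
name = "A"

[[packages.files]]
name = "A-1.0-py3-none-any.whl"

[[packages]]
name = "B"
""")

# Each [[packages]] header appends a new table to the "packages" array, and
# [[packages.files]] appends to the "files" array of the most recent entry.
assert [p["name"] for p in doc["packages"]] == ["A", "B"]
assert doc["packages"][0]["files"][0]["name"] == "A-1.0-py3-none-any.whl"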

1 Like

I realized that PEP 735 and dependency groups may water down the usefulness of the separation. For instance, think of a dependency group to build your docs and another for your dependencies and testing needs. Those could be entirely disjoint, leading to no common packages.

I realized last night that if uv wants this but other people don’t speak up for it then it could be flagged in [packages.tool] so that uv as an installer can handle editable installs.

I think this could be viewed as a locker choice if we don’t want to codify it now. There’s nothing saying either approach is right or wrong, and so we could leave it open as an implementation detail for the lockers based on what the user asks for (if the PEP ends up supporting this scenario).

One thing I’m not sure about is whether a new marker is better than some array like filters or dependency-sets or something that lists what group of dependencies the package version belongs to. What’s the specific win of updating marker support instead of using a key in the TOML? The marker approach has a nice consistency w/ how package versions are proposed to be filtered, but it does require updating the marker spec and tooling to handle that new key (i.e. packaging). Not the worst problem, but because of this I think readability and comprehension upon seeing a lock file for the first time should be the deciding factor if we go with something like this.

I have a concern here about how the user knows what names there are w/o introspecting the lock file ahead of time to know what they want? From a CLI perspective where tab completion doesn’t work, knowing how to tell the installer what you’re after w/o knowing the naming scheme of the locker might be a bit of a UX hindrance.

Right, which is why I was thinking your proposal was a generalization of letting you add a label to the marker expression to filter on, much like extra does. So violent agreement. :wink:

You can also view extras as a subset of dependency groups; :turtle::package: all the way down.

Playing Devil’s Advocate, you could still get that if you had restrictive markers on the package versions and minimized what files were listed (i.e. one or two). Since the PEP doesn’t say you can’t list a package and version twice but with varying markers, you could potentially get the same strictness.
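
A sketch of what that could look like, using field names from examples elsewhere in the thread (the package, version, markers, and file names are illustrative):

# Same package listed twice with mutually exclusive markers, each entry
# listing only the file relevant to that environment.
[[packages]]
name = "example"
version = "1.0"
marker = "sys_platform == 'win32'"

[[packages.files]]
name = "example-1.0-cp312-cp312-win_amd64.whl"

[[packages]]
name = "example"
version = "1.0"
marker = "sys_platform == 'linux'"

[[packages.files]]
name = "example-1.0-cp312-cp312-manylinux_2_17_x86_64.whl"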


What I’m hearing in the feedback is people want to minimize the scenarios that lead to multiple files if possible. That is manifesting itself as people wanting extras handled in the same file, and the same for PEP 735 if it gets accepted.

There also seems to be at least some push-back on separating per-file locking and per-package locking so explicitly, instead of potentially generalizing it and simply making sure lockers have the ability/expectation of only locking the files for a platform. I’m seeing this in both explicit asks and confusion on how to actually do per-file locking.

I haven’t made any decisions about my response to either of these things, but do know I’m thinking about them.

5 Likes

One of the key factors pushing me down the install_filter marker path was actually the way that the lock format proposal tries to abide by the practice of using arrays for variable length data (as opposed to optional-dependencies tables with arbitrary key names), so I didn’t even seriously attempt to work out how to handle per-filter environment markers with dedicated fields rather than a new install_filter marker that allowed it to be included via and and or operators in the existing marker fields.

Amending that oversight now, I ended up with two potential ways of doing it, and see pros and cons to both (although I still prefer the install_filter marker over either of them, since it avoids a whole mess of complicated field interpretation rules that explain how to compose various fields together to express different scenarios that the marker syntax can already inherently handle).

Edit: simplified the proposal below to avoid needing file-locks.optional. While I still like that idea, one arguable advantage of giving installation filters dedicated fields is that it allows installation filter support in file locks to be defined without needing to define general purpose optional file-locks.

The first option avoids having to implicitly define a default installation filter:

  • add a new top-level [[optional-filters]] array (as in the install_filter marker suggestion)
  • add three new optional fields to the entries in the [[packages]] array
    • packages.optional: boolean indicating whether the package should be omitted if no named filters are enabled. Valid for both package and file locks.
    • packages.filters array: package is installed if at least one of the filters named in the array is enabled and any marker or lock conditions specified in the base table are satisfied. Valid for both package and file locks.
    • packages.filter-markers table: mapping from filter names to marker expressions. Package is installed if the filter is enabled and the given marker is true for the target environment. Only valid for package locks.

Package lock examples:

[[packages]]
name = "optional-dependency-example"
optional = true
filters = ["group-tests"]

[[packages]]
name = "additional-marker-clause-example"
marker = "os_name == 'nt'"

[[packages.filter-markers]]
group-tests = "sys_platform == 'win32'"

[[packages]]
name = "optional-with-marker-clause-example"
optional = true

[[packages.filter-markers]]
group-tests = "sys_platform == 'win32'"

Optional file lock example:

[[file-locks]]
name = "win64"
marker-values = ["os_name == 'nt'", "sys_platform == 'win32'"]

[[packages]]
name = "optional-dependency-example"
optional = true
filters = ["group-tests"]

[[packages.files]]
lock = ["win64"]

The second option is similar to that one, but replaces the optional boolean flag with an automatically (but explicitly) defined default filter for the package lock case (if filters or filter-markers is defined but omits the default filter, then the package is optional):

[[packages]]
name = "optional-dependency-example"
filters = ["group-tests"]

[[packages]]
name = "additional-marker-clause-example"

[[packages.filter-markers]]
default = "os_name == 'nt'"
group-tests = "sys_platform == 'win32'"

[[packages]]
name = "optional-with-marker-clause-example"

[[packages.filter-markers]]
group-tests = "sys_platform == 'win32'"

Optional file lock example:

[[file-locks]]
name = "win64"
marker-values = ["os_name == 'nt'", "sys_platform == 'win32'"]

[[packages]]
name = "optional-dependency-example"
filters = ["group-tests"]

[[packages.files]]
lock = ["win64"]

I don’t find either of those approaches appealing, since they both force installers to care about how to compose these fields together, rather than just understanding how to pass a list of active install_filter names to their environment marker evaluation engine.

My interpretation of that feedback is a bit different: since the current crop of locking tools focus on generating package locks, the file lock format doesn’t address a need that they care about. However, the Expectations for Lockers suggestion doesn’t consider the prospect that a locker may choose to only emit package locks (and never file locks), or vice-versa.

That’s an entirely reasonable stance for a locker to take, especially for those that are taking a high-level set of abstract requirements and turning them into a transitively locked set of pinned requirements.

By contrast, I think file locking is interesting specifically because it doesn’t assume that that high level set of abstract requirements is even known: you can potentially generate a valid file lock from pip freeze in a specific environment (or @EpicWink’s example of generating one from pip install --report).

Simple repo API JSON spec says “clients MUST ignore keys that they don’t understand”. Some options for unknown keys (non-exhaustively) include “must ignore” (to support future versions), “must error” (to strictly prevent typos), and “should/may warn” (to combine options).


A locker can always choose to not have multiple extras in one file, and instead generate a file for each combination. The converse (having multiple extras in one file) isn’t true if the spec doesn’t support extras (etc).


I would regret losing the ability to easily audit the list of files to be installed in an environment, so much that I wouldn’t use this lock file format. I’m not saying that the generalisation necessarily makes this the case.

It’s also easiest to create a file lock with simple tools (not existing locking tools).

Example command:
pip install \
  --quiet \
  --ignore-installed \
  --dry-run \
  --report - \
  'trimesh[easy]' \
  | jq '{
    version: "1.0",
    "hash-algorithm": "sha256",
    "file-locks": [{ name: "foo" }],
    packages: (
      .install
      | map({
        files: [{
          name: (.download_info.url | split("/") | last),
          lock: "foo",
          hash: .download_info.archive_info.hashes.sha256
        }],
        name: .metadata.name,
        version: .metadata.version
      })
      | sort_by(.name | ascii_downcase)
    )
  }' \
  | json2toml
3 Likes

Thanks for this detailed proposal! In order from most specific to most general. . . :slight_smile:

In the section on “Expectations for installers”, each of the subsections (for package and file locking) begins with “An example workflow is”. But then the list of steps includes things like “an error MUST be raised” in some circumstance or other. I found this jarring and I’m still not sure what the normative upshot is. To me, saying something is an example implies it’s not ruling anything out or specifying any behavior, just. . . giving an example. So a MUST inside an example is a contradiction.

If those MUSTs are really meant to restrict how installers may use the lockfile, I think the heading should be changed to something like “Installers MUST follow at least the following steps. . .”. Or if they’re not meant to restrict installers, then just remove the musts and say “in such-and-such situation, raise an error” (this being just an example of what a particular installer might do).

With regard to the package and file locking, I share some of the reservations others have mentioned. I just have this nagging feeling that these two kinds of locks are so different and their use cases so distinct that having them represented in the same file is going to lead to confusion. I don’t personally have a lot of need for the file-locking approach, so it’s hard for me to know, but I wonder whether people who need that (for security purposes) are going to really feel comfortable relying on this. @frostming said that “In fact, all existing lock files are actually package locks.” Is this the case? If so, do we have responses from those who feel the need for this that this is what they need?

Curious if you can explain a bit more why you think the new way of thinking is better. :slight_smile: To me, “locked” implies that the set of installed packages should itself be locked, which means locking the entire environment state. What does it mean to install multiple lockfiles on top of each other? It seems to me that that would disrupt whatever the user thought they knew about what they can assume about the environment.

Finally, I still twitch every time I see python-requires. I still think every packaging proposal that is built on a “Python manages the environment” model rather than “the environment manages Python” is a step down the wrong path. I know no one else agrees with me on this but I just want to mention this for the record.

1 Like

Thank you for this PEP :heart: - I’m really excited to see the impact this will have on the ecosystem.

However, the Python ecosystem has historically had spotty support for private registries, and while this has got a lot better based on your and others’ work, some parts of this PEP stand out to me.

Recording what package indexes were used by the locker to decide what to lock for was considered. In the end, though, it was rejected as it was deemed unnecessary bookkeeping.

Including a table of registry sources and a boolean indicating if the source requires authentication, with each package linking to its source, would help tooling that consumes this lockfile.

For example, I have seen both of these cases:

  1. Internal packages may have names that conflict with packages found in the public PyPI registry
  2. The internal package registry may have a custom-built wheel for a specific version of an external package

With the current PEP, we would need to configure the same external package registries in some way for potentially two different tools (the locker and the installer). And if we don’t, one result for the above examples would be confusing “version not found” or “hash integrity failure” errors, because the package was resolved from the public PyPI repository when it was supposed to be resolved from the private repository.

Including this in the lock file makes the sources consistent, and including a “needs authentication” flag allows the tool using the lock file to error out early if the credentials are not supplied in some way, rather than attempting to resolve and hitting the same errors above.
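
A rough sketch of what such a table could look like (all field names here are invented for illustration, not proposed spelling for the PEP):

[[registries]]
name = "internal"
index-url = "https://myregistry.internal/pypi/simple/"
requires-authentication = true

[[registries]]
name = "pypi"
index-url = "https://pypi.org/simple/"
requires-authentication = false

[[packages]]
name = "internal-package"
version = "1.2.3"
registry = "internal"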

There is also a follow-on issue with the package URL: in the example lock file the URLs included are the raw PyPI CDN URLs, e.g. https://files.pythonhosted.org/packages/78/b6/6307fbef88d9b5ee7421e68d78a9f162e0da4900bc5f5793f6d3d0e34fb8/annotated_types-0.7.0-py3-none-any.whl

The PEP says this is:

Useful for documenting where the file came from and potentially where to look for the file if not already downloaded/available.

However this may need more thought when it comes to private registries. Firstly, it may be worth specifying that the URL should be stripped of any authentication credentials.

Secondly this is relying on the URL being stable, which it might not be. Take for example a registry that uses S3 or other blob storage to store files. The download URLs in this case are typically either:

  1. directly exposed as temporary “signed URLs”
  2. an endpoint that redirects to the temporary “signed URL”

In either case there are some issues with including this field:

  1. If the URL is temporary (like a signed URL), then it’s useless to include and each resolve will return new URLs
  2. If the endpoint redirects, should the URL be the source or target of the redirection?

Edit: I tested AWS CodeArtifact and https://pypi.org/project/pypiserver-backend-s3 - both of these return direct links on the same host via the simple index and do not redirect to a temporary URL. However I believe the point still stands: the simple index can return a link on a different host and the link may be a temporary one - there is nothing in the simple repository specification that says a URL must remain valid forever.

If we store registry sources then we can include this URL only for the public global PyPI source, because it is undoubtedly more performant and we know that the URL is stable.

However, even if the private registry doesn’t require authentication, there is no guarantee that the URL to the package is not a temporary URL (i.e. a signed URL with credentials embedded in it). If the download URL is on the same host as the registry, contains no URL parameters, and is on a path relative to the simple index (e.g. https://myregistry.internal/pypi/simple/boto3/1.28.32/boto3-1.28.32-py3-none-any.whl), then we can infer it’s OK. But if the URL is on a different host, should the field be omitted?

With the current wording of the PEP it seems that this URL would be used to fetch the package, and if the URL is temporary then the tool would hit a 4XX error and fail in some confusing way.

4 Likes

You can add Poetry to the list of tools that have this distinction and therefore will use such a field.

1 Like

Based on those suggestions I would agree.

I would hope you would be able to do that even if all of this was generalized (i.e. any PEP by me is going to make sure you lock down to the file somehow).

Ah, you want a generic statement. I thought you wanted a custom thing per optional key.

And I’m not sure what wording I would want in there, since ignoring a key may lead to a security issue. I’ll probably add something saying you can ignore keys w/ a warning as long as the major version number of the file format is supported, but you must error out if the major version number isn’t supported.

And you won’t lose that as that’s a key motivator for me. It’s just a question as to whether some unification can still support the readability requirement.

As one of the people who needs this, yes, I feel comfortable w/ this. :wink: What alternative do you think people who need this are using today that they would feel more comfortable with?

Yes, that’s a possibility. But then again, just because you found a file in one place when locking doesn’t mean that’s where you’re going to find it later (and the file hashes are there to help make sure you’re getting the right file no matter where you downloaded it from).

The problem with that flag is “some way”. I assume you’re thinking the tool just knows what that authentication is? I’m also concerned this heads down a slippery slope of people wanting to enumerate all the auth mechanisms which I certainly do not want to do.

I don’t want to enumerate every potential security concern and how to remedy them in the PEP. I’ll have to think about how far I want to take that.

It’s actually not, and that’s on purpose. It’s “documenting” something, so it doesn’t have to influence anything, since it only specifies where to “potentially” download the file.

I view the wording as saying you could use the URL, not that it would be used (and if it doesn’t work, well, too bad). If that were true then your concern above about different indexes being used wouldn’t be valid, as the direct URL would always, and only, be used.

I can add a clarification, though, that the URL is not expected to work and tools shouldn’t rely on it exclusively.

2 Likes

I committed PEP 751: address comments (#3883) · python/peps@43eb5fe · GitHub which addresses most of the stuff I said I would update (forgot to update the learning section like @pf_moore asked for in PEP 751: lock files (again) - #46 by pf_moore).

I’m also still mulling over what I said I was thinking about in PEP 751: lock files (again) - #53 by brettcannon .

1 Like