PEP 751: now with graphs!

mdrissi · November 6, 2024, 11:14pm

It provides stability for the rest of the dependencies, which may be most of them. At my current work we have roughly 300 transitive dependencies and about 10 internal editable dependencies. Having both does not mean there’s no value in requiring hash locks for the 300 external dependencies. Our security team is aware we have internal editable dependencies, and that’s acceptable for us.

In practice, since requirements.txt files don’t allow mixing hash locks and editables, our workaround today is to maintain two files: requirements.txt (normal dependencies) and requirements-editable.txt (editable dependencies). They are installed one after the other, the first hash locked and the second not. This workaround works fine. There’s a several-years-old issue on the pip tracker asking for a hashed requirements.txt to permit some unhashed entries for editables: “editable cannot be installed when requiring hashes”, Issue #4995 in pypa/pip on GitHub.
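As a rough illustration of the two-file workaround described above, here is a minimal sketch that builds the two pip invocations in order. The function name `install_commands` is hypothetical; the file names are the ones mentioned in the post, and `--require-hashes` and `-r` are existing pip options.

```python
import sys


def install_commands(hashed="requirements.txt",
                     editable="requirements-editable.txt"):
    """Return the two pip invocations, hash-checked file first."""
    pip = [sys.executable, "-m", "pip", "install"]
    return [
        # Every entry in this file must carry a hash.
        pip + ["--require-hashes", "-r", hashed],
        # Editables cannot be hashed, so this file is installed without the flag.
        pip + ["-r", editable],
    ]
```

Each command could then be passed to `subprocess.run(cmd, check=True)`, so that a hash mismatch in the first step stops the second from running.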

edit: Also, a key reason the two relate is that we are locking the list of editable dependencies; that list is our top-level requirements. Thankfully no-emit-package exists for pip-compile/uv to exclude these inputs.

mikeshardmind · November 6, 2024, 11:23pm

Yeah, seems so on the mismatch of perspectives, but unfortunately I’m not sure my perspective is actually uncommon^[1]. Perhaps this is a place where open source is currently at odds with closed source industry use, but I worry about the effects this will have if industry consistently sees the open source solution as insufficient while at the same time being unwilling to cooperate openly. I don’t think anyone should feel compelled to try and solve that without that cooperation happening, and if that expands the scope of this too much for your comfort right now, that’s entirely reasonable.

I don’t think changing the spec in the future will solve this unless it’s done in a way that invalidates the prior spec and lets tools drop support for older versions, something I don’t see as likely to be accepted. If the version 1 format is still supported after that update, then the same issue of insecurely configurable tools exists.

[1] I’ve heard a lot more from former coworkers who have gone to other jobs adopting similar views on dependency management, and a lot of preemptively trying to make things more secure with the expectations of impending stricter regulation.

ofek · November 6, 2024, 11:29pm

What do you mean by “platform”? Do you mean just an operating system like macOS or are you including the architecture and other metadata?

This is something that was lost AFAICT when the PEP was rewritten. It’s not very easy to tell that a lock file was resolved for, as an example, Linux with an older version of glibc.

I’m curious to hear @charliermarsh’s thoughts on this, but having mostly finished Hatch’s implementation of workspaces, I don’t think this should be in a lock file. This is just a bit of input metadata indicating that you should resolve dependencies from a particular directory and that the inputs should be continuously updated during usage.

brettcannon · November 6, 2024, 11:34pm

There’s nothing in the PEP – or any other packaging spec that I can think of – that requires you support older versions of a spec.

Sure, but you don’t have to use it.

How about this: are you up to start a separate topic to discuss how to hash a directory of files? If you can reach consensus fast enough then I will incorporate the approach into the PEP (my guess is the PEP won’t get accepted any sooner than next month).

brettcannon · November 6, 2024, 11:40pm

Anything covered by environment markers.

Yes, that was one of the compromises currently in the PEP. I’m okay with bringing something back to support this.

charliermarsh · November 7, 2024, 12:42am

Are you asking if I think local projects should be represented in the lockfile? Or if “install as editable” vs. “install as non-editable” should be encoded in the lockfile?

charliermarsh · November 7, 2024, 12:48am

This feels a bit unfair. Poetry and uv both support this and have very broad usage across industry. I talk to companies every day that are using these tools. Cargo also supports this – you can have “local source trees” in your Cargo.lock.

mikeshardmind · November 7, 2024, 12:51am

Right, but similarly tools have historically not wanted to suddenly break existing uses, and you’ve inverted the statement there.

Yeah, and that seems like a possible outcome at work. I think even as proposed this isn’t an overall bad format though. (right now, I still think pdm’s current format is my favorite of those I’ve seen, but this isn’t bad)

I’m not sure there’s a great way to get a consensus on this, but I think it’s at least worth trying for, so I’ll propose a method either later tonight or tomorrow so long as time permits. I have an idea in my head, but I’ll need to work through it, as I think it’s going to be a desirable property that we only care about the contents of the files and not file system metadata. I want to limit the hashing to what is strictly relevant, so that things like differing file system/tooling behavior on “last updated” are not a factor if the content is the same.
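To make the "contents only, no file system metadata" property concrete, here is one possible sketch. It is not the method proposed later in the discussion, just an illustration under stated assumptions: SHA-256 as the algorithm, POSIX-style relative paths as file identity, empty directories ignored, and a hypothetical function name `hash_directory`.

```python
import hashlib
from pathlib import Path


def hash_directory(root):
    """Hash relative paths and file contents only.

    Modification times, permissions, and ownership never enter the digest.
    """
    root = Path(root)
    digest = hashlib.sha256()
    # Sort by relative POSIX path so traversal order cannot change the result.
    files = sorted(
        (p for p in root.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(root).as_posix(),
    )
    for path in files:
        rel = path.relative_to(root).as_posix().encode("utf-8")
        data = path.read_bytes()
        # Length-prefix each field so different (name, content) splits
        # cannot produce the same byte stream.
        digest.update(len(rel).to_bytes(8, "big"))
        digest.update(rel)
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

Two copies of a tree with identical contents but different "last updated" timestamps produce the same digest under this scheme, while any change to a file's bytes or relative path changes it.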

mikeshardmind · November 7, 2024, 12:56am

It’s certainly possible that my use case is more constrained than the industry at large here. I view the cargo lock file situation similarly, yet differently though. We don’t ship cargo into production systems, so any rust code not compiled by the build system doesn’t exist in production. In this way, any local/internal source is very possible to audit in better ways. Editable installs fail this.

EpicWink · November 7, 2024, 3:08am

I was being cheeky and co-opting someone else’s words for my own nefarious purposes.

Hmm, I think I misspoke. I don’t have anything to say on the source tree discussion (unless I’m now misunderstanding). I always want to record (lock?) at least an sdist or wheel in our usage of the lock-file.

The only change to the PEP I request is making URL (and path) optional, as that field is useless to us. I’m happy for the default for URLs (and paths) to be included. I don’t think it’s particularly difficult for an installer to get a wheel’s URL given: package repository index URLs, the package name, and the wheel’s filename.

Ahh I just realised, the wheel (and sdist) filename is not in the PEP, that’s the piece I’m missing. Knowing that: I suggest adding a filename field which is required if path and URL are not provided. Sorry for spilling ink.
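A minimal sketch of that suggestion, assuming a lock entry is a plain dict with hypothetical keys `filename`, `url`, and `path` (the PEP's actual field layout may differ): an explicit filename wins, otherwise the last component of the URL or path is used, and an entry with none of the three is rejected.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def artifact_filename(entry):
    """Return the wheel/sdist filename for a lock entry (hypothetical layout)."""
    if "filename" in entry:
        return entry["filename"]
    if "url" in entry:
        # urlparse drops any query string or #fragment from the path part.
        return PurePosixPath(urlparse(entry["url"]).path).name
    if "path" in entry:
        return PurePosixPath(entry["path"]).name
    raise ValueError(
        "entry needs 'filename' when neither 'url' nor 'path' is given"
    )
```

For example, `artifact_filename({"url": "https://example.invalid/files/demo-1.0-py3-none-any.whl"})` yields the wheel's filename, from which an installer can read the usual name, version, and tag components.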

No, I think that would be surprising and insecure if field values in a lock file are ignored.

Perhaps I shift the perspective and say that installers (or more generally, readers of lock files) are the security tools. They should be the ones that ensure security, and the lock file is the information that installers need to do that. This means installers should be free to raise an error for whatever security failing they decide.

radoering · November 7, 2024, 5:46am

If Poetry can create a lockfile, it is always possible to extract a solution (i.e. list of nodes) for each possible environment from this lockfile. However, installation might still fail for a specific environment if a node from the lockfile, which is required for this environment, does not contain a wheel suitable for this environment (and no sdist / source tree to build from).

brettcannon · November 7, 2024, 6:50pm

We won’t know until someone tries. But hopefully if this is important enough to people we can figure something out.

Thanks!

Based on no one loving the separation of metadata at the TOML level, I will probably update the PEP to either require the filename or say that if the filename in the path or URL does not match what is expected, then the filename must be provided.

That’s the assumption/approach I had in my head, hence the initial line in the PEP about saying installers should default to not using sdists (which will probably change to saying installers should provide a way to ignore sdists based on pip and uv feedback). While I agree the file format should try and provide as much data as reasonable to do things in a secure way, it’s still up to the installer to implement everything appropriately.

steve.dower · November 8, 2024, 10:33am

FTR, while I’m enjoying the side discussion about hashing directories of files, I can’t see any practical application for a lockfile.

The lockfile itself is not inherently cryptographically verifiable. We trust it, whatever it says. If you need it to be more trusted, then it’s up to you to use a verification mechanism when you bring it onto the system. Such a mechanism can also be used for verifying your directory of files.

I don’t want lockfiles to be able to refer to loose files above itself in a directory hierarchy (i.e. it can’t say install ..\..\..\malicious_code.whl), and I would argue that with that constraint, we can define adjacent loose files as equally trustworthy.
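The "no references above the lock file" constraint could be checked along these lines. This is only a sketch: the function name `is_within_cone` and the lock file name used in the example are placeholders, and an installer might resolve symlinks or handle platform-specific separators differently.

```python
from pathlib import Path


def is_within_cone(lockfile, target):
    """True if `target`, taken relative to the lock file's directory,
    resolves to that directory or something beneath it."""
    base = Path(lockfile).resolve().parent
    # resolve() collapses ".." segments and follows symlinks where they exist,
    # so a path cannot escape by weaving in and out of the directory.
    resolved = (base / target).resolve()
    return resolved == base or base in resolved.parents
```

With this check, `is_within_cone("project/lock.toml", "vendor/pkg")` is accepted, while a target such as `../../../malicious_code.whl` or an absolute path elsewhere on disk is rejected.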

There should be nothing stopping installers from having additional options to forbid installing from a source tree, or even doing it by default. These are UX questions, and can be debated on tools’ issue trackers. This is precisely why we define interoperability standards, rather than building one single tool - because different people have different needs, so we try to get them to align on data sharing rather than behaviour.^[1] I fully expect some installers won’t support building from source anyway (probably not the “major” ones, but I could see CI system integrations requiring prebuilt packages).

We can specify how to reference a directory (rather than a package), we can advise that allowing them to come from outside the lockfile’s cone is definitely a bad idea, and we can say how they’re intended to be used, but tools get the final say in whether/why/how they use the information.

[1] Though there’s a tendency in recent proposals – not Brett’s – toward trying to constrain users into certain styles/workflows. That kind of thing is an easy -1 from me.

ntessore · November 8, 2024, 10:50am

I thought being able to specify in a lockfile “install from my_data_analysis folder with hash xyz…” would be quite useful for reproducible science — merely as a tool for preventing mistakes.

steve.dower · November 8, 2024, 10:54am

If there were mistakes in transferring the folder, there could be mistakes in transferring the lockfile.

The point is, we need the hashes to verify other transfers, not those that are included with the lockfile. So anywhere the lockfile says “go here to get a package” where “here” is not within the user’s own control, we want the additional check. Once we’re referring to files entirely within the control of the user (and any malware already on their machine), security-wise all bets are off.

ntessore · November 8, 2024, 11:00am

Right, I was mainly trying to point out an ecosystem where, for any number of reasons, it’s common to have “packages” that only exist in the form of scripts in folders. In that case, it would be useful to lock something like “data analysis for my paper”, with no concern for security whatsoever. In the spirit of

steve.dower · November 8, 2024, 11:04am

I would “lock” them into a wheel. Or a Git repo, or some equivalent package, rather than trying to lock them as a (potentially aggregated) list of hashes in an adjacent file.

mikeshardmind · November 8, 2024, 12:05pm

I mostly agree with this, but my impression from a few comments in this thread, the prior thread, and my offshoot thread about standardizing on how to hash directories is that people care about cases beyond files living in the same “cone” as the lockfile. With this in mind, I think there are two options:

- include the hash, allowing support for directories other than the one where the lockfile exists.
- don’t include the hash, only allow files from the same directory under the trust principle you laid out.

The case I don’t want to see be possible is one where installers allow sourcing from outside the directory without a hash.

I do think there’s both security and non-security benefits to hashing the directories even with the current directory limitation, but the security ones on that narrow combination are defense in depth/ease of security tooling arguments that many have not always found persuasive.

The ease of tooling one applies to non-security stability/reproducibility cases as well, though I agree with you in theory here as well:

I’m aware that reality isn’t always so neat, and that it’s much easier to say that from an experienced perspective, and that many people are not going to understand why that’s such a common recommendation, especially if they haven’t had to deal with it from the perspective of a security or regulation background and are just doing what works and what has worked up until now… in that respect, we should want any obvious solution on the user side to not have an associated trap.

steve.dower · November 8, 2024, 1:47pm

Agreed on these two options, and it’s entirely pragmatism that helps me land on the latter:

- a hash in the lockfile likely doesn’t remove the need/desire/requirement to validate the loose files anyway
- calculating the hash is complex and expensive (and hence, probably error-prone)
- the scenario (loose files from outside) doesn’t seem likely to arise that often^[1]
- actual instances of this scenario (e.g. using a directory of loose wheel files in preference to a remote index) are covered by single file hashes
- alternative approaches (create a package) are likely better in other ways (e.g. portability, familiarity)

So both options are certainly valid, but I think the cost/benefit leads us to not worrying about making hashes for loose directories part of the interoperability spec.

[1] There’s a monorepo scenario for sure, but all the monorepos I’ve ever worked with have ended up with custom tooling for other reasons. Integrating your custom editable installs into that tooling seems reasonable - lockfiles aren’t trying to replace this aspect.

charliermarsh · November 8, 2024, 8:20pm

I think this question is still sort of unresolved, and resolving it requires that the PEP think about what interoperability means for the lockfile… If we want uv to be capable of installing a lockfile that’s been generated by Poetry, how do users tell uv “what to install”?

The lowest common denominator thing would be: to implement the spec, an installer MUST support installing a group by name. That’s really straightforward, and maybe it’s what we want? But it’s also sort of “low level”. For example, I wouldn’t expect uv users to type things like uv lock-install --group "root~test" to install the test group from the root project’s pyproject.toml file – I’d expect them to use uv sync --package root --group test or even uv sync --group test if it’s unambiguous. So most uv users wouldn’t be using the “standardized” API (to the degree that such an API exists) if they’re living entirely within the uv workflow, but they could still use uv to install from a lockfile generated by any tool with a little more effort (figuring out the name and typing it out).

(I kind of think that’s fine, but as-is it feels unaddressed.)

Anything more than that, and I think we need to be more opinionated about a lot of the concepts. For example, we arguably would need to define how groups are named or something like that, so that uv and Poetry and other tools could commonly translate commands like uv sync --package root --group test into lockfile lookups. But even then, I’m not sure how this would generalize to things like “resolving the lowest and highest versions in a single lockfile”, which is a current goal.
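To illustrate the gap between "install a group by name" and tool-level commands, here is a small sketch. Everything in it is hypothetical: the `groups` table, the `root~test` naming (taken only from the example above), and the package names. It shows the low-level lookup an installer would need, plus the kind of translation a tool would layer on top.

```python
def group_name(package, group):
    """Build a flat group name from a package and a group.

    The 'root~test' style is only the example from the post, not a standard.
    """
    return f"{package}~{group}"


def select_group(lock, name):
    """Return the packages recorded under an exact group name, or fail loudly."""
    groups = lock["groups"]
    if name not in groups:
        raise KeyError(f"no group {name!r}; available: {sorted(groups)}")
    return groups[name]


# Hypothetical lock data: group names map to lists of locked packages.
lock = {
    "groups": {
        "root": ["requests"],
        "root~test": ["requests", "pytest"],
    }
}
```

Here `select_group(lock, "root~test")` is the lowest-common-denominator operation, and `select_group(lock, group_name("root", "test"))` is what a friendlier command line would translate into. Without a shared naming rule, each tool's `group_name` could differ, which is the interoperability question raised above.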