How to validate lock files for security?

dimaqq · December 16, 2024, 5:12am

Suppose an external contributor opens a PR where they update some dependency, and the lock file entry (proposed) reads:

[[package]]
name = "bcrypt"
version = "4.2.1"
source = { registry = "https://pypi.org/simple" }
wheels = [
    {
        url = "https://files.pythonhosted.org/packages/bc/ca/e17...8c3/bcrypt-4.2.1-cp37-abi3-macosx_10_12_universal2.whl",
        hash = "sha256:134...c17",
        size = 489982 },

I can check the “name”, “source” but I can’t do sha256 in my head.

I’m relying on packages/prefix/prefix/hash/bcrypt-... to actually come from bcrypt.

Is that guaranteed?

What happens if someone pushes somename-vvv-cpvv-abi...whl to PYPI where metadata indicates some-other-name as package name? Does PYPI reject those and guarantee that every existing wheel name matches metadata?

sirosen · December 16, 2024, 5:49am

I think the review has to be tool-assisted in some way for this?
e.g. You have a CI step which runs

${ULTRA_COOL_WORKFLOW_TOOL} validate-lock
${ULTRA_COOL_WORKFLOW_TOOL} explain-lockfile-changes --as-github-pr-comment

Obviously such features require a very cool workflow tool, but I think that’s realistically the only way.

For example, what if you want to consider it an error for a lockfile to be missing wheels for supported targets? (Assuming those wheels exist on pypi.) I can’t check the contents of pypi just in my head anymore than I can hash a wheel in my head.

These two questions are tied together. I know that pypi allows some filenames which aren’t normalized perfectly (e.g. you can have an sdist with foo-bar rather than foo_bar), but I’m nearly certain that the normalized name parsed from your filenames is required to match the normalized name of your project.
That protects against both potential issues.

It would be good to confirm this if you are concerned – we have specs, friendly pypi maintainers on this forum, and you can always just try it out with a dummy package and see what happens.

JamesParrott · December 16, 2024, 9:53am

Assuming whatever uses the lockfile enforces hashes, their values can be compared to here:

File details

Details for the file bcrypt-4.2.1-cp37-abi3-macosx_10_12_universal2.whl.

File metadata

Download URL: bcrypt-4.2.1-cp37-abi3-macosx_10_12_universal2.whl

Upload date: Nov 19, 2024

Size: 490.0 kB

Tags: CPython 3.7+, macOS 10.12+ universal2 (ARM64, x86-64)

Uploaded using Trusted Publishing? Yes

Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for bcrypt-4.2.1-cp37-abi3-macosx_10_12_universal2.whl|Algorithm|Hash digest||
| — | — | — |
|SHA256|1340411a0894b7d3ef562fb233e4b6ed58add185228650942bdc885362f32c17|Copy|
|MD5|d030d446ee81b3b7e43083276b92e0ff|Copy|
|BLAKE2b-256|bccae17b08c523adb93d5f07a226b2bd45a7c6e96b359e31c1e99f9db58cb8c|

pf_moore · December 16, 2024, 11:30am

No, but you can download the file and use a hash calculating utility of your choice. Or if you trust PyPI, you can check the hash displayed on the PyPI webpage. You can also check the size, which is a quick sanity check rather than a guarantee.

You can check if that is the case by reviewing the package. The metadata standards say that it’s not valid for the filename and metadata to fail to match, but someone can easily create an invalid file by renaming a valid one. PyPI might validate, I haven’t checked the PyPI docs or code, so I can’t say for certain - for a wheel they might, but it’s unlikely they will for a sdist as that would require them to execute untrusted code.

Ultimately, you need to decide for yourself what threats you want to defend against, and how much effort you want to put into checking. The lockfile ensures that the installer will only install the file you expect it to (based on URL and hash) but it’s up to you to check that the file it will install is the one you want.

So basically, you need to:

Review the file at the URL given to make sure it is what you think it is.
Check the hash manually, to ensure that it’s not possible for someone to change that file later without needing to change the lockfile.

How you do those checks is up to you.

JamesParrott · December 16, 2024, 12:02pm

To my mind the threat of concern here is the PR author replacing bcrypt in the target project, by a malicious project also on PyPi, that has had a wheel uploaded to it, that has been renamed to look like a bcrypt one.

It’s good to know the names not matching is possibly rejected by PyPi or pip, but I wouldn’t rely on that.

Checking the hashes for the wheel are the same as whatever the bcrypt authors uploaded, limits the risks to those originating there. Which don’t just affect projects using lockfiles. And which would cause much bigger problems for the world than pulling in malicious code into a single downstream PyPi project.

oscarbenjamin · December 16, 2024, 12:34pm

How do you imagine that a non-malicious contributor would literally update the lockfile?

Presumably they should be running some command to do it and what you need to validate here is that running that same command gives the same lockfile. I would have thought that the part that you should be auditing carefully is where this contributor has edited the inputs to the locking tool whereas the outputs are easily verified by running the tool.

JamesParrott · December 16, 2024, 1:06pm

You’re not wrong. But that just means there’s no point in merging such PRs, if the maintainers have got to recreate them from scratch.

pf_moore · December 16, 2024, 2:21pm

Trust is always a balancing act. If you don’t trust any external contributor at all, then yes, you have to do all the work yourself. But very few, if any, open source projects offer an absolute guarantee to their consumers either - so you do a reasonable level of checks, run your test suite, and assume your users will either trust your processes or do their own deeper checks if they want to.

JamesParrott · December 16, 2024, 3:13pm

Absolutely. Dima’s basically asking for tools that could help with that. Such suggestions would be of use to all project maintainers, and needn’t require new features to be added to pip or PyPi, or have any impact on those at all.

oscarbenjamin · December 16, 2024, 4:42pm

I think we don’t understand each other. At least you haven’t understood what I meant but also it is possibly because I haven’t understood how lock files work.

I’m imagining that the diff in the PR is like this:

diff --git a/requirements.txt b/requirements.txt
index 5422d53..b89d53c 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,3 +1,3 @@
 foo
-bar == 3.1
+bar == 3.2
 baz

diff --git a/lockfile.txt b/lockfile.txt
index 5159d95..b6b5939 100644
--- a/lockfile.txt
+++ b/lockfile.txt
@@ -1,4 +1,4 @@
 [[bar]]
 -version = 3.1.2
-hash = 12309123i82109458409583409583409583409
+version = 3.2.2
+hash = 1203980485739876482375684765835

The PR author has updated the version pin in requirements.txt. They have then run a tool like:

lock-tool update -r requirements.txt lockfile.txt

Now you as a reviewer of the PR should look closely at the manual changes in requirements.txt. The other change should be verified by a CI job that runs lock-tool and verifies that it gets the same lockfile with same hash etc. Perhaps the tool provides this as a command for your convenience:

lock-tool check -r requirements.txt lockfile.txt

JamesParrott · December 16, 2024, 5:01pm

That’s very clear - thankyou. If anyone knows of such a CI job or Action, that can automatically run lock-tool update and throw an error if the file produced isn’t identical to the same file in a PR, it would be super helpful if they mentioned it. Ditto if anyone knows of something that supports lock-tool check already.

Until then, I stand by my original point. The PR requires just as much work to check, as it did from the author to create it. There is a point to raising such PRs - we maintainers may need a nudge to bump a dep. But other than encouraging the PR author and fostering community engagement with the project, there’s little point in merging them.

mikeshardmind · December 16, 2024, 5:08pm

If you’re using a lockfile for security reasons, I wouldn’t allow 3rd party contributions to touch the file. In any case, even that of just using it for reproducibility in CI, I would look at any change that introduces or changes dependencies with significantly higher scrutiny as it’s asking your project to extend the trust your users have for your project to the added dependency.

dimaqq · December 17, 2024, 5:17am

That was exactly my concern.

If the Warehouse ensures that every https://files.pythonhosted.org/***/bcrypt-***.whl file comes from bcrypt project, which I trust, I only need to manually check the new libraries, that is projects I don’t yet trust. That’s manageable.

If the Warehouse doesn’t provide such guarantee, then manual validation doesn’t scale, and I must build out CI to e.g. auto-validate lock file changes using tools, and the tools aren’t allowed to run sdist setup.py in that scenario, as that would be a great injection point to override tool output and retval.

But that just means there’s no point in merging such PRs

I was thinking if I should even allow my CI to run on this PR.

pf_moore · December 17, 2024, 9:06am

Ah, I see your concern now. No, I don’t think PyPI guarantees you can assume this just from the URL. But you can look at the file information on the PyPI page for bcrypt, and confirm the PR refers to a URL for the project and the correct hash.

That is what I’d do in a case like this. I wouldn’t bother automating it (I doubt it happens often enough to be worth it), I’d just make it part of the manual review of the PR.

JamesParrott · December 17, 2024, 12:23pm

Firstly, I think it falls under the type of PR like a bug report, for which it’s reasonable to ask for steps for reproduction.
Secondly, What was the goal of pinning versions in the first place? Maintainers aren’t obliged to bump dependencies on a simple innocent request. Users can upgrade a dependency to the latest version themselves easily enough with pip. Maintainers may reasonably prefer to be late adopters.

If you do want to adopt that version, the first and last 3 characters of the SHA256 you posted match what’s on PyPi for bcrypt. I’d just double check the whole thing is as below, and sleep easy:

1340411a0894b7d3ef562fb233e4b6ed58add185228650942bdc885362f32c17

ofek · December 17, 2024, 6:55pm

Agreed, I would have changes to direct dependencies trigger a workflow that constructs the lock file and pushes to the PR’s branch.

brettcannon · December 17, 2024, 9:21pm

I believe attestations with trusted publishers can give you that guarantee if the project is hosted on a platform that trusted publishers supports and they use it. With that you can then verify that a file came from a specific workflow (e.g., a specific GitHub Action from a specific repo).

dimaqq · December 17, 2024, 10:27pm

That’s true. I guess I was hoping for a shortcut. Something easy for a human reviewer.

I gather from these responses that this should be a machine job, not a human job.

oscarbenjamin · December 17, 2024, 10:58pm

I think if the lockfile can only be generated by a machine then it should probably also be for a machine to verify it. There should be some way for a human to be confident that the machines are doing their job though.

brettcannon · December 18, 2024, 12:31am

Verifying the attestation can be done by a person; look to see if the URL of the publisher matches expectations.