Establish publisher authority via automated DNS-backed challenges?

Splitting this concept out into its own thread:

One of the key realisations I have taken from the PyPI prefix reservations threads is that they’re fundamentally about PyPI clients asking the question “Can the publisher of this package reasonably be trusted?”, with the goal of helping tools to detect typosquatting and similar styles of attack.

PEP 752 and PEP 755 put the burden of establishing trust onto the PyPI “organisations” mechanism and a manual review process, then attempt to convey the result of that review to clients via a prefix reservation mechanism. Despite @ofek’s valiant efforts, I’m starting to doubt it’s going to be possible to wrangle that into a form which is both effective and feasible for the PyPI admins to manage.

The suggestions from @takluyver and @petersuter in the linked thread put me in mind of an entirely different model for establishing trust: the automated challenges that Let’s Encrypt uses to confirm that a client controls a domain name before issuing certificates for that domain. (Presumably Thomas and Peter also had the DNS-01 challenge in mind when making the suggestion to do something similar.) If it’s good enough for Let’s Encrypt to trust when issuing wildcard TLS certificates, it’s presumably good enough to trust when deciding whether PyPI packages come from an approved publisher.

As a completely unreviewed first sketch of that idea, I think it would need at least the following components:

  • a repository API project metadata field. Let’s call it domain-authority: it references a domain name (or perhaps a full URL) representing the “publishing authority” for that project.
  • a protocol that clients and repositories can use to confirm that a domain-authority API entry in a Python package repository JSON response is valid (such queries would be sent to the domain authority URL, NOT to the repository server). Both HTTPS and DNS seem like plausible candidates here, but HTTPS would probably be simpler, and would avoid various categories of attack that would otherwise arise from these values being static metadata rather than the dynamic challenge tokens used in ACME. A rough sketch of both pieces follows this list.
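
As a purely illustrative sketch (the field name, the pypi-projects.json path, and the response schema are all assumptions rather than a worked-out design), the two pieces together might look something like this:

```python
import json
from urllib.request import urlopen

# Hypothetical extract from a repository JSON API response for a project,
# with the proposed field added (name and value format are assumptions):
project_metadata = {
    "name": "example-utility",
    "domain-authority": "https://pypackages.example.com",
}

def domain_authority_claims_project(authority_url: str, project: str) -> bool:
    """Ask the publisher-controlled server whether it claims this project.

    Assumes the authority serves a JSON document listing the project names
    it publishes; the path and schema are made up for illustration.
    """
    with urlopen(f"{authority_url}/pypi-projects.json") as response:
        document = json.load(response)
    return project in document.get("projects", [])
```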

Creating a project with a domain authority set would be a three-step process (step 2 is sketched after the list):

  1. Create the project on the repository (so you can be sure the name is available)
  2. Add the entry for the project on the publisher controlled domain authority server
  3. Set the domain-authority field in the repository project metadata (which will now pass the repository server’s validation check due to step 2)
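
Continuing the assumptions from the sketch above, step 2 might amount to nothing more than publishing a static document on the publisher-controlled server:

```python
# Hypothetical document the domain authority would serve for step 2,
# e.g. at https://pypackages.example.com/pypi-projects.json (the path
# and schema are the same assumptions as in the earlier sketch):
DOMAIN_AUTHORITY_DOCUMENT = {
    "projects": [
        "example-utility",  # project names this publisher claims
        "example-sdk",
    ],
}
```

The repository-side validation in step 3 would then just be the domain_authority_claims_project() check from the earlier sketch, run against the submitted value before it is accepted.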

On the client side, there would be three options for handling domain authority validation (option 3 is sketched after the list):

  1. Ignore the new metadata field (this preserves the status quo)
  2. Accept a list of approved domain authorities, and warn for any packages that don’t come from an approved domain authority. Trust that the repository has already checked the validity of the domain authority entries in the API response rather than checking them independently.
  3. Similar to approach 2, but independently verify the domain authority claims. May fail the installation if any of the referenced domain authority servers are down.
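
A minimal self-contained sketch of option 3, assuming the same made-up document path as above and a user-supplied allow-list of approved authorities:

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

# Assumed user-supplied allow-list of approved domain authorities:
APPROVED_AUTHORITIES = {"https://pypackages.example.com"}

def check_before_install(project: str, authority_url: str | None) -> None:
    """Independently verify a domain authority claim, failing closed."""
    if authority_url not in APPROVED_AUTHORITIES:
        raise RuntimeError(f"{project}: no approved domain authority")
    try:
        # Same assumed path and schema as the earlier sketch.
        with urlopen(f"{authority_url}/pypi-projects.json") as response:
            claimed = json.load(response).get("projects", [])
    except URLError as exc:
        # Option 3 fails the installation when the authority is down.
        raise RuntimeError(f"could not verify {project}: {exc}") from exc
    if project not in claimed:
        raise RuntimeError(f"{authority_url} does not claim {project}")
```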

While I’m sure there are devils lurking in the details, the basic idea feels plausible to me, and more compatible with the historical free-for-all that is PyPI package naming. It also avoids putting any additional publisher review burdens on the PyPI admins (since that part of the problem is offloaded to certificate authorities and DNS registrars).

(There are also some echoes of the “maximum trust” TUF model here, but accepting substantially weaker security guarantees as a result of avoiding the key management logistics involved in enabling end-to-end artifact signing)

7 Likes

Thanks Alyssa! I was vaguely thinking of the ways you can prove domain ownership for getting TLS certificates, though I didn’t remember that in detail.

To understand the proposed mechanism a bit better, is the idea that the domain-authority field (on PyPI) would contain something like "pypackages.google.com", and then some information available from that domain would tell you “these are the PyPI packages we (Google) publish”?

Then you could install with an option like --allow-domain-authority pypackages.google.com or --allow-domain-authorities domain-list.txt to whitelist who you trusted? Or even something mapping individual package names to the expected publisher.

That seems like a nice step forward for the very security conscious, especially organisations using many of their own packages. And it avoids the awkward special casing around pre-existing packages. But the people most vulnerable to typosquatting are the people typing pip install django by hand, and it’s hard to imagine many of them will pass extra info to validate it. So the potential benefit seems fairly small.

The other half of my idea was to base distribution names on domain names, e.g. com.google.* for packages from Google. That has its drawbacks, not least that it’s a big upheaval because ~no existing Python packages are named that way. But I wanted to mention it again here, so it doesn’t get entirely lost in the switch to a new thread. :slightly_smiling_face:

2 Likes

I think this is an important point. We have mechanisms already (hash checking, for example) to help security conscious users to ensure that they are getting the packages they want. And to be blunt, if a security conscious user falls prey to a typosquatting attack, their security protocols aren’t as good as they think they are…

The target audience here seems to be people who aren’t making an effort to validate their installs. In particular, that implies that anything that doesn’t, by default, loudly warn the user on install that they are installing an untrusted package is probably not going to help a good proportion of the people we want to help.

I guess another target group is people browsing PyPI looking for suitable packages for their use case. But again, if they aren’t looking at the publisher, and checking the package details, I’m not sure how much we can do that’s going to be any more effective than what we already provide.

So I think I’m still struggling to understand the point of all this. Is it purely to give publishers a better sense that they’ve somehow protected their piece of the namespace, because they don’t trust the current informal process?

4 Likes

Doesn’t trusted publishing already cover this in a stronger manner? As @pf_moore said, if people aren’t checking this, then the extra mechanism does nothing; if people are checking it, we have better tools already (hash checking for installs, trusted publishing for establishing provenance, etc.).

1 Like

“Trusted publishers” are a different beast - those are a way for project owners to delegate upload authority to a CI service so it can upload packages to the repository.

Hash checking is also applicable at a different stage of the process: that’s about ensuring that if you re-run the same install, you’re getting the same artifacts.

Neither of those help with questions like “Is this azure-utility project I’m about to install actually published by Microsoft?”.

PEP 480 would theoretically help (hence editing the first post to mention it), but there are significant logistical key management issues with the end-to-end signing support in that PEP (and development on its repository-to-client signing prerequisite, PEP 458, is ongoing).

That means the current state of the art for guarding against typosquatting is to maintain an explicit approved projects list, and manually review any additions to that list. Automated domain authority checks would just allow batch approval of all projects published by a particular domain authority without having to approve each one.

My main interest is in managed approval processes, as even when I’m the one designing them, my experience has been that there’s currently no good way to make them not awful for developers and reviewers alike. A step towards being able to focus on reviewing and approving publishers instead of having to closely review every new project would be helpful on that front.

Helping in the general “naive user with no supporting infrastructure” case is a much harder problem that isn’t solved even for TLS (since the Let’s Encrypt challenges only check that the client controls the specific domain being requested, they actually make typosquatting over HTTPS easier rather than harder).

Helping those users would be a good thing, but this idea isn’t that, and I genuinely don’t know what such an idea would even look like (aside from the rules about Unicode confusables that are already in the naming specifications).

Reserved namespace prefixes would help a bit since they’d limit the typosquatting target to just the reserved prefix rather than the entire package name. I also recall some issues that were kicking around in one issue tracker or another about potentially notifying PyPI project owners when another project is created within a certain Levenshtein distance of an existing one (which could also be considered for reserved prefixes if they’re added).
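
For illustration, a naive version of that similarity check (the threshold, and whether plain edit distance is even the right metric, would need tuning; this is just the textbook dynamic-programming algorithm):

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance between two project names."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def near_misses(new_name: str, existing: list[str], threshold: int = 2) -> list[str]:
    # Existing projects within the threshold might warrant notifying their owners.
    return [name for name in existing if levenshtein(new_name, name) <= threshold]
```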

1 Like

Got it. But if you could do this by PyPI organisation (e.g. jupyter) or user (e.g. aws*) associated with a project, would that meet your need? It seems like that would require new tooling, and maybe new APIs on PyPI, but no new standards.

*I presume a lot of things that could be organisations are users because they were set up before organisations were introduced.

1 Like

OK. What I think this says to me is that any proposals in this area (meaning this proposal, as well as PEPs 752 and 755) need to be much clearer about exactly what problems they are hoping to solve, as well as how they expect to help.

As @takluyver says, it feels like this should be something we can cover with the organisation support on PyPI (and if we can’t, I wonder what the point of the organisation feature was anyway).

My other thought is that if this (or a solution based on organisations) is something that we want to promote, then where does that leave PEPs 752 and 755? Have we just taken a big chunk of their motivating use cases out from under them, or are they useful even with this in place? I genuinely can’t tell.

It feels like maybe we should be taking a step back, and agreeing on what the problems are here, before we start throwing out solutions. And this seems to be a general issue in the PyPI area - we already have organisation support, PEP 708, and TUF, all in various stages of completeness, and now we’re debating three more proposals. Maybe I’m being too negative, but I feel like we have a problem with seeing things through to completion here.

7 Likes

I don’t think they help at all. They only encourage worse project names when the community steps in to fill a need, and presuppose that just because a company is using a prefix, it is inappropriate for the community around a publicly available service to build tools using that prefix. They also don’t work retroactively unless you’re willing to kick the volunteer community even harder. At best, they have the appearance of helping.

Full namespacing (not the compatibility attempt with prefixes, which needs to grandfather existing names and leaves users navigating grants to find out whether any apply…) could help here, if tools had API access to details about who owns a namespace.

I don’t think most people actually care who is running the account; they care who is providing the code. Trusted publishing exists because there can be a provenance gap there. In fact, I would argue that verification by account alone, without trusted publishing, is a bad security trade with the tools we have, and increases the value of account and token compromises if the guidance becomes “you verified the author, you’re good”.

If the PyPI APIs were extended to give tooling visibility into which service, and which user of that service, published a package, then trusted publishing can help here, provided you trust PyPI (which you would have to for any mechanism discussed here to help).
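
As a purely hypothetical illustration of what that extra visibility might look like in a file’s JSON API entry (every field name below is invented; PyPI does not currently expose this):

```python
# Invented extension to a PyPI JSON API file entry, surfacing the
# trusted publisher identity that performed the upload:
file_entry = {
    "filename": "example_utility-1.0-py3-none-any.whl",
    "trusted-publisher": {
        "service": "github-actions",
        "repository": "example-org/example-utility",
        "workflow": "release.yml",
    },
}
```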

This would carry the existing security benefits of trusted publishing further down the supply chain, without increasing the costs volunteer project maintainers would be pressured to incur. (Should the maintainer of a simple wrapper around an API be pressured by a company to go buy and set up a domain for this if it becomes the norm, when trusted publishing is available essentially for free to OSS projects?)


Beyond this, I don’t think corporations should trust PyPI indefinitely; they should instead run an internal mirror, trusting PyPI initially but not relying on packages remaining available there. This has benefits in both directions, lowering the bandwidth costs corporations impose and decreasing their exposure to packages being removed. Barring that, they should set up hash verification. It’s within a corporation’s reasonable threat model that an index could be compromised. It also means that adding packages can go through a central review process, and that multiple projects benefit from that review in companies that aren’t single-product.

1 Like

There’s a lot to be said for leveraging PEP 708 and encouraging organisations that want a private namespace to either set up their own index, or buy an area on a fully commercial “Package index as a service” offering. That would require the community to get PEP 708 implemented, and add the necessary UI changes to installers, but it’s a more scalable long term solution IMO.

See my earlier comment about seeing existing things through to completion :slightly_frowning_face:

3 Likes

I’d strongly recommend having this be HTTPS only, and not reusing DNS-01 or HTTP-01 or other pre-secure-origin challenge mechanisms! Those mechanisms are great for Let’s Encrypt’s purposes, but the package index is in the fortunate position of being higher up the stack; given that a secure origin (= HTTPS) should probably be table stakes for domain verification anyway, my suggestion would be to use the .well-known URI scheme (RFC 8615) and have a basic proof served from there.
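
Purely as an illustration of the shape that could take (the well-known suffix and the document schema below are placeholders, not registered or specified anywhere):

```python
import json
from urllib.request import urlopen

def fetch_domain_proof(domain: str) -> dict:
    # RFC 8615 only reserves the /.well-known/ prefix; the suffix here is
    # a placeholder, as is the expected document shape, e.g.
    # {"projects": ["example-utility"]}.
    url = f"https://{domain}/.well-known/pypi-domain-authority"
    with urlopen(url) as response:
        return json.load(response)
```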


To take a step back, this is indeed separate from Trusted Publishing, but it does dovetail closely with related efforts. TUF has already been mentioned, but it’s also pretty close to the PEP 740 attestation model (which currently assumes Trusted Publishing identities, but was designed to be extended to arbitrary identities, like emails and domains).

3 Likes

I was wondering about this throughout the other discussions. It seems like a third-party package index is a better option than namespaces on PyPI[1]. I was confused about why that wasn’t being discussed so much in the namespace threads.

It seems like the answer is: because third-party indexes are already sorta-supported and used, but not sufficiently well-supported to solve the problem for many organizations. So solving this first makes a lot of sense to me, before other solutions are required.


  1. simpler for an organization to administer, the namespace isn’t crowded, they don’t build tiers into PyPI, and more ↩︎

4 Likes

I will add this to the rejected ideas section in PEP 752 today.

1 Like

I also wonder whether this is metadata that should be associated with the organization, rather than the project. I can think of other data that should be associated with an organization, and should be available via the JSON API. So it’s a matter of indirection: check the organization associated with a project, then get a list of packages and prefixes owned by the organization.

I can’t tell whether there is an existing API for querying organizations. There does seem to be one for users[1], though no JSON API. Organizations in general seem to be underspecified, so it’s not surprising. But if we’re going to be doubling down on the organization model, then exposing more organization-focused APIs would be required.


  1. which live in a separate namespace from organizations afaict ↩︎

3 Likes

The reason for making the assertion directly at the project level is so that not even the repository operators could lie about it by linking the organisation to projects it didn’t actually own.

My understanding is that orgs are underspecified because they started primarily as a way of enabling cross-project role-based access control in Warehouse, and hence didn’t have much significance outside PyPI. PEP 752 is the first time we’ve considered doing more with them, hence it needing to define how to represent them in the repository API.

Interesting! Where Barry’s suggestion would allow verifiable authorisation assertions for users and organisations, the attestations would allow them to be added at the level of individual artifacts.

(For anyone else that missed PEP 740 when it was published and provisionally accepted: PEP 740 – Index support for digital attestations | peps.python.org)

I don’t think you’re being too negative, I think you have a genuinely valid point (it took me a bit of searching to confirm that the TUF work actually was still progressing, since the focus for the last while has been on building a separate set of services that can be integrated into PyPI, rather than building the functionality directly into the Warehouse code base).

I’m also reminded of the quote “Yes is forever, no is for right now”, as the big thing I’ve personally taken out of this discussion is to go from a mild +0 on the namespace prefix reservation idea to a firm -1.

Between potentially coming up with an ACME-inspired HTTPS-based protocol to allow organisations (and individuals) to assert control over projects and accounts on PyPI, and potentially leveraging PEP 708 to enable cases where PyPI is just a distribution platform rather than the source of truth for a package, I now think we have less disruptive, lower overhead ideas that should be explored before we contemplate making fundamental changes to the way PyPI handles registration of new project names.

If implicit or explicit namespaces within a repository still make sense after we’ve defined a way to use repository provenance records as namespaces (think name==version @ repository_url, with similar syntax to direct URLs but pointing at a repository rather than at a distribution artifact), then we could reconsider the idea then.

3 Likes

Actually there is, although it is undocumented. For example:

$ curl https://pypi.org/org/certifi/ | less

I don’t believe there is a JSON API to this data (yet :smile: ).

It won’t be the last :wink: I have a PEP in the works which I hope to post next week that also builds on org accounts for some functionality.