PEP 458: Surviving a Compromise of PyPI

It took me a while to realize what I don’t like about PEP 458. It mixes the issue “How to surviving a compromise of PyPI” with a technical solution (TUF). It feels like the PEP is tailored for TUF without exploring alternatives or even verifying if the PEP is asking the right questions. TUF might be the only viable solution, but it’s impossible to gauge when the text is written as “Any security framework you like as long as it is TUF”.

I also feel like the PEP is not written for consumers of PyPI. It gets very technical very fast. Although I’m working in security engineering and deal with security on a daily basis, I find the PEP hard to read. Perhaps it would help if you split the PEP into a general, less technical and more user oriented PEP and a technical PEP for TUF+PyPI. @steve.dower did a really good job with PEP 551 and PEP 578.

I would like to see a general and user-oriented PEP about PyPI security to answer these questions:

  1. How is a package owner/maintainer able to verify that PyPI is serving correct and unmodified files?
  2. As a user of PyPI how can I make sure that pip installs correct and unmodified packages?
  3. As a user of PyPI how can I protect myself against typo-squatting attacks or compromised versions of a package?

Personally I see malicious content and package trust as a more pressing issue than a compromise of PyPI infrastructure. As a member of the Python security team (PSRT) I’m getting reports about typo squatting or malicious packages every week. (Fun fact: There we four email threads about malicious content on PyPI this month and today is just Dec 4.) There might be easier ways to detect a compromise of PyPI until TUF is implemented, e.g. PyPI could push SHA-256 hashsums of all files to a git repo or to an append-only database similar to Certificate Transparency?

5 Likes

This is also my concern (and I’m also on the PSRT, and have insight into Microsoft’s tracking of certain attacks).

As far as I’m aware, nobody has managed to corrupt PyPI’s data stores in a way that TUF would detect. But every day people are uploading/downloading packages with malicious content that are “legitimate” according to the metadata. Without a publisher attached signature, users have no way to determine trust based on who uploaded it.

The workaround most of us are going for is to create private PyPI mirrors with curated packages (on DevPI, Azure Artifacts or (IIRC) Artifactory). pip is slowly but surely getting the kinks out of its support for alternate indices and making this more viable, though many take the opportunity to switch to a multi-language package manager.

Fundamentally, package (and publisher) reputation is far more critical right now than Warehouse-internal tamper detection. And if there have been attacks that would justify this particular investment, it would be great to hear about them (even sent to just the PSRT).

4 Likes

Thanks for the clarifications, Sumana. I guess one thing I am asking then: When an important discussion like this is held on Discourse, occasionally post a link to python-dev to alert the core dev team and other interested parties of the process. I have no problem following a thread on Discourse, but I don’t regularly scan Discourse for new threads, and Discourse doesn’t send me useful emails. I do receive python-dev in my inbox and always scan it for important topics.

I think this is precisely right. In my view, the PEP process and discussions on Discourse and the mailing lists are a perfectly fine way to involve the community in packaging proposals (and I very definitely think the whole community should be involved - to the extent that they want to be - not just packaging specialists).

The problems here seem to be twofold:

  1. Discussions on a github PR are not part of the normal PEP process, and many people don’t track them. IMO, PR reviews should be considered as private conversations, and always summarised back to a more public forum.
  2. There was a fairly substantial discussion (as I understand it) at the packaging summit at Pycon. Face to face discussions like this have always been controversial - they are very high bandwidth, which is a huge benefit, but they are also very exclusive - even if the results are summarised and posted to the lists, the summary doesn’t give any sort of feeling of the “flavour” of the discussion, and so readers miss out on a lot of context and nuance. And to be frank, most face to face discussions aren’t even summarised very well, so really “you had to be there” is the norm.

Add to that the fact that this topic is of interest to a limited group with pretty specialised skills, and it’s really hard to find a good way to engage with the community. But that’s precisely why it’s important to try even harder to do so. And IMO, that’s where the PEP authors here need to improve things. We’re now in a situation where there’s available funds and a proposed implementation, and we’ll see a lot of pressure to “not let bikeshedding and debate get in the way of the opportunity” - whereas I think rushing something in without proper community oversight is actually worse.

As an example, @tiran made some significant and important comments here, and these deserve to be addressed. @steve.dower has similar concerns.

The history here really isn’t too important. No-one acted in bad faith, there’s just been a series of misunderstandings and miscommunications. But thankfully, we’ve exposed the issue and can now hopefully have a productive discussion here that will converge on a proper community-approved solution that addresses the issues. I trust that the PEP authors will engage with that discussion, and can be impartial in discussions over “TUF versus other solutions”, but if they feel that they have a conflict of interests and are unable to represent non-TUF solutions fairly, then I’m equally sure we can find other champions for such solutions. The PEP process has handled far more complex debates in the past, so there’s no reason to feel that it can’t handle this.

3 Likes

This confusion could easily have been avoided if the RFCs had been named more imaginatively.

To follow up some of the discussion here (and to jump around in order a bit to speak to the non technical stuff first):

While I’m one of the people in the other thread advocating to not use the PEP process, it’s got nothing to do with not interacting with the community. It was mostly my fault, I’m in the middle of a job search (well tail end of it now hopefully) and I simply got lazy (or more charitably, I was distracted) and answered inline in Github rather than asking to have it published here and answered here.

Most of the other discussion around this PEP did occur in the open, it was just quite some time ago and on the disutils-sig mailing list. I do not believe that we had an in person discussion about it at PyCon (other than people saying they wanted packaging signing, and pointing them to PEP 458) but I could be misremembering that.

No. This PEP completely transparent to package authors and to package consumers (except if it detects something going wrong, as it does add additional failure modes that may be exposed to end users).

In the general case, no (for pip yes but for different reasons).

No.

Fundamentally the impact to various groups (at least in this PEP) is roughly the same as going from HTTP to HTTPS was. For your average publisher/consumer of Python packages, it’s just something that the tooling handles for them, for PyPI it requires changes to the infrastructure and for a tool like pip it requires code to handle the new protocol.

It sort of isn’t I think? Random consumers of PyPI basically shouldn’t care about the details of this PEP anymore than they care about the details of how TLS works.

I don’t think a PEP like this is actually super useful, this feels to me more like something that should be documented either on PyPI or as part of packaging.python.org. I mean it could be documented as a PEP as well, but ultimately to me it’s not really asking for a proposal, it’s just documenting how to-s.

Particularly (3) isn’t related to PEP 458 at all, since it doesn’t seek to solve that problem at all (I could think of some ways to make it part of the solution, but it would b e apretty minor part overall). (1) and (2) the answer as far as this PEP is concerned is “you don’t have to do anything, it gets handled the same way TLS ensures that an attacker with a privileged network position can’t change the content of responses”.

While we could discuss the relevant merits of different problems to focus on solving, we don’t really get to dictate what exactly we have people willing to work on. This particular PEP was written by volunteers (If memory serves me correct they were grad students at the time) and wasn’t directed work from the PSF or the community. We don’t get to tell volunteers what to work on (unless they ask us), so we have to judge contributions as they come in, not what we’d rather they work on. So in terms of whether we ultimately decide to accept this PEP we don’t really get to decide the effort spent writing/discussing it would be better spent on something else.

We do have some funds that we plan on using to implement this PEP if it is accepted, and perhaps @sumanah or @EWDurbin could better answer this question, those funds were given to us with the understanding we’d use them to implement, among other things, “cryptographic signing” (though it was left open ended what exactly that entailed) so even in that case we have limitations on how we’re able to direct work to be done since part of it needs to be implementing a cryptographic signature scheme for PyPI (part of it is also implementing malicious package detection, but that doesn’t have a PEP because it’s just a new feature of the PyPI code base and doesn’t have ramifications for projects beyond PyPI really).

So AIUI, the current state of things is we’re funding both things, but we need to do both things, or we need to give back (or something? I don’t really know how it works) a portion of the money since we’d ultimately be deciding not to move forward with implementing a feature that satisfies the conditions it was given under.

So in terms of package signing we have 3 options:

  1. Discuss/refine this PEP it until it’s ready to be accepted and then move forward with it.
  2. Someone else volunteers to write a competing PEP and we discuss the relative merits of them both and choose between them.
  3. We do nothing and we let the PSF figure out what the fallout of not implementing package signing would be.

Personally I think we should do either 1 or 2, but unless someone steps forward to do (2) our only real options are 1 and 3 and since I both agree with the idea of getting packaging signing onto PyPI and that TUF is, to me, the best option i’ve seen I’d love for that answer to be (1) (barring someone coming forward to do 2 and coming up with something even better)

PyPI? Not yet, other similar repositories? Yes they have (see for example Rubygems).

There’s also a wider impact to this as well. We currently rely on TLS to ensure that the bits that a user gets is what PyPI is trying to serve them. That works reasonably OK if a user is talking to PyPI itself, but what about mirrors of PyPI? Right now there’s no mechanism in place to ensure that if you ask for “foo-1.0.tar.gz” from a mirror of PyPI, that you’re actually getting what PyPI thinks foo-1.0.tar.gz is. This is a big problem in places like China where access to PyPI is incredibly slow (I’ve had private reports of pip install taking days to complete if you’re not using a mirror there, just due to the bandwidth limitations of trying to go through the great firewall). Currently the only real solution for those people is to either suck it up, or install from a semi random mirror inside of the great firewall. We could try to have the PSF put a presence inside of China, but that only solves the problem for China not for other locations, and specific to China I believe the legalities of doing that are kind of tricky (but that’s a better question for the PSF itself, not me).

Like I said above, this isn’t really related to this particular PEP, unless your argument is this idea isn’t useful and we should only focus on that.

2 Likes

I’m not trying to make the argument now, just wanted to add the context for the new participants that this was the argument I was making when (allegedly) we hadn’t decided to fund this PEP.

I am concerned that we need to be clear that this feature doesn’t help protect against legitimately uploaded malicious packages, which is currently 100% of the malicious packages we’re aware of. As long as that’s communicated clearly (including to all the media sites who will no doubt write this up), I’m fine.

The last thing I want is for people to get the sense that PyPI security is “solved” and stop seeking to invest in it (and I don’t think this is overblown - I’ve heard people legitimately call PyPI “fundamentally insecure” because of a typosquatting attempt).

1 Like

We have limited ability to control how other people communicate this potential change but I know I certainly don’t consider this “solving” PyPI security (or that security is something that can be “solved”, it’s like saying performance is “solved” you can move the needle in one direction or the other or solve specific issues, but you can’t, as a broad statement, “solve” it).

I’m personally unlikely to be the one doing much communication around this, but I do think it’d be great to call out in the PEP itself that this particular problem is out of scope for this PEP (not that it would be an exhaustive list of what’s out of scope, since there is lots of security related work that this PEP does nothing for, but since that is a pretty big topic and there’s a decent chance of confusion, it would be great to be explicit in the PEP).

2 Likes

I also do want to be explicit that this PEP isn’t a forgone conclusion. While I do currently believe the premise of it is a good idea, if others disagree with that I want that discussion so we can figure out if it is in fact a good idea or not.

2 Likes

That’s precisely the sort of thing the PEP should make clearer.

To be clear, I have no problem with doing (1). But that doesn’t mean that the PEP shouldn’t be sufficiently clear to allow non-technical users to read and understand enough to know what it’s providing (to use your analogy, I don’t know how https is implemented, but I know what it’s for, what it protects against, roughly how it does it, and importantly, what it doesn’t protect against - the PEP should give the same level of understanding here).

There’s a somewhat new situation here that we’re having to navigate. We have got some volunteers, we’ve got some money to let them do what they propose, but we still need to ensure (as a community) that we want what they are offering, and someone is willing to pay for. Having known community specialists like yourself support the proposal is a good step in that direction, but it’s not the whole story.

Some other things that are typically covered in a PEP but which are missing here:

  • Review of how other ecosystems handle this issue. This data integrity issue isn’t unique to Python. How do other languages (rust, javascript, ruby) and distributions (Red Hat, Ubuntu, Homebrew, Microsoft (nuget)) handle it?
  • Discussion of how “PyPI consumers” should implement this. In view of our principle that we avoid implementation defined behaviour, I’d like to see an explanation of how a tool that wants to consume data from PyPI would implement the consumer end of the protocol. Presumably in terms of using the TUF library from PyPI. I don’t think it’s acceptable to expect tools to copy pip’s implementation. (An obvious example of a tool would be distlib, and we have a goal to make it easy to write new standards-based tools, so we should take that into account).

I’d also like to see the PEP title changed, as it’s currently basically meaningless. Something like “Implement (whatever it is we’re implementing) for PyPI using TUF” would much better explain the proposal - and would make searching for the relevant PEP a lot easier as well!.

PEP title: I suggest “Secure PyPI downloads with package signing” which gets across:

  • this is about downloads (not uploads)
  • this is about PyPI (not other tools)
  • the way we do it is with something having to do with the signing of packages

Trying to keep the PEP title short but descriptive, how about “Improved PyPI download integrity via signed packages”?

The problem I have with “signing” is that it implies authors will do the signing, which is precisely what we want to avoid. To use @dstufft’s analogy again (which I like a lot) HTTPS uses signing, but internally - to the end user it’s not about signing at all, but about secure communication.

1 Like

I’m a coauthor of PEP 458 and a maintainer of TUF. I wanted to weigh in here on some of the issues raised about the PEP in this discussion, and present some ideas for moving forward.

First, I want to note that while I am a maintainer of the TUF project, my goal here is to enhance the security of PyPI. While I think TUF will help achieve the security goals specified in this PEP, if the community has other ideas or would like to pursue alternative strategies, I would be happy to contribute to and support those as well.

It seems that the title and introduction to the PEP are causing some confusion about the purpose and scope of this PEP, and that some of the prior discussion about this PEP did not make it into the text. To that end, I have some suggestions for areas to discuss here and in the PEP:

  • Change the title: We need something that better reflects the contents of the PEP. I suggest something like “Supporting end-user verification of PyPI Packages”, but am open to other suggestions.
  • Change the introduction to include a brief description of the problem the PEP solves, and what entities the document affects. As mentioned earlier in this thread, this PEP solves the problem of making sure that users get the specific packages that PyPI serves (even if they use a mirror), and providing some protection in case PyPI is compromised. These changes will affect only the PyPI infrastructure (not package maintainers). In addition, they allow PyPI consumers like pip to verify packages. Though most of this information is already in the PEP, the project goals are not clearly stated at the beginning of the document, and this should be addressed.
  • Include a description of how a client can interact with the solution proposed in PEP 458.
  • Contrast the proposed solution with other solutions. We need to explain early in the document how TUF is different from other solutions like TLS and GPG signing.
  • Consider including an out of scope section. There was some discussion here about the issue of typosquatting packages, and there are many other security issues this PEP does not cover. Rather than a long and incomplete list of what this PEP does not do, I would prefer a clearer definition of what the PEP does do as described above. However, if others think an out of scope section is necessary, please suggest what content belongs in this section.
6 Likes

First, let me acknowledge my conflict of interest, having written the PEP, and since I am also one of the lead researchers and developers for TUF.

Having said that, I cannot think of any other system that provides the level of security that it does while preserving usability for developers. We have put years of thought into this, especially as we learned from experience.

Other solutions (such as Certificate-Transparency-like solutions) may be simpler from a usability point of view, especially as there is literally nothing for developers to do (just like this PEP, actually), but it does not provide any trustworthy information whatsoever as to who wrote the original source code, and who or what built the source code into packages.

OTOH, you can combine TUF with other solutions to provide end-to-end guarantees that your package was not tampered with anywhere between, say, Django developers and end-users. Indeed, this is what we have done with the Datadog Agent integrations.

We thought that it would be good to see the same level of usable security on PyPI, which is why we originally wrote this PEP. Let us know if you have more questions.

1 Like

Thanks for summarizing these points @mnm678!

Does anyone have anything to discuss regarding the suggestions? I notice that there have been a number of “Like” responses from those who have raised concerns, but those don’t seem particularly actionable :slight_smile:

As someone who “liked” that post, my intention was to express support for the idea of doing those things. I don’t have a proposal as such on any of them, I’d like one of the PEP authors to respond with a proposal that can be considered and discussed. To be honest, as @mnm678 is one of the authors, I was hoping that the number of “likes” would encourage her to do so.

I was sort of hoping you’d provide answers to the questions already posed, before we started asking further questions. That way, the discussion would feel more like a dialogue, which is honestly where I currently feel the process is failing here.

I’m getting very frustrated at this point. There has been a lot of feedback on this thread, and little or no response from the PEP authors (other than @mnm678’s post, which was a good start but @EWDurbin’s follow-up seems to imply that it’s not going anywhere without more input from the people who have already commented :slightly_frowning_face:). There still seems to be this misunderstanding that the response to feedback should be a revised PEP - and that’s absolutely not the point here, we want a discussion first with consensus before the PEP gets updated.

Honestly, as things stand I don’t feel that I can support this PEP. Not for any technical reasons, but simply because it’s failing to follow the basic principle of community consensus. Obviously if the authors are ignoring the need for consensus, they can also ignore my views, and I guess there’s not much I can do about that. But it seems a shame if we can’t reach agreement here somehow.

1 Like

I am not a PEP author, yet I do think TUF is what we need.

As a bystander on the conversation, I’m somewhat confused as to what you mean with this: do you either mean that the suggestion should be somehow materialized somewhere before or after community consensus? Are you not supporting the PEP (RFP?!) because you don’t feel there’s consensus (yet?). It appears to me — again, only an outsider — that the process is somewhat unclear even to people inside of the community, and thus I’d expect more leniency. I’m hoping that you, as a more senior community member, could perhaps help with specific pointers on how to help drive community consensus (be it pro or against)?

As far as I’m aware, here are the questions (apologies if I missed any):

  • Does TUF help against typosquating? (or any other signing solution for that matter?)
    no it doesn’t, as far as I’m aware.
  • Should the name reflect that?
    definitely, I think the authors could clearly outline the limitations of this and arguably any other repository/package signing /integrity-solution.
  • Does this mean that developers must sign all packages when uploading to PyPI?
    as far as I’m aware, no. That’s part of what’s discussed on PEP 458/480. Those who do not want to manage their keys will have packages signed by an automated element running in the PyPA infrastructure (again, I’m not in a position of authority to clarify this)
  • Review how other ecosystems handle the issue (also grouped up with "this has never happened to PyPI as far as we are aware).
    I think there’s quite some literature review around it, as well as other successful implementations in other ecosystems. What exactly where you thinking would pass the bar of “enough of a review”?
  • Backwards compatibility with non-signing solutions.
    I think @dstufft’s HTTP metaphor pretty much outlines how this would behave :slight_smile:

I believe it’s possible to work out a plan to drive community consensus around those questions. What I’m not completely aware of is how to actually turn this into something that’s actionable.

edit: for some reason I was trying to format this and got sent, apologies.

Hey @pf_moore. I think – to second what @SantiagoTorres just said – the PEP authors here are genuinely trying to work with the community here and build consensus, and have benefited from and could use more mechanical guidance on how to do that, because their previous attempts evidently didn’t work quite right. (Examples of such guidance: “do as much work as possible in discussions on Discourse/mailing list, including things like suggesting specific textual changes and reviewing/asking for review, even though we’re generally used to doing that work in GitHub pull request reviews when we work together on code.” “When a community member asks a question or shares a criticism of the existing proposal, it’s better to have a back-and-forth conversation with lots of little iterative questions and answers, instead of the proposal author coming back with a multi-paragraph essay or proposal revision. Yes, even if that feels spammy to you at first and means the thread ends up being hundreds of posts.” Are those right? I’m inferring. PEP discussion experts – are these right? In particular, is there newish updated “what to do where” guidance on PEP stuff that reflects our new Discourse reality?) And here’s another example:

“Liking” a post on Discourse does not have clear semantics. :slight_smile: It could mean “I like how you expressed this” or “I’m glad you spoke” or “welcome” – it does not clearly mean “yes, please do the things you have proposed.” So now I am glad that you have explicitly said to @mnm678 that you want her to go ahead and expand on those items.

In the absence of @dstufft as BDFL-Delegate explicitly guiding the discussion here, I suggest that @mnm678 go ahead and share – here in the thread, not as a pull request – some thoughts per her previous post, especially/starting with:

1 Like

I will aim to answer the current questions in this post. In addition, I will start a list of action items and proposed changes to the PEP based on the discussion in this thread. These are meant as a starting place, so feel free to give feedback and discuss.

Here are my answers to some of the questions from the thread. It’s a long thread, so please let me know if I miss any pressing questions.

No. PEP 458 and TUF solve a different security problem (making sure that users get the packages that are hosted on PyPI)

Yes. There was some discussion earlier in the thread about options for new names for the PEP. I think more discussion here is needed to form a consensus.

PEP 458 does not provide any mechanism for developer signing of packages. It adds signatures after packages are uploaded to PyPI. An extension of this work (such as what is described in PEP 480), would allow for developer signing to further extend the chain of trust (developer to PyPI as well as PyPI to user)

Here is a current list of action items:

  • Change the title: This needs more discussion. Some earlier posts with various proposed titles were a good start, but there doesn’t seem to be consensus yet. Hopefully more clarification about the goal of the PEP will help here.
  • Add a couple paragraphs to the beginning of the PEP (probably the abstract) to give an overview of the PEP. Here is a draft of that section as a starting place:

Attacks on software repositories are common (https://github.com/theupdateframework/pip/wiki/Attacks-on-software-repositories). The resulting repository compromises allow attackers to replace popular packages with malware that looks like the original package. In addition, an attacker on a repository can use any online keys (like those provided by TLS) to validly sign arbitrary packages. These and other attacks on software repositories are detailed here. This PEP aims to protect users of PyPI from malicious packages and to provide a mechanism to recover from a repository compromise.

To provide compromise resilient protection of PyPI, this PEP proposes the use of TUF. TUF provides protection from a variety of attacks on software update systems, including arbitrary software installation and rollback attacks, while also providing mechanisms to recover from a compromise. It does this by generating signed metadata using threshold signatures and offline root keys that can be used to verify the accuracy and timeliness of software packages. More details about TUF are included later in this PEP and in the specification.

This PEP describes the changes to the PyPI infrastructure required to produce TUF metadata. These changes should have minimal impact on other parts of the ecosystem. The PEP focuses on communication between PyPI and users, and so does not require any action by package developers. Developers will upload packages using the current process, and PyPI will automatically sign these packages. In order for the generated TUF metadata to be effective, additional work will need to be done by PyPI consumers (like pip) to verify the metadata provided by PyPI. This verification can be transparent to users (unless it fails) and provides an automatic security mechanism. There is documentation for how to consume TUF metadata in the TUF repository. However, changes to PyPI consumers are not required, and can be done according to the timelines and priorities of individual projects.