PEP 458: Secure PyPI downloads with package signing

I think @mnm678’s reply already covers that, but I wanted to underline that it’s often the transport that provides compression (e.g., via HTTP Content-Encoding), and thus pre-compressing the metadata at rest becomes redundant. Further, a compressing filesystem like btrfs could help with metadata sizes at rest.

There are a couple of bits of confusion here. Content-Encoding is not a transport compression feature; it is a representation compression feature that preserves the underlying media type. Transfer-Encoding is a transport compression feature. But perhaps that’s not a distinction that’s important for your point?

Generating Content-Encoding representations on the fly is computationally expensive, and doing so for static resources is poor practice for a number of reasons; compressed representations of static resources (such as write-once metadata files) should be generated once, when the primary resource is generated, and removed when it is removed.
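
Concretely, the sort of thing I mean is the minimal sketch below: write the compressed sibling once, at the same time the metadata file itself is written. The file names and layout are purely illustrative, not what Warehouse actually does.

```python
import gzip
import shutil
from pathlib import Path

def write_precompressed(metadata_path: str) -> Path:
    """Write a .gz sibling next to a write-once metadata file, so the web
    server can answer Accept-Encoding: gzip requests with the pre-built
    representation (Content-Encoding: gzip) instead of compressing on the fly.
    """
    src = Path(metadata_path)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as f_in, gzip.open(dst, "wb", compresslevel=9) as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst

# e.g. write_precompressed("metadata/snapshot.json") -> metadata/snapshot.json.gz
```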

Depending on FS features like compression is putting the burden of performance on our users and mirror operators. I don’t think that’s a good choice.

Right, my bad. Content-Encoding would indeed mean that decompression has to be handled, and thus supported, by the client. Nevertheless, I still think that would allow us to save a bit of work on the implementation side.

This is true, although my understanding is that many web servers nowadays can cache content they have compressed before (I know that at least lighttpd’s mod_compress can do so), so the performance hit shouldn’t be too bad, no?

My broader point is that compression as a space/bandwidth-saving feature has been widely explored within HTTP, so it may be a good idea to handle it at that level.
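
For instance, here is a rough sketch of how little the client side has to do when compression is negotiated at the HTTP layer. This uses only the standard library and is illustrative, not what pip actually does:

```python
import gzip
import urllib.request

def fetch(url: str) -> bytes:
    """Ask the server to compress the response in transit and undo it locally."""
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        # urllib does not decode Content-Encoding for us, so handle gzip here.
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    return body
```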

This is true; thinking about it again, it would probably also make things a little less evident for everybody involved…

That issue has some interesting reading, but it makes me question TUF’s history rather than the idea of compressed metadata :). There are design choices TUF made in the past (e.g. “Clients may request either form of particular metadata.” leading to “Snapshot file becomes unnecessarily large in size because it must list both compressed and uncompressed forms of the same metadata.”) which are not the only choices available in a compressed-metadata design.

And the class of attack being defended against (processing unsigned data) must surely already be guarded against by TUF: if any TUF code were processing arbitrary JSON before checking signatures on the files, then it is already at risk should someone find an effective attack against the JSON decoder in $language. There have been such attacks before (CVE-2017-12635, for instance), and there probably will be again.

Beyond perhaps the root document giving the initial keys, that is (bootstrapping is hard), but preferably even that would never be processed except after verification of the signature. There were vulnerabilities in apt in this space in the past as well, where clearsigning of documents was handled incorrectly and content outside of the signed body would be accepted. Fixing this class of design problem would mitigate all (well, ignoring trivia like ensuring a canonical form) of the listed issues with compressed metadata that I can see; not fixing it means that TUF has a pretty large attack vector waiting for it, unless the compressed metadata layer was just poorly integrated, in which case integrating it properly should again take care of this.
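
To be explicit about the ordering I’m advocating, here’s a sketch: the signature is checked over the exact downloaded bytes, and only verified bytes ever reach the JSON decoder. `verify` is a placeholder for whatever signature primitive is in use; note that TUF’s actual format embeds signatures inside the JSON envelope, so this illustrates the construction rather than the current format.

```python
import json

def load_verified_metadata(raw: bytes, signature: bytes, verify) -> dict:
    """verify(raw, signature) -> bool stands in for the real signature check
    (e.g. Ed25519); it must run over the bytes exactly as downloaded, before
    any parsing or decompression touches them.
    """
    if not verify(raw, signature):
        raise ValueError("signature verification failed; refusing to parse")
    return json.loads(raw)
```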

Nevertheless, I think what I’m hearing is ‘the TUF libraries cannot do compressed metadata because TUF cannot, so if compressed metadata is important here, this is a chunk of prep work that needs to be done’. Is that right?

Sidebar: For an internet protocol, which TUF effectively is, JSON is pretty much the worst of all possible worlds: verbose and wasteful of bandwidth, no precise numeric type, no intrinsic canonical form.

There are many serialisation formats with better properties that TUF could adopt instead (such as BSON, Protobuf, flatbuffers/capnproto and so on), many of which would require much less space to start with, though the string components would still be amenable to significant entropy coding for further reduction in disk space. The lack of a canonical form in most of them would mean setting pretty hard behavioural constraints, though: in my last job at Cachecash we ended up writing a protobuf-alike simply because protobuf has no canonical form, and we needed to be able to generate a canonical form to sign documents (blockchain…).
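
For what I mean by a canonical form, a rough Python approximation is below: logically-equal documents must serialise to identical bytes so that signatures over them are reproducible. This is only an illustration; real canonical-JSON specifications (such as the OLPC variant) also pin down string escaping and other details.

```python
import json

def canonical_bytes(obj) -> bytes:
    """Rough approximation of a canonical JSON form: sorted keys, no
    insignificant whitespace, and no floats allowed at all."""
    def reject_floats(o):
        if isinstance(o, float):
            raise TypeError("floats have no canonical representation here")
        if isinstance(o, dict):
            for v in o.values():
                reject_floats(v)
        elif isinstance(o, (list, tuple)):
            for v in o:
                reject_floats(v)

    reject_floats(obj)
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")

# Two logically-equal documents now serialise to identical bytes, so a
# signature over canonical_bytes(doc) can be recomputed and checked anywhere.
```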

I think you are misunderstanding here: compression happens first, then the signature check, hence the compression vulnerabilities being mentioned. We could sign the compressed representation as well, but that’s not a common practice (hence the compression vulnerabilities being an issue).

I’m sorry, but I think there are a couple more misunderstandings here, from the assumption that TUF is an internet protocol to ignoring that JSON is probably the second most widely used standard for messaging over HTTP. Isn’t the Warehouse API itself JSON (plus some legacy endpoints and XML-RPC)?

Protobuf is probably the worst we can adopt in terms of security. We’ve explored BSON in the past, but there were no mature implementations at the time (also, I don’t think there’s a standardized way to sign BSON buffers). The rest I believe are covered in Uptane (a standardized variant of TUF for automotive systems), yet I still don’t understand why JSON is such a bad idea if it’s being widely used everywhere else and within this ecosystem.

Leaving the asides aside: yes, there is a little bit of prep work that can be done in that regard on the client side. I’m curious, though: are simple indexes compressed at all, or how did this even become an issue? Would it be worthwhile to add compression support for simple indexes as well, then?

I concur with Santiago. While it’s a good idea to optimize when possible, I don’t believe we should prematurely optimize just yet. In any case, we can always add compression to the TUF reference implementation w/o too much trouble.

Thanks for raising this concern. We have indeed thought about this. For example, see this issue. However, I should also note that we have been through three independent security audits where I don’t think this was raised as a significant concern.

I think that this is the main question. Compression will require some extra prep work. Is it important enough to warrant this extra effort? It would be nice to get some feedback from maintainers of PyPI/pip about how much of an issue they foresee in this area. The size of the metadata is discussed in the PEP, and is estimated to be about 1.5 MB for a new user or 100 KB for a returning user. @pradyunsg @cooperlees @EWDurbin

Catching up here: is this discussion around HTTP compression or something internal to TUF?

We currently have support for HTTP compression across the /simple/ index via the Accept-Encoding header. I’d have to check but I believe only gzip is supported.

Assuming TUF metadata files benefit from this compression, is the question whether or not we’d enable this compression when serving the files?
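
A quick way to check what’s currently negotiated is something like the snippet below (illustrative, stdlib only); it just prints whichever encoding the CDN/origin actually applied.

```python
import urllib.request

req = urllib.request.Request(
    "https://pypi.org/simple/pip/",
    headers={"Accept-Encoding": "gzip, br"},
)
with urllib.request.urlopen(req) as resp:
    # Shows which encoding (if any) was applied to the response.
    print(resp.headers.get("Content-Encoding"))
```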

I believe the discussion is about compression internal to TUF. If HTTP compression is already supported, this may prove redundant.

Do you mean decompression, then the signature check? If so then yes, I do understand; that is my point: TUF’s previous compression implementation layered compression(integrity(data)) rather than integrity(compression(data)). As to whether it’s a common practice or not, I’m not familiar with any code-review research either way :confused: Certainly for any code dealing with untrusted inputs it is a risk to layer anything other than an integrity validator at the outermost layer, and that is (or at least my understanding is that it is) common knowledge.
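
Spelled out as code, the two layerings look roughly like this. The `verify_*` callables are placeholders for whatever signature scheme is in use; this is a sketch of the two constructions, not of any particular implementation:

```python
import gzip
import json

def client_compression_outside(compressed: bytes, verify_embedded) -> dict:
    # compression(integrity(data)): the decompressor and the JSON parser both
    # operate on unverified input before any signature is checked.
    doc = json.loads(gzip.decompress(compressed))
    if not verify_embedded(doc):
        raise ValueError("bad signature")
    return doc

def client_integrity_outside(compressed: bytes, signature: bytes, verify_detached) -> dict:
    # integrity(compression(data)): the signature covers the compressed bytes,
    # so only the signature check ever touches unverified input.
    if not verify_detached(compressed, signature):
        raise ValueError("bad signature")
    return json.loads(gzip.decompress(compressed))
```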

So a protocol is just a set of rules governing the exchange or transmission of data between devices. When those rules are sufficiently close to the common behaviour that one can wave their hand and say “it’s a JSON API”, then it is not a new protocol - there is no new handshake, no new semantics, no new behaviours - though there will be new data (one hopes :)). TUF has a number of new expectations, including e.g. the use of a canonical form, no use of floating point numbers in JSON, and document linking by schema-identified fields. All of this is fine, but to understand it and operate on it, a set of rules is needed - this makes it a protocol.

Warehouse follows regular JSON API conventions more closely (not to the level of regularity of e.g. JSON-HAL), but certainly to the level that the JSON API is provided without much of an operating manual. Is Warehouse a protocol? Maybe. I’m not sure ;).

Most of Protobuf’s security properties would only be relevant in a compression(integrity(data)) construction, which is obviously an insecure construction. (The wiggle bit there is the lack of canonical form - that’s what drove us to write our own thing, but I digress)… this isn’t a TUF discussion forum, so I’m going to leave this aspect of the discussion here. I will say I’m really surprised at it all, and leave it at that :). We uncovered this class of issue in repositories in the mid-2000s, so to have it not be taken into consideration in the design of a newer system just threw me for a loop.

Compression came up for me because of my background in Ubuntu and Debian. Repository security there is a much simpler scheme, effectively just a Merkle tree to the root signed by the server (plus source packages signed by developers). Total security overheads per client are kept very low, partly due to the single large index (for which incremental updates exist). So a 5%-10% increase in download size plus double the number of network objects to retrieve seems likely to have a substantial impact on client performance. Perhaps I’m wrong! Is there some way to tell?
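
For readers unfamiliar with that scheme, here is a toy sketch of the hash chain. The field names and index format are made up for illustration; real apt metadata looks different, and the Release file is what carries the server signature.

```python
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_chain(release_doc: dict, packages_index: bytes,
                 deb_name: str, deb_bytes: bytes) -> bool:
    """One signed root document pins the hash of each index, and each index
    pins the hash of every package it lists; verifying a package needs only
    the signature on the root plus two hash comparisons.
    """
    if release_doc["packages_sha256"] != sha256(packages_index):
        return False
    # Toy index format: one "name hash" pair per line.
    listed = dict(line.split() for line in packages_index.decode().splitlines())
    return listed.get(deb_name) == sha256(deb_bytes)
```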

pip new installs are unfortunately a common case - CI systems in particular tend to download everything fresh every time.

This is exactly what I was originally trying to bring up. I think @rbtcollins and I have digressed way too much…

There’s been a lot of discussion, so I wanted to collect the open items to make sure we address any outstanding issues.

  • Back signing: There was consensus that this would be a good idea, as long as back-signed packages are differentiated. As this will take place outside the SoW for the PEP (according to @woodruffw), where should it be documented? (cc @joshuagl)
  • Should the PEP define compression for TUF metadata? There was lots of discussion here, but I think the first question is whether TUF metadata compression is necessary. HTTP compression is already supported in PyPI. Does this resolve the concern?
  • There were a couple of concerns about the TUF reference implementation that could use some more discussion on the mailing list/issues so that others in the TUF community can weigh in, especially discussions about JSON encoding and built-in compression support. These questions are important for the reference implementation of TUF, but do not affect the specification recommendations in this PEP. (https://groups.google.com/forum/?fromgroups#!forum/theupdateframework)

FYI: some (mostly minor) changes to the PEP were collected in a PR: https://github.com/python/peps/pull/1284

These have been merged now. The most notable change was to move away from recommending that the PSF Board be the offline key holders, to instead recommending that they appoint the key holders (either directly or by delegating the task to the PyPI admins).

The rest of the updates were just clarification and consistency edits, and the addition of a background link to the CNCF post.

I think we should address this in the PEP, as the PEP “proposes how The Update Framework (TUF) should be integrated with the Python Package Index (PyPI)” and back-signing with differentiated signatures is part of that proposal.

In terms of what we’d recommend in the PEP, one approach would be to have an additional targets delegate for retroactive-bins/back-signed-bins with associated retroactive-bin-n roles. Are there alternative approaches that would make more sense?
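
To make that concrete, a hypothetical shape of the extra delegation in the targets metadata might look something like the sketch below. Role names, paths, key IDs, and thresholds are placeholders for illustration only; this is not what the PEP currently specifies.

```python
# Hypothetical delegations section of the top-level targets metadata, adding a
# retroactive-bins role alongside the existing hashed-bin delegation.
delegations = {
    "keys": {
        "<keyid>": {"keytype": "ed25519", "keyval": {"public": "<public key>"}},
    },
    "roles": [
        {   # existing delegation for newly uploaded distributions
            "name": "bins",
            "keyids": ["<keyid>"],
            "threshold": 1,
            "terminating": False,
            "paths": ["*"],
        },
        {   # hypothetical delegation for back-signed (retroactive) distributions
            "name": "retroactive-bins",
            "keyids": ["<keyid>"],
            "threshold": 1,
            "terminating": False,
            "paths": ["*"],
        },
    ],
}
```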

Looking over the PEP today I noticed that we have Figure 1 and Table 1, which both describe TUF roles and responsibilities. The contents are almost the same; only the description in Table 1 for the targets and snapshot roles differs slightly from the text in Figure 1, and it feels like it has been refined.

Shall I submit a PR removing Figure 1 and referring to Table 1 instead? cc @mnm678 @trishankatdatadog

Sure! My only concern is that we should work to get the PEP approved, which means that @dstufft, @ncoghlan, and the rest of the community approve. Maybe we can add this a bit later as an addendum?

Yes, please!

Although now I’m confused. The Figure 1 I see when I browse to the PEP is a rendered version of the table:


but the image with the same filename on GitHub is a diagram:

Do we actually want to embed the latter image as Figure 1, and did the PEP publishing process get confused by the file rename?

Yes, the second image is the one intended to be in the PEP. I can try to investigate how the publishing process used the wrong one.
