PEP 763: Limiting deletions on PyPI

woodruffw · October 28, 2024, 4:16pm

This is the discussion thread for PEP 763: Limiting deletions on PyPI

PEP text: PEP 763 – Limiting deletions on PyPI | peps.python.org

Previous discussions:

Summary

This PEP proposes limiting when users can delete files, releases, and projects from PyPI. A project, release, or file may only be deleted within 72 hours of when it is uploaded to the index. From this point, users may only use the “yank” mechanism specified by PEP 592.

An exception to this restriction is made for releases and files that are marked with pre-release specifiers, which will remain deletable at any time. The PyPI administrators will retain the ability to delete files, releases, and projects at any time, for example for moderation or security purposes.
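As a rough illustration of the rule summarized above, here is a minimal sketch of the eligibility check. The `Upload` type, the `can_delete` function, and the inclusive 72-hour boundary are assumptions made for illustration; they are not taken from the PEP text or from PyPI's codebase.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# The 72-hour window comes from the PEP summary; treating the boundary
# as inclusive is an assumption.
DELETION_WINDOW = timedelta(hours=72)


@dataclass(frozen=True)
class Upload:
    """Hypothetical stand-in for a file, release, or project on the index."""
    uploaded_at: datetime
    is_prerelease: bool = False  # assumed to be derived from the version elsewhere


def can_delete(upload: Upload, now: datetime, is_admin: bool = False) -> bool:
    """Return True if deletion would be permitted under the summarized rule."""
    if is_admin:
        return True  # administrators retain the ability to delete at any time
    if upload.is_prerelease:
        return True  # pre-releases remain deletable at any time
    return now - upload.uploaded_at <= DELETION_WINDOW


t0 = datetime(2024, 10, 28, 16, 0, tzinfo=timezone.utc)
print(can_delete(Upload(uploaded_at=t0), now=t0 + timedelta(hours=73)))
```

Anything that fails this check would fall back to the PEP 592 “yank” mechanism rather than deletion.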

barry · October 29, 2024, 1:10am

An important issue that this PEP does not cover is the effect of file and project quotas on the occasional need for file/release deletions. Some projects have large wheels and bump against these quotas. When quota increase requests are delayed or rejected, and because there are often no good alternatives (including known issues with --extra-index-url, lack of PEP 759 type external wheel hosting, etc.), a mitigation that allows releases to happen in a timely manner includes deleting old files. File deletions are never a first choice, but sometimes, given the current state of the ecosystem, they are the only viable option.

At the very least, this PEP should acknowledge and address this (mis-?)use[1] case, within the context of other initiatives for handling large projects in the ecosystem.

[1] but practical and effective

woodruffw · October 29, 2024, 1:24am

Thanks for bringing that up @barry!

I knew that quotas/the existing need to delete files to remain under them was going to come up, but I wanted someone to raise it before starting to work it into the PEP itself.

In terms of how the PEP should address this, I had two initial ideas:

1. The PEP could stipulate its current policy for limiting deletions (i.e. “<72h or pre-release”), but condition its rollout on the completion of a separate effort to limit the current observed need for deletions. In other words: the PEP would describe how PyPI will limit deletions, but the actual mechanism for limiting deletions would remain disabled/admin-flagged until a larger fix for the quota problem was settled on.
2. The PEP could define an escape hatch for the deletion requirement, and specify that PyPI’s admins will temporarily allow-list projects that currently have a need for deletions that isn’t satisfied by periodic quota increases.

Of these two I lean mostly towards (1), since this feature would likely live behind an admin flag for a while anyways (like many big changes to PyPI do). But I’m curious what others think.

ncoghlan · October 29, 2024, 3:15am

Is it really an either/or, rather than a “Why not both?”

As you say, we’ll want the site global admin flag regardless (as part of PyPI’s regular rollout process for major changes), and the per-project flag means the admins will be able to turn it on for new projects, while leaving it disabled for existing projects that are repeatedly flirting with their quota limits.

Without the per-project flag, the feature couldn’t be turned on for any project until the quota management issues were resolved, which doesn’t seem like a desirable outcome.

woodruffw · October 29, 2024, 3:58am

That’s a good point! With that in mind, maybe a staged process makes the most sense?

1. First, all new projects created after a certain time have the deletion rules applied to them;
2. Then, all pre-existing projects below either a relative (50%?) or absolute (<2.5GB?) quota usage have the rules applied;
3. Finally, once a stable quota management solution is in place, all projects have it enabled (and all project-specific flagging goes away).

This has the benefit of an incremental rollout, although it makes it slightly harder to communicate the change to users (since different groups/projects will receive the change at different times). I suppose that’s not too different from what happened with MFA, though.
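To make the staging concrete, here is a minimal sketch of how the three stages might gate a project. Everything in it is hypothetical: the `Stage` enum, the `Project` type, the cutoff date, and the choice to apply the rules when a project is under either threshold are assumptions, and the 50% and 2.5GB figures are just the tentative numbers floated above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import IntEnum


class Stage(IntEnum):
    NEW_PROJECTS = 1           # only projects created after the cutoff
    BELOW_QUOTA_THRESHOLD = 2  # plus existing projects with low quota usage
    ALL_PROJECTS = 3           # every project, no per-project flagging


# Hypothetical values; the post above only suggests "50%?" and "<2.5GB?".
CUTOFF = datetime(2025, 1, 1, tzinfo=timezone.utc)
RELATIVE_THRESHOLD = 0.5
ABSOLUTE_THRESHOLD_BYTES = 2_500_000_000


@dataclass(frozen=True)
class Project:
    created_at: datetime
    used_bytes: int
    quota_bytes: int  # assumed to be positive


def deletion_limits_apply(project: Project, stage: Stage) -> bool:
    """Return True if the deletion rules cover this project at this stage."""
    if stage >= Stage.ALL_PROJECTS:
        return True
    if project.created_at >= CUTOFF:
        return True  # new projects are covered from the first stage onward
    if stage >= Stage.BELOW_QUOTA_THRESHOLD:
        # Assumption: being under either threshold is enough.
        below_relative = project.used_bytes / project.quota_bytes < RELATIVE_THRESHOLD
        below_absolute = project.used_bytes < ABSOLUTE_THRESHOLD_BYTES
        return below_relative or below_absolute
    return False
```

Under these assumptions, an older project using 9GB of a 10GB quota would stay exempt until the final stage, which is the group the per-project flag is meant to protect.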

pf_moore · October 29, 2024, 9:10am

I’m not a huge fan of accepting a PEP on the condition that some as yet unspecified, and potentially not even possible, piece of work is completed. If that’s necessary, why not simply hold off on submitting the PEP until the conditions for its implementation are met?

We’ve had a number of PEPs which have been accepted conditional on some factor, which have then ended up in limbo for an extended period. It’s never been a great situation, and I think we need to learn the lesson not to do that.

barry · October 29, 2024, 10:03am

I don’t think this quite gets us out of the box, because it could limit new, related (or refactored) projects with potentially big wheels.

If we can solve the “big wheel” problem in a workable way, then maybe a global no-deletion policy would work. I’ve proposed one possible solution.

woodruffw · October 29, 2024, 1:38pm

I should qualify: my read of the discussions around PEP 759 and other solutions to quota management is that there isn’t consensus among the PyPI admins that a technical solution is needed here. That’s in contrast to limiting deletions, where there does appear to be rough consensus that a technical solution is needed.

My thinking following that was that (1) PyPI has mature feature/admin-flagging support and can test admin-flagged behavior, and (2) these are conceptually independent problems, meaning that one shouldn’t block the other (even if, as a matter of policy, some kind of decision needs to be made around quotas before deletions can be restricted index-wide).

I agree this is a risk, although WDYT about the proposed rollout phases in PEP 763: Limiting deletions on PyPI - #5 by woodruffw? As-is that leaves the “final” stage until after some final decision about quotas, but it could also be rephrased to “all projects eventually have deletions restricted by default, unless included in the deletions-allowed list.”

That would re-frame the problem away from quotas entirely, meaning that this PEP’s completion would be disconnected from quotas.

That’s true. My thinking was that some new projects might hit that (going by Dustin’s number in the PEP 759 discussion, fewer than 0.01%), but that those projects could still go through the existing quota management controls to get their allocations bumped. And, worst case, the admins could always allow-list those projects for continued deletion.

(In other words: there’s still an admin/index operator in the loop doing triage, but I think the overall degree of human involvement needed will go strictly down: the ratio of projects that need quota/deletability changes will be unaffected, while the amount of accidental/undesired package deletion should go down.)

pf_moore · October 29, 2024, 2:44pm

I think that making deletions require admin approval would be just as much a burden as needing admin approval for a quota increase. So if admin bandwidth is the issue here, this won’t help at all.

Serious question - why can’t we simply remove quotas altogether? Or exclude yanked files from counting towards a project’s quota?

Basically, limiting deletions exacerbates the issue some projects have with quotas, so it’s hard to realistically evaluate the impact of limiting deletions if we don’t have a better understanding of why quotas exist, and what problems they cause. From the outside, all I can see is that some projects are limited in what they can upload by quotas, and getting quotas increased is problematic because it either takes too long or the requests get rejected. Limiting deletions removes the only resolution for this problem that those projects can use independently of permission from a 3rd party.

brettcannon · October 30, 2024, 8:37pm

12 posts were split to a new topic: Quota limits on PyPI

woodruffw · October 29, 2024, 3:40pm

I agree that it’s the same amount of burden w/r/t that demographic of users. But the overall admin involvement level should strictly decrease slightly, since fewer projects eligible for deletion means fewer accidental deletions that need to be undone.

(This is arguably a rounding error on overall admin time spent on triage, so it’s understandable if it’s not a strong motivating factor. But it’s worth noting IMO!)

Great question that I wish I had the answer to. Maybe @miketheman has some thoughts here?

Speculating: removing quotas entirely would probably make it too easy for people to abuse the index as a source of storage, and would also misalign the incentives around keeping packages small (IIUC, ballooning distribution sizes cause issues for pip as well as cache and mirror operators). But excluding yanked packages from the overall quota count seems like a reasonable middle ground.
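A minimal sketch of what that middle ground could look like, assuming a hypothetical `File` record and `quota_usage` helper (neither reflects how PyPI actually computes quota usage):

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class File:
    """Hypothetical record for an uploaded distribution file."""
    size_bytes: int
    yanked: bool = False


def quota_usage(files: Iterable[File], count_yanked: bool = True) -> int:
    """Sum file sizes, optionally leaving yanked files out of the total."""
    return sum(f.size_bytes for f in files if count_yanked or not f.yanked)


files = [File(400), File(300, yanked=True), File(200)]
print(quota_usage(files), quota_usage(files, count_yanked=False))
```

With `count_yanked=False`, yanking an old release would free quota without removing the file from the index, so yanking could take over the role that deletion plays today for projects near their limits.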

jeanas · October 29, 2024, 7:52pm

Sorry if this was already discussed, but: IIUC the current limits are per file and per project. Isn’t it conceivable to have a limit per period of time, which might help relieve projects with a long release history behind them while still keeping storage under control?

woodruffw · October 29, 2024, 8:15pm

I don’t think this was discussed! From my perspective that is technically feasible, but I don’t think it would limit the abuse vector much: as with the current system, someone who wants to abuse PyPI for storage could fan their uploads across several accounts. It’d also be hard to find a “sweet spot,” since package release activity is spiky (nothing for a while, and then a bunch of distributions for a release all at once). This can already happen due to quotas, but it wouldn’t be great if we added another failure mode where a critical bugfix release can fail to upload because the uploader has exhausted their current window’s upload quota.
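To illustrate the failure mode, here is a minimal sketch of a sliding-window upload quota. The 30-day window, the 1GB allowance, and the `upload_allowed` helper are all hypothetical numbers and names chosen for the example, not anything proposed in this thread.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

# Hypothetical policy values, chosen only for illustration.
WINDOW = timedelta(days=30)
WINDOW_QUOTA_BYTES = 1_000_000_000


def upload_allowed(
    history: List[Tuple[datetime, int]], size_bytes: int, now: datetime
) -> bool:
    """Check a new upload against the bytes already uploaded in the window.

    `history` holds (uploaded_at, size_bytes) pairs for earlier uploads.
    """
    recent = sum(size for ts, size in history if now - ts < WINDOW)
    return recent + size_bytes <= WINDOW_QUOTA_BYTES


release_day = datetime(2024, 10, 1, tzinfo=timezone.utc)
# A spiky release: ten 95MB wheels uploaded at once.
history = [(release_day, 95_000_000)] * 10
# A bugfix wheel one day later is rejected under these assumed limits.
print(upload_allowed(history, 95_000_000, now=release_day + timedelta(days=1)))
```

The same bugfix upload would succeed once the original release ages out of the window, which is exactly the kind of delay a critical fix cannot afford.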

I think lifting quotas for verified organizations is probably the most tractable and suitable solution to the quota problem: it removes a set of failure modes during upload, and would encompass the most common current set of needs for deletions.

LtWorf · October 29, 2024, 11:58pm

How is this compliant with GDPR and right to be forgotten?

Most projects will contain a name and email of the author at the very least… which is personal data.

Is this proposal legal?

jamestwebber · October 30, 2024, 12:26am

The administrators can still delete files upon request.

woodruffw · October 30, 2024, 3:17am

I don’t think I’m qualified to make a legal judgement, but I’ll answer from particularity: there’s evidence that the majority of packaging ecosystems don’t consider limiting user deletions to be a serious risk w/r/t the GDPR. So there’s abductive evidence that administrator-only deletions don’t pose a GDPR problem.

steve.dower · October 30, 2024, 8:38pm

Since the quota discussion got moved out, I’ll say again that I think this proposal is not acceptable as long as user-initiated deletion is the only way to immediately unblock quota limitations.

pf_moore · October 30, 2024, 8:50pm

Agreed. I don’t think the quota discussion should have been moved to another thread, TBH, as it’s essentially the main issue with this proposal.

To put it another way, I think this PEP needs to answer the question “how would a project that is having issues because of quota limits handle those issues if deletion is no longer available (or at least, also blocked on needing an admin action)?”

woodruffw · October 30, 2024, 9:13pm

Thanks both. I understand (and agree with) that position w/r/t the PEP. To take it back to a concrete proposed solution: assuming that PyPI begins approving organization accounts in the medium term (and that the majority of deletions-because-of-quota issues can be sidestepped by raising or eliminating quotas for verified orgs), do you two feel that that would be an adequate solution to the quota problem?

The reason I ask is because I’m trying to figure out how much language the PEP should dedicate to the problem of storage quotas, which is essentially an entirely PyPI-specific policy problem that might become mostly moot once verified orgs are being approved at a steady rate. If that’s the case, then my personal inclination here is to freeze this PEP until PyPI makes those changes and continue along with less emphasis placed on deletions, since the problem will be mostly ameliorated as a matter of policy.

If OTOH you don’t feel this would be an adequate solution, then I’ll work in additional language explaining the problem in depth and go back to the drawing board on tractable solutions.

steve.dower · October 30, 2024, 9:45pm

Verifying a contributor/org is basically what happens when an individual publisher makes a quota request, and historically that hasn’t been sufficient (at least in my experience, based on a couple of teams at $work who hit the quotas either per-file or per-project). I think getting turnaround time on requests down to under a day would be sufficient - as it stands, my $work teams can’t actually expect to delete files quicker than that anyway, as they’re slowly losing interactive login to PyPI and will have to go through another internal team for deletions.

But more fundamentally, I’m also not convinced that the damage done by a package disappearing (which can be easily mitigated by the consumer if they are concerned) is worse than the damage done by a package not being published (which cannot).