My colleagues and I have been given some time to write and implement a PEP based on the consensus in that thread. In particular, I’m planning on starting a PEP based on the policy that @dstufft summarized in Stop Allowing deleting things from PyPI? - #71 by dstufft, which I’m copying verbatim below:
Projects may not be deleted if they have any releases/files associated with them.
Releases/Files that are a PEP 440 pre-release may be deleted at any time without restriction.
Releases/Files may be deleted within the first 72 hours of release without any other restriction.
Releases/Files that are older than 72 hours, that have been downloaded by a known installer (e.g., not mirroring tools or browsers) less than 1,000 times in the past month, may be deleted without any other restriction.
Otherwise, deletions are not available without contacting the PyPI admins.
Any deletions that are allowed, follow the same caveats as they do today:
Irrevocable
Projects are released back to the overall pool to be registered by anyone
Files, once deleted, may never be re-uploaded.
(The original comment has associated considerations, such as project-level yanking, which I’ll also make sure to include. It also has others, e.g. around pre-registration, that may now be stale thanks to Trusted Publishing w/ pending publishers.)
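To make the quoted rules easier to argue about, here is a minimal sketch of them as a decision function. It is only an illustration of the policy text above, not anything PyPI implements: the `ReleaseFile` type, its field names, and both function names are hypothetical, and the PEP 440 pre-release check is assumed to have been done elsewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(hours=72)   # "within the first 72 hours of release"
DOWNLOAD_THRESHOLD = 1_000           # known-installer downloads in the past month


@dataclass
class ReleaseFile:
    """Hypothetical record; field names are assumptions for illustration."""
    uploaded_at: datetime
    is_prerelease: bool                    # PEP 440 pre-release, computed elsewhere
    installer_downloads_last_month: int    # excludes mirrors and browsers


def may_self_delete(f: ReleaseFile, now: datetime) -> bool:
    """Return True if the owner may delete without contacting the admins."""
    if f.is_prerelease:
        return True
    if now - f.uploaded_at < GRACE_PERIOD:
        return True
    if f.installer_downloads_last_month < DOWNLOAD_THRESHOLD:
        return True
    return False


def may_delete_project(file_count: int) -> bool:
    """Projects may only be deleted once they have no releases/files."""
    return file_count == 0
```

Anything for which `may_self_delete` returns `False` would fall through to the "contact the PyPI admins" path.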
I’m opening this for pre-PEP discussion, to avoid necro-bumping the old pre-pre-PEP discussion thread. In particular, I’m curious whether people’s opinions have changed in the last ~1.5 years on the best approach to limiting deletions and, if so, I’d love to hear those opinions before I go into full swing on the PEP draft.
Can this one be handled automatically? If it requires manually checking the download count, it probably falls under the following point (“contacting the PyPI admins”) anyway.
Other than that, my only concern is in deleting old versions for the sake of staying under the size quota. I have more interest in preserving that ability than any other concerns about deletion - we ($work) face more problems in failing to publish due to exceeding quota than due to deletions.
I would like it if there were an option to deduplicate and redirect users to some other package. I created a somewhat rash emergency preservation project, nntplib, when the module was threatened with removal from the standard library. However, I really don’t have enough time to spend on yet another project (and not enough experience with mocking, which would be sorely needed, IMHO), so I was glad to discover pynntp, and I would like to redirect any of my users (if I have any, which I doubt) to this other project. In the end I just removed it and kept it only in my private git repository, but it would be much better if there were some SOP for such deduplication.
I think so, insofar as PyPI should have access to its own linehaul/BigQuery statistics. But that one stood out to me as the constraint/policy control that I’m least confident about actually implementing, in part because I don’t know how slow (or expensive) it’d be to periodically query every new package.
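For a sense of what such a check might look like, here is a rough sketch that only builds the SQL text for a per-file download count. It assumes the public `bigquery-public-data.pypi.file_downloads` table is representative of what PyPI itself would query; the `KNOWN_INSTALLERS` allowlist and the function name are assumptions made up for this example, and nothing here says how the count would actually be computed in production.

```python
# Hypothetical allowlist: which installer names count as "known installers"
# is a policy decision that the thread leaves open.
KNOWN_INSTALLERS = ("pip", "uv", "poetry", "pdm")


def build_download_count_query(days: int = 30) -> str:
    """Build a BigQuery SQL string counting installer downloads of one file.

    The file name is left as the named query parameter @filename so that it
    can be bound safely by whichever client runs the query.
    """
    installers = ", ".join(f"'{name}'" for name in KNOWN_INSTALLERS)
    return f"""
SELECT COUNT(*) AS downloads
FROM `bigquery-public-data.pypi.file_downloads`
WHERE file.filename = @filename
  AND details.installer.name IN ({installers})
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {int(days)} DAY)
""".strip()
```

The `timestamp` filter matters for the cost concern: restricting the time range limits how much of the table is scanned, but running one such query per file would still add up, which is the part I’m unsure about.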
Thanks for raising this – I wasn’t sure if it belonged in this (planned) PEP or not, but I’ve been thinking about a long-term solution to the quota problem on PyPI as a logistical blocker to actually rolling out restrictions on deletion.
In other words: my approach here was going to be (1) write a deletions PEP, (2) implement that PEP behind an admin flag, so that it has no effect until the larger quota problem is solved. But perhaps that order is backwards, and the PEP should explicitly discuss resolving the quota problem? Curious what you think!
I agree! This is another thing I wasn’t sure about including in the PEP, or leaving for a later point: my colleagues and I have also been given some time to work on a “project status” reporting mechanism for PyPI, which would (in principle) include the ability for maintainers to say “sampleproject is deprecated and abandoned, we recommend you consider sampleproject-ng” within structured metadata.
That alone has several degrees of freedom, so I didn’t want to make the PEP into “address everything that’s currently non-ideal about package lifecycles.” But like with the quota stuff, I’m happy to adjust the PEP’s scope to the community’s expectations, rather than bring things up piecemeal.
How will this interact with the per-project total file upload limits? Right now, several projects periodically delete older unsupported releases due to those limits.
At the very least can we have some sort of record of deletions that do happen? I recently had a project that had something in its requirements that didn’t seem to exist anymore. The name was just gone from pypi. No trail, no history, etc.
I happened to have a local copy so was able to piece together what I needed but optimally I would have preferred it not disappeared to begin with.
I’m generally +1 on the proposal, which I interpret as being more about limiting project deletion requests from project owners than any kind of system-wide culling of existing empty packages. Do you have any statistics on project deletion requests today that would fall under these guidelines?
As for culling existing empty projects, is that something you’re considering either as part of this proposal or for later (based on these criteria)?
I’ll be posting a PEP soon (PR is in review) that may help with this, not by directly addressing the size quotas, but by giving you another option to avoid them.
I believe the current thinking there is that the organizations feature will enable better quota management for both corporate and non-profit organizations that regularly run up against the total upload limits.
That’s a somewhat indirect answer though, which ties in with the deployment blocker I mentioned in Pre-PEP: Limiting deletions on PyPI - #4 by woodruffw – my plan is to propose a deletion restriction policy and implement it, but gate that policy behind an admin flag until organization quota controls are completed.
(I believe @miketheman might have some more context/thoughts there.)
I might be misunderstanding, but do you mean something other than the current project journal? When a project/release/file is deleted, PyPI currently emits remove project, etc. events to the journal, which external users can query using the APIs listed under Mirroring Support.
(Those APIs are part of the legacy XML-RPC set, but IIUC are considered fully supported since there’s currently no other journal retrieval mechanism. When another mechanism becomes available it may be deprecated and removed, but that new mechanism will also support detecting deletions.)
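As a rough illustration of finding deletions in the journal, the sketch below filters changelog entries for removal events. It assumes entries come back as `(name, version, timestamp, action, serial)` rows from the `changelog_since_serial` XML-RPC method, and that removal actions start with the word "remove" (as in "remove project" above); the `JournalEntry` type and both helper names are made up for this example.

```python
import xmlrpc.client
from typing import Iterable, List, NamedTuple, Optional


class JournalEntry(NamedTuple):
    """Assumed shape of one changelog row."""
    name: str
    version: Optional[str]
    timestamp: int
    action: str
    serial: int


def removals(rows: Iterable[Iterable]) -> List[JournalEntry]:
    """Keep only journal rows whose action looks like a deletion."""
    entries = (JournalEntry(*row) for row in rows)
    return [e for e in entries if e.action.startswith("remove")]


def fetch_since(serial: int):
    """Fetch raw journal rows newer than `serial` (performs a network call)."""
    client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")
    return client.changelog_since_serial(serial)
```

A caller would do something like `removals(fetch_since(last_seen_serial))` and persist the highest serial it has seen between runs.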
Not on hand, but I can try and obtain those! I think I can get a high-level view through the journal records, but I may need to ask the admins to pull numbers as well.
In particular I think these stats are useful ones, but let me know if there’s others I should try and get:
Number of files/releases/projects deleted over the last month/3 months/year
Top X projects with the most file/release deletions
Top reasons for deletion (I believe these aren’t recorded anywhere, but can probably be inferred from the top X projects)
My inclination was to leave this out of the PEP, if only because I suspect there isn’t as much consensus on an appropriate policy for recycling/culling extant projects.
No, I mean that a package I used disappeared off of PyPI. I’m a regular user (by no means an expert on PyPI). I shouldn’t have to try to trace the journal of PyPI events to figure out what happened to a package I used to depend on.
I agree that deletion information should be more discoverable, but I think standardizing a new way to present deletions on PyPI is probably outside the scope of this PEP[1]. That arguably falls under a larger packaging index/API modernization PEP, a la PEP 503 and 691.
(This PEP will indirectly improve this, by reducing the number of deletions that actually occur.)
[1] It’s also thorny: PyPI could surface user-induced deletions in a new way, but there may be administrative/security/legal deletions that don’t necessarily belong in a public log, or whose volume may reduce the value of a log by drowning out “interesting” deletions (e.g. mass deletions of package spam).
Very excited to see work towards preventing a left-pad on PyPI. Thank you for starting this thread!
Not Steve, but I think this is critically important to cover in a PEP limiting file deletion. The PEP needs to make an argument about why the status quo should change, and it needs to address the implications of restricting deletion.
+1 to getting some statistics about current deletion patterns; it would be useful to get an idea of how often we expect tickets for file deletion going to PyPI admins.
I was also wondering: are there less restrictive schemes that would accomplish what you (well, really many of us) want? If we’re trying to prevent left-pad, perhaps we should prevent/limit project deletion, and rate limit release deletion to some quite long interval (e.g. one release per week or per month). That’s just a strawman, but I think it would prevent a lot of the harm we are trying to avoid while allowing deleting individual files (for project size limits), or a single release occasionally (say for legal reasons).
Agreed on both counts! I’ll make sure the PEP presents an argument for the change as well as the implications of restricting deletion; I just want to shy away from tying it specifically to a quota spec (in the interest of solving one problem at a time, and also in preserving the standard/index policy distinction).
My interpretation of the original 2022 thread was that there’s general (hand waving!) consensus that the index as an immutable store of package history[1] is considered independently desirable.
In other words, the goal is twofold: (1) preventing the next left-pad style incident, and (2) reducing the ability of any given party to introduce opacity/ambiguity/triage difficulty in the software supply chain (e.g. deleting a piece of suspected malware so that third parties can’t analyze it).
That’s a long-winded way of saying that I think rate-limits would mostly achieve (1), but leave (2) on the table. I think they’d also be more challenging to write generalized deletion policies around, e.g. someone wanting to delete every file in a large (12+ files) release, only to end up leaving it in a partial state due to rate-limiting.
[1] Modulo legal/administrative/etc. deletions, which are not covered by the restricted deletion policy here.
Would it be easier to merge this and say any release older than a month can’t be deleted by the user? Or simplify to a week? Why there is a difference by download count is probably my main question. Is the assumption that a release that isn’t downloaded that much could come from an experienced developer and thus need more time to undo a mistake?
Sure. I came to the original discussion when something a project I work
on depends on disappeared without notice - it wasn’t totally fatal, as it
was a CI dependency, not a project dependency, but it still broke our workflow.
Furious at the time… turns out the package used a web service the
submitter-org wasn’t going to keep providing, so you couldn’t win either
way: delete the package and break people; leave it there so people can
continue to pull it automatically and then fail anyway once the service
went away. Probably the abrupt removal was handled poorly. This is just
to show there’s never going to be a nice way to handle every case. I
tend to lean in the direction of “you can’t rewrite history, even if
historical versions are broken”, but that’s just one opinion.
I think it’d be a lot easier, yeah! From another pass on the 2022 thread it looks like there was less consensus on download counts being a good metric/boundary for allowing deletions on older releases/files anyways, so I’m tempted to unify them for the initial draft and leave out download counts.
(I don’t have a strong opinion on 72 hours vs. a week vs. a month – if someone feels very strongly that a shorter or longer deletion window is appropriate then I’d like to hear it, otherwise I’ll likely err towards uniformly shorter.)
The edge case there would be a very popular package (like, 1000s of downloads within a few days of release) getting deleted a few weeks later, potentially causing lots of confusion/disruption to its users. I imagine that the rare packages that are that popular would be very careful about deleting anything, in any case.
Hmm, good point – I think my inclination in cases like that is to suggest an even shorter deletion window (72h or maybe even 24-48h?), or to maybe additionally condition the deletion window on a project’s overall downloads (the thinking there being that any project that reaches, say, 10M downloads is probably going to cause a nontrivial amount of disruption if any release/file is deleted instead of yanked).
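One possible reading of that idea, sketched under stated assumptions: a single deletion window with no per-file download counts, shortened for projects above an overall download threshold. The 72h and 24h windows and the 10M threshold are just the example numbers floated in this thread, and the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_WINDOW = timedelta(hours=72)   # example value from this thread
POPULAR_WINDOW = timedelta(hours=24)   # example value from this thread
POPULAR_THRESHOLD = 10_000_000         # overall project downloads


def deletion_deadline(uploaded_at: datetime, project_total_downloads: int) -> datetime:
    """Return the moment after which self-service deletion is closed."""
    if project_total_downloads >= POPULAR_THRESHOLD:
        return uploaded_at + POPULAR_WINDOW
    return uploaded_at + DEFAULT_WINDOW


def within_deletion_window(
    uploaded_at: datetime, project_total_downloads: int, now: datetime
) -> bool:
    """True while the owner may still delete instead of yanking."""
    return now < deletion_deadline(uploaded_at, project_total_downloads)
```

Compared with the per-file monthly download check, this only needs a project-level total, which could be refreshed infrequently rather than queried per upload.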
I don’t know if it’s a good point; I’m just trying to reason about what packages would be affected by going from those two conditions to one combined condition.
It seems like the desirable features are:
it would be useful for people to be able to delete stuff that gets uploaded by mistake/prematurely
it should be hard to delete something that a lot of people are relying on–i.e. it should require an appeal to PyPI administrators to convince them it needs to happen
maybe some other considerations?
I think the original rules make sense for those two goals. I don’t know if the numbers make sense because I don’t know much about PyPI stats–how many packages meet the “1000 downloads in the last month” threshold? It might be that this exception should be limited to truly obscure/abandoned projects and so it’s more like “10 downloads in the past year”.