Stop Allowing deleting things from PyPI?

dstufft · July 12, 2022, 6:55pm

Thanks! I’ve never personally used a lot of these repositories, so researching this involved trying to find documentation one way or the other (which I could find for most of them, but not for LuaRocks).

I’ve updated my chart to mark LuaRocks as allowing deletion.

dstufft · July 12, 2022, 7:18pm

Other random musings:

Some of these other languages support “yank” on the project level, which effectively yanks every release, so that pip install foo stops working, but pip install foo==1.0 does not. The key difference from just yanking every release is that it also removes the project from the UI, search results, etc. (presumably it still shows up in the maintainer’s management UI).
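
To make the difference between an unpinned and a pinned install concrete, here is a minimal sketch of yank-aware selection in the spirit of PEP 592. The `Release` class and `pick_release` function are hypothetical names made up for illustration, not pip’s actual resolver, and the sketch assumes the releases are already sorted oldest to newest.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Release:
    version: str
    yanked: bool = False


def pick_release(releases: list[Release], pinned: Optional[str] = None) -> Optional[Release]:
    """Pick a release roughly the way a yank-aware installer would.

    Simplification: `releases` is assumed to be sorted oldest to newest,
    so no real version comparison happens here.
    """
    if pinned is not None:
        # An exact pin (foo==1.0) may still select a yanked release.
        for release in releases:
            if release.version == pinned:
                return release
        return None
    # An unpinned request (plain `foo`) skips yanked releases entirely.
    candidates = [r for r in releases if not r.yanked]
    return candidates[-1] if candidates else None


# A project-level yank modelled as "every release is yanked".
foo = [Release("0.9", yanked=True), Release("1.0", yanked=True)]
print(pick_release(foo))                # None
print(pick_release(foo, pinned="1.0"))  # Release(version='1.0', yanked=True)
```

The sketch only covers file selection; hiding the project from the UI and search results is a separate, repository-side concern.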

Another feature that some of these repositories have is the ability to mark a project as being deprecated or no longer supported. There’s an open issue for that at pypi/warehouse#345 ^[1].

I wonder if either (or both!) of these features sufficiently address the important use cases for deletion, without actually being deletion, or, if not, are there similar features that we could implement that would?

[1] An interesting question is how this would work in conjunction with a yanked release: would an installer be expected to output both warnings, or does one of the warnings supersede the other?

Lawouach · July 12, 2022, 7:23pm

cargo supports yanking a crate (which I take to be equivalent to our project) and therefore all its versions at once.

dstufft · July 12, 2022, 7:31pm

Yea, crate yanking is most closely mapped to PyPI’s yank support, not PyPI’s deletion ^[1], it was definitely one of the ones that inspired the yank at project level thought though!

[1] It doesn’t work exactly the same, because crate’s version of yank actually prevents new version pins from happening so that it only works when installed from a lock file. Which is a nicer property, but we don’t have any way to do that.

Lawouach · July 12, 2022, 7:47pm

Slightly off topic, but I wonder how Go manages this, considering projects are GitHub repositories that the ecosystem doesn’t control.

pradyunsg · July 12, 2022, 8:38pm

Deprecating projects on PyPI was also discussed in

hugovk · July 13, 2022, 8:04am

To get download stats without setting up BigQuery:

ChrisBarker-NOAA · July 13, 2022, 10:48pm

One perspective on how to think about these issues:

When a user decides to rely on a package on PyPi, are they making the choice to rely on PyPi, or are they making a choice to rely on the package maintainer(s)?

I would argue that whether they realize it or not, they are relying on the maintainer, not PyPi – after all PyPi does not control who can upload what, and certainly does not control at all the quality of the package, or any future maintenance, etc.

Any of us using open-source software should be careful about relying on any given third-party package – evaluating its quality, suitability, and likelihood of being maintained adequately – that needs to be done regardless of PyPi’s policy, and PyPi has no control over most of that.

PyPi is a distribution system; I’m not at all sure it should be providing anything beyond distributing packages.

Side note: Is there a policy that only Open Source software can be distributed on PyPi?

dstufft · July 13, 2022, 11:10pm

No.

PyPI has no requirement on what license software distributed on PyPI has, other than the requirement to allow redistribution as specified in the terms of use:

If I upload Content covered by a royalty-free license included with such Content, giving the PSF the right to copy and redistribute such Content unmodified on PyPI as I have uploaded it, with no further action required by the PSF (an “Included License”), I represent and warrant that the uploaded Content meets all the requirements necessary for free redistribution by the PSF and any mirroring facility, public or private, under the Included License.

If I upload Content other than under an Included License, then I grant the PSF and all other users of the web site an irrevocable, worldwide, royalty-free, nonexclusive license to reproduce, distribute, transmit, display, perform, and publish the Content, including in digital form.

Or roughly, either the included license must give PyPI and PyPI’s users an irrevocable right to (re)distribute unmodified copies of the software without royalty or other requirements OR you must grant such a license as part of uploading to PyPI.

VanL has previously remarked that importantly, this does not imply a right to use the software, or modify it, or do anything with it but distribute it, unmodified.

In practice of course, most, if not all, of the software on PyPI falls under the first category, because it’s distributed using some kind of FLOSS license, but it’s not a specific requirement that it is.

dstufft · July 14, 2022, 1:27am

I’ve been thinking hard about this, because I think it’s an important question, but I also think that there are a lot of hidden nuances to it, and ultimately it boils down to balancing between maintainers and users.

Deletion from PyPI doesn’t cleanly fit into a form of striking and protest in the real world, so it’s hard to fit it into an existing box.

I don’t believe that it is exactly like a doctor removing sutures or a Habitat for Humanity builder removing bricks or something like that, because that is taking your labor away from the individuals, typically long after capital has already profited off of it.

The closest parallel I can draw is that deleting files off of PyPI is similar to breaking into a store or warehouse and taking a product of your labor off the shelves to prevent capital from further profiting off of it until they meet the demands of labor. In the real world, this isn’t really a required thing, because simply by not working, the flow of new items is halted, and those shelves will ultimately empty themselves as things get used or sold, and capital is forced to come back to labor to ask them to produce more of it.

That doesn’t really fit in software though, if you stop producing new versions of your software, the existing versions are all still there and can be reproduced infinitely. Granted any problems with that software will remain unfixed, and it would slowly fall behind (but for a lot of software, that isn’t a major concern), but ultimately capital can continue to create infinite copies of the product of your labor without your input, so the power of just stopping work is a lot more limited than in the real world.

Ultimately, I think that as it stands today, deleting things from PyPI is a poor mechanism for implementing a strike, and while it’s something we should consider, I don’t think that it should be a primary concern.

To expand on my reasons why I think that it is a poor mechanism:

The first problem I see with it, is that it primarily affects the people with the least amount of power, the individuals using PyPI, not the companies or other large organizations. Due to the nature of what we’re doing here, the effects of striking through deletion on PyPI are trivially prevented by running your own mirror of PyPI with deletions turned off, something that basically every mirroring solution for PyPI makes easy. In my experience, companies and large organizations rarely even bother to figure out the reason why a package went missing, they just learn that they can go missing, and make plans to prevent that from affecting them.

While setting up a mirror of PyPI is a relatively simple task for any company or large organization, expecting every individual user of PyPI to do the same is not. Thus, this form of strike primarily doesn’t strike back against capital, but rather at other individuals within the ecosystem, and where it does strike against capital, IME their lesson is setting up fairly easy mechanisms to prevent themselves from being affected in the future, rather than weighing it as part of a larger negotiation.

Importantly, once they’ve done this once, they’re insulated forever from these kinds of strikes by anyone on PyPI, meaning that if this method of strike had much uptake, quickly even the companies who don’t already have a mirror would converge on setting one up.

The second problem I see with it is that, due to the immutable nature of files on PyPI, deleting your files from PyPI is not like a strike in that it is a temporary stoppage of work until demands are met; its closer parallel is burning down the factory. Even if your demands ultimately are met, you can’t put those files back, you can only release new versions, and that will require the entire ecosystem to cope with the new version numbers. In our parallel, not only have you burned down the factory, but you’ve also salted the earth where it stood and the next factory will need to be built in a new location.

This same situation doesn’t exist on the author’s personal websites or on GitHub, since once your strike is over you’re able to republish bit for bit copies of the previous content.

The third problem I see with this is that, in my experience, when these kinds of things happen, the general thrust of the discussion ends up being around the fact that the repository allowed that to happen, or scolding the people affected for not setting up mirroring, or general grumping around the state of the toolchain (if we all just vendored like X, then this couldn’t happen, or whatever). Whatever the underlying protest or demand ultimately was typically gets lost quickly in the noise of discourse around the mechanism of the protest.

So, while I recognize that this is a form of strike available to maintainers today, and it is often the most visible strike option they have, I think that it’s ultimately a poor option for it. Unfortunately, I don’t have much of an idea of a better way of striking, but personally the above issues are enough that I don’t feel like we’re actually taking a particularly effective tool away from maintainers if we disallow deletions.

dstufft · July 14, 2022, 3:23am

I want to throw out an idea for a proposal that would restrict all forms of deletions from PyPI.

I’m not entirely sure how I feel about this yet, but I was already feeling somewhat sketchy about restricting only project deletions as a weird sort of half measure, and looking at what other repositories are doing has me feeling even more worried that a half measure might end up being the worst of both worlds, rather than a solid compromise.

Please take this proposal with an appropriately sized grain of sand, particularly where the numbers are concerned. I’m just pulling some numbers out of thin air for now and trying to put pen to paper on what I think a reasonable policy that moved to restrict all kinds of deletions would be, to see how people (including myself) feel about it.

This policy and related proposed features are largely modeled after taking the parts I liked from what I found in all of the other repositories, combined with the concerns and use cases people have raised in here. I’m going to lump releases and files together, as I think that there isn’t a useful distinction between them for this discussion.

The policy would then be:

1. Projects may not be deleted if they have any releases/files associated with them ^[1].
2. Releases/Files that are a PEP 440 pre-release may be deleted at any time without restriction.
3. Releases/Files may be deleted within the first 72 hours of release without any other restriction ^[2].
4. Releases/Files that are older than 72 hours, and that have been downloaded by a known installer (e.g., not mirroring tools or browsers) fewer than 1,000 ^[3] times in the past month, may be deleted without any other restriction ^[2].
5. Otherwise, deletions are not available without contacting the PyPI admins ^[4].
6. Any deletions that are allowed follow the same caveats as they do today:
   - Irrevocable
   - Projects are released back to the overall pool to be registered by anyone
   - Files, once deleted, may never be re-uploaded.
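
The rules above can be read as a small decision function. The following is only a sketch of that reading, not anything Warehouse implements: `FileInfo`, `may_self_delete`, and `may_delete_project` are hypothetical names, the pre-release flag is assumed to be computed elsewhere according to PEP 440, and the 72 hour and 1,000 download figures are the placeholder numbers from the proposal.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(hours=72)  # placeholder from the proposal
DOWNLOAD_THRESHOLD = 1000           # placeholder from the proposal


@dataclass
class FileInfo:
    uploaded_at: datetime
    is_prerelease: bool                  # per PEP 440, assumed precomputed
    installer_downloads_last_month: int  # known installers only


def may_self_delete(f: FileInfo, now: datetime) -> bool:
    """True if the maintainer could delete the file without admin help."""
    if f.is_prerelease:
        return True  # pre-releases: deletable at any time
    if now - f.uploaded_at <= GRACE_PERIOD:
        return True  # still inside the first 72 hours
    if f.installer_downloads_last_month < DOWNLOAD_THRESHOLD:
        return True  # old, but barely downloaded
    return False     # otherwise: contact the PyPI admins


def may_delete_project(files: list[FileInfo]) -> bool:
    """A project is deletable only once it has no releases/files left."""
    return not files


now = datetime(2022, 7, 14, tzinfo=timezone.utc)
popular = FileInfo(now - timedelta(days=30), False, 50_000)
print(may_self_delete(popular, now))  # False
```

Note that under this reading a popular project can never be deleted by its maintainer, because its popular files block rule 1.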

To support some of the use cases for deletion today, we would also add some form of the following features:

- The ability to “yank” at the project level, which would mark all files as yanked in the simple API, and would remove the project from the web UI ^[5], search results, JSON API, etc.
- The ability to apply a “notice”, at least at the project level, to provide some feedback to people finding this project in the UI or installing it ^[6].
  - Alternatively (or maybe in addition to) we can provide a way to “archive” a project, similar to what GitHub does, that marks the project read only. Maybe this would just be rolled into the “notice” feature, or an option when you apply a notice, or maybe it would be its own distinct thing.
- The ability to register a name on PyPI, without uploading a release/file to go along with it ^[7].
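
As a rough illustration of what “mark all files as yanked in the simple API” could mean, here is a sketch that operates on a project page shaped like the PEP 691 JSON form of the simple API, where each file may carry a `yanked` key that is either a boolean or a reason string. The `yank_project` helper and the example URL are made up for illustration; how PyPI would actually store or serve a project-level yank is not specified in this thread.

```python
import copy


def yank_project(index: dict, reason: str = "") -> dict:
    """Return a copy of a simple-API project page with every file yanked."""
    yanked = copy.deepcopy(index)
    for file in yanked.get("files", []):
        # Keep an existing per-file yank (and its reason) untouched.
        if not file.get("yanked"):
            file["yanked"] = reason or True
    return yanked


index = {
    "meta": {"api-version": "1.0"},
    "name": "foo",
    "files": [
        {"filename": "foo-0.9.tar.gz", "url": "https://example.invalid/foo-0.9.tar.gz",
         "hashes": {}, "yanked": "broken build"},
        {"filename": "foo-1.0.tar.gz", "url": "https://example.invalid/foo-1.0.tar.gz",
         "hashes": {}},
    ],
}

page = yank_project(index, reason="project yanked by maintainer")
for file in page["files"]:
    print(file["filename"], "->", file["yanked"])
```

Because the original page is left unmodified, un-yanking the project in this sketch is just a matter of serving the original page again.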

These things I don’t think are blockers, but I think it might be good to think about them:

- Re-evaluate whether the quota system is actually giving us the results we wanted when we first put it in place. I see that tensorflow nightlies are still a massive chunk of storage, which suggests it might not be, but it’s hard to say what PyPI would look like without it and I haven’t been involved in those requests lately, so I lack context to say for sure ^[8].
- Consider whether there are any situations where we want to automatically garbage collect uploaded files. In particular, I’m wondering if it would make sense to automatically reap old development releases, likely with some stipulations, as I think having a historical record of every development release is less useful than every final release.

If we were to restrict deletions like the above (which again, is just me thinking through what it might look like if we did), I think that the above might represent a pretty reasonable balance? It would be hard to argue that this is out of line with people’s expectations, since the majority of other language repositories have something resembling the above and PyPI is one of the few that allow unrestricted deletion at all ^[9].

It is obviously a restriction on what maintainers are able to do today, but I think that I’ve carved out exemptions that generally match what most people would agree are the situations when it’s “safe” to delete a file and the rare edge cases that aren’t covered by that, we still have admin intervention available to us.

Just to go back through the thread and match the proposal to concerns people had, or situations they brought up where they felt deletion was justifiable, I see the following:

- Rules apply universally ^[10], and don’t attach any labels to projects, which avoids PyPI making any sort of statement, real or otherwise, about the projects.
- A “bad” ^[11] release is still generally able to be removed if it’s discovered quickly, tempered by the fact that if you haven’t discovered it quickly, then we’re prioritizing artifact stability unless the problem is serious enough to warrant admin intervention.
- “Cruft” or placeholder packages are still able to be removed, tempered by the fact that if people are actually using or downloading this “cruft”, then it might not actually be cruft as the author assumed, as someone is obviously using it for something.
- Development/Nightly releases are still able to be removed, as their version numbers should communicate to end users that those files are not stable things to depend on ^[12].
- We don’t handle edge cases like PyTorch’s really old sdist, but I think that’s fine since PEP 592 should have handled that (and pip 22.0+ fixed that), and even if it didn’t, that feels like a sufficiently weird edge case that admins could handle it.
- Strikes can still be implemented by using the notice feature and/or the yank project feature. It’s still trivially able to be worked around using ==, but as mentioned above, the current situation is also trivially routed around, and since you can unyank/unnotice, cleaning up after the strike is over is a much smoother process.
- Quotas are still possibly a problem, but that’s a service availability concern, not an API / feature concern for PyPI, so that will require the PyPI admins to talk and figure things out.

Anyways, that’s what I would envision implementing if we brought PyPI in line with the bulk of the other language repositories, and restricted deletions to provide better stability to the ecosystem.

I’d be interested to hear if people feel really strongly one way or the other about the above. I’m not sure how I feel about it yet, my instinct is that I think it would be a positive change and it does a good job balancing things, but I haven’t had enough time to roll it around in my head to decide if that instinct is right or not.

[1] Maybe it would be useful to allow full project deletion if the project itself is less than some number of days old. Like say if the project is less than a week old, then you can delete the project and all files associated with it.
[2] Once we have reliable dependency metadata, it may be a good idea to further restrict this to say that deletions would also be prevented if anything in PyPI depends on this and isn’t satisfied by some other available version.
[3] This is pulled completely out of thin air; if we went this direction, we’d want to do some information gathering to see what various cut offs would allow to be deleted from PyPI.
[4] We’d maybe want some specific policy on when PyPI admins would do a deletion, or at least general guidelines. Obviously, content that wasn’t legally distributed, malware, placeholder projects, etc. would likely fall under those guidelines.
[5] Obviously it would still show up in the maintainer’s UI, to allow them to un-yank the project in the future.
[6] This might be restricted to just deprecation notices? Or maybe it would be best to leave it open ended for authors to put any kind of notice they want. Or maybe we’d have some notice categories, but not fully anything goes. In any case, this isn’t meant to be a fully fleshed out design, just a rough idea.
[7] In the past we were hesitant to do this because we were worried about making it too easy to squat names, but with PEP 541 I think that we have a reasonable process for dealing with that now. There’s also an open issue (pypi/warehouse#11296) that asks for this for other reasons.
[8] Certainly, the quota system as it exists today represents a non-zero amount of effort for both the PyPI team (who have to handle quota requests) and maintainers (who have to either ask for more quota or find ways to work around the quota).
[9] In fact, the author who deleted atomicwrites and spurred this whole discussion to happen right now assumed that PyPI was similar to these other language repositories, and that deletion didn’t mean that previous artifacts would be removed.
[10] We are using download counts to decide if a file can be deleted outside of the specific outlined scenarios, which is slightly not universal. However, that trade off exists to allow “cruft” and placeholder packages to still get deleted without admin intervention.
[11] Completely broken, has credentials leaked inside of it, contains files that aren’t legal to distribute, whatever reason authors might have.
[12] We maybe don’t need this rule, as the way PEP 440 recommends installers to work, pre-releases are excluded by default and may just fall under the 1000/mo download threshold naturally. On the other hand, it may be worth adding this explicitly anyways just to make it clearer, and it’s possible the 1000/mo download threshold is a bad idea on its own anyways, or maybe we would decide the 1000/mo threshold is a one-way switch and once a file crosses it, it can’t be deleted.

ofek · July 14, 2022, 3:33am

All of that sounds incredibly sane to me; I’m in favor!

dstufft · July 14, 2022, 3:42am

I forgot to mention that, regardless of what direction we go in, if we changed the deletion rules from what they are today, it’s a sufficiently big change to how PyPI works that I would want to write a PEP akin to PEP 541 that outlines whatever policy we end up with.

I think it’s useful to have a record outside of a thread here on why a change was made, a central place to document what that policy actually is, and so that it gets additional visibility beyond just a thread here.

So, if anyone is worried about one of these changes, please speak up! But also, consider this thread more like a brainstorming session rather than actually making a proposal right now.

rgommers · July 14, 2022, 8:26am

This sounds great to me, thanks @dstufft. You addressed several of my concerns explicitly, and the deletions I most regularly need (pre-releases) are still possible. Anything else is still possible via a support ticket. The response times on those are a tricky story, but that should not determine the policy for something so important.

This would be helpful as a separate follow-up action/discussion.

pf_moore · July 14, 2022, 8:37am

Thank you. All of that sounds very sensible (as you say, there are details to work out, but the principles and ideas are great).

In particular thank you for the amount of thought you’ve put into thinking about and addressing the concerns raised. It’s not easy to review, much less address, comments in a controversial topic like this, and I for one appreciate the time you spent doing so.

graingert · July 14, 2022, 9:10am

I particularly like this part of the proposal

Projects may not be deleted if they have any releases/files associated with them

Because it removes a convenience around deletion without actually changing the policy. Are there any other ways to remove convenience or introduce friction into the deletion process, e.g., remove deletion from the HTML front end and require users to issue requests via the API instead?

remram44 · July 14, 2022, 3:18pm

A bit late to the party, but I want to add that files are never deleted from files.pythonhosted.org anyway. Any deleted release can be found via the BigQuery log and installed from its URL (here is numpy 1.11.2rc1 from 2016, “removed” from PyPI). There are also third parties like Software Heritage that work specifically on keeping old packages around for academic reproducibility purposes. In fact it might be dangerous to let users think that their file has been deleted.

Allowing deletion of releases from PyPI doesn’t seem to do anything meaningful right now, apart from a mild annoyance for users (especially bad when it’s triggered by mistake by authors who don’t know what “yank” is for).

dstufft · July 14, 2022, 3:25pm

In talking to people outside of this thread, another concern has come up that I’m not entirely sure how to handle yet, but I want to mention it to (1) say that I am thinking about it, and (2) throw it out there for others to brainstorm too.

The general thrust of the concern is that if an author wanted to distance themselves from a project, or from PyPI itself, they’re currently able to do that on PyPI by deleting the project and then deleting their account on PyPI.

Currently PyPI allows you to delete your account on PyPI only as long as you are not the sole owner (not maintainer, owner) of any projects on PyPI, encouraging you to either hand those projects off to other people or delete them as makes sense.
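
The sole-owner rule described above can be sketched as a simple check. This is an illustration only, with hypothetical names (`blocking_projects`, `can_delete_account`) and a plain dictionary standing in for PyPI’s role data; it is not how Warehouse actually implements account deletion.

```python
def blocking_projects(username: str, owners_by_project: dict[str, set[str]]) -> list[str]:
    """Projects where `username` is the only owner (maintainers don't count)."""
    return sorted(
        name for name, owners in owners_by_project.items() if owners == {username}
    )


def can_delete_account(username: str, owners_by_project: dict[str, set[str]]) -> bool:
    """An account is deletable only if it is not the sole owner of anything."""
    return not blocking_projects(username, owners_by_project)


owners = {
    "shared-lib": {"alice", "bob"},  # co-owned: does not block alice
    "solo-tool": {"alice"},          # alice is the sole owner: blocks deletion
    "other": {"carol"},              # unrelated to alice
}
print(blocking_projects("alice", owners))  # ['solo-tool']
print(can_delete_account("bob", owners))   # True
```

Under the proposals in this thread, a project like `solo-tool` that can no longer be deleted would keep alice’s account pinned until someone else takes ownership.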

Obviously with any of the proposals here, authors are going to be able to fall into situations where they cannot delete their account because there is nobody available who wants to take over any projects that are not able to be deleted by any of the above rules, which means they’re forced to remain associated with that project and/or PyPI ^[1].

Now to some extent this is unavoidable. PyPI is constrained such that it cannot mutate the artifacts on PyPI itself, so any metadata inside of that artifact that links the developer to that project cannot be removed or changed in any way. Additionally, the metadata that PyPI stores in the database that logically is sourced from inside of the artifact needs ^[2] to match the data that is inside of the artifact itself, which limits our ability to change that metadata.

However, I do think that there is an underlying issue here outside of the metadata itself, and we should give some thought to it and to how we handle it.

Off the top of my head, I see a few possible options:

1. Do nothing, as there are at least three possible workarounds:
   - Users on PyPI can change their name, email, avatar, everything except for their actual username, allowing them to replace it with some form of anonymization.
   - They could create a new user account with an anonymous username, transfer all of their projects to that, and remove their “real” account from those projects, which would then allow them to delete themselves.
   - From an end user point of view, project level yank is pretty close to deletion since it removes the project from all user facing content, and the simple API is basically just a list of files with some related metadata (hashes, python-requires, etc). That doesn’t let the person get rid of their PyPI account, but if they just want to distance themselves from a project, it’s pretty close.
2. Add the ability to change your username on PyPI, but otherwise do nothing ^[3].
3. Allow users to hide their association with a project in the public UI/API ^[4].
4. Enable some way for a user to “abandon” a project, removing it from their account but not deleting it ^[5]^[6].

Of those, all 4 options allow a user some level of distancing themselves from a project, but only option (4) provides a user the ability to completely walk away from PyPI and sever all connection and remove all accounts they have ^[7].

[1] Interestingly enough, the more restrictive policy actually handles this better: since the “half measure” policy doesn’t allow deleting projects at all, you’re forced to hand off even nonsense packages like dstufft.testpkg2 unless they’re brand new. The more restrictive policy only requires you to hand off packages that have more than some number of downloads (1000/mo or whatever number would actually get settled on).
[2] For a lot of historical reasons, this isn’t enforced today, nor has it ever been. The direction we’ve been moving though has been to narrow the gaps where this isn’t the case, in the hopes of eliminating them for new things on PyPI, but it’s unlikely that will ever be something we can apply retroactively.
[3] This is basically just accepting the first option as the official way of doing this and closing the gap on the last piece of a user account that you can’t change.
[4] This would be similar to how GitHub allows users to choose whether or not their membership in an organization is public. It would still show up on the management UI for that project, but not in public facing pages. Still obviously doesn’t allow people to step away from PyPI itself, but lets them hide their association with a project.
[5] This would effectively mean that an abandoned project is forever read only unless we allow some form of PEP 541 take over to pick up maintenance on a project. Implementation wise, we could either just remove all roles from the project or create a PyPI admin-controlled account to act as the holder of abandoned projects.
[6] One could even imagine fleshing this feature out more, providing some sort of match making service that allows people to match up with abandoned projects that need maintenance. Of course, there’s still the underlying problem of determining who to trust to hand an abandoned project off to, but that would be a problem for someone actually proposing that hypothetical match making service to solve.
[7] Of course, there is still the implicit association inside of the artifacts themselves and/or inside of the metadata that came from those artifacts, but as mentioned earlier that is unavoidable due to technical constraints.

dstufft · July 14, 2022, 7:05pm

6 posts were merged into an existing topic: Amending PEP 427 (and PEP 625) on package normalization rules

dstufft · July 14, 2022, 7:06pm

This is true today (and has been for a long time), but I want to stress that this is currently an implementation detail of PyPI and is subject to change at any time, so anyone relying on this is inherently relying on something that can change without warning.

Granted, it’s unlikely that we change that behavior, it has useful properties for us and the PyPI team has used those files on and off for a variety of random one-off reasons. I just wanted to be clear that it’s not a guaranteed property of PyPI currently like any of these proposals would be.

Edit: Reposted, since the original post was moved with the normalization discussion.