Restricting "open ended" releases on PyPI?

I’ve tried multiple times to rewrite MarkupSafe as abi3, both in C and in Rust. I’m not super comfortable in either of those languages, and for speed MarkupSafe relies on some Python string information that is not available through abi3. So I haven’t been successful. If anyone came along and contributed a Rust abi3 PR that was reasonably close in speed, I would be eager to accept it.

With the GitHub workflow I’ve set up, it’s approximately as easy to trigger a completely new version as it is to trigger a new wheel for an existing version. So I wouldn’t really mind. But it’s not necessary, since the new release wouldn’t represent any actual changes. It’s similar to my objection to keeping the Python version trove classifiers up to date: doing so requires a new version just to say “yep, still compatible”.

This would also probably require specifying tag precedence/ordering and calculation.

I think draft releases are a somewhat separate concern from this. The reason draft releases are useful is unrelated to the reason the flagged situation is problematic.

I don’t think so.

Speaking for pip, most of the cost around dependency resolution is in fetching metadata for a single package release, and not in fetching the listing of files from the index. It’s also not really feasible to have different answers for same-package-name in a single resolve since we hold all the parsed page information in memory IIRC.

I don’t expect that changing how many files are in a release changes much for resolvers practically, especially since file listings are updated whenever any new file is uploaded.

At best, it means that the HTTP cache is less likely to be invalidated but those have a short-enough TTL already IIRC.

How exactly does this work?

You can’t upload a distribution file with the same filename and, assuming that the package metadata has not been changed, ~all backends should be generating the exact same output file(name).

I genuinely don’t follow when this can be an issue, outside of one somewhat convoluted scenario (below) that I think is unlikely enough that we don’t need to care.

The only case I can think of where this can happen is a package moving from a pure-Python wheel to per-ABI wheels. Those are the sorts of situations where most projects also bump the major version, and if they’re authoring code that builds a Python extension, I’d expect such users to have more familiarity with packaging systems, making it less likely that they’d make this mistake.
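To make that scenario concrete, here is a minimal sketch of how an installer prefers the most specific compatible wheel, which is why a per-ABI wheel added later can change what an unhashed pin installs. The tag list and filenames are hypothetical, and the parsing ignores build tags and compressed tag sets; real installers derive a much longer tag list from the running interpreter.

```python
# Hypothetical, simplified model of wheel selection (not pip's real code).
SUPPORTED_TAGS = [  # most specific first; an assumed environment
    ("cp312", "cp312", "manylinux_2_17_x86_64"),
    ("cp312", "abi3", "manylinux_2_17_x86_64"),
    ("py3", "none", "any"),
]


def wheel_tag(filename):
    """Return (python, abi, platform) from a simple wheel filename."""
    stem = filename[: -len(".whl")]
    python, abi, platform = stem.split("-")[-3:]
    return python, abi, platform


def pick_wheel(filenames):
    """Pick the compatible wheel whose tag ranks highest, or None."""
    ranked = [
        (SUPPORTED_TAGS.index(wheel_tag(name)), name)
        for name in filenames
        if wheel_tag(name) in SUPPORTED_TAGS
    ]
    return min(ranked)[1] if ranked else None


original = ["foo-1.2.3-py3-none-any.whl"]
later = original + ["foo-1.2.3-cp312-cp312-manylinux_2_17_x86_64.whl"]

print(pick_wheel(original))  # foo-1.2.3-py3-none-any.whl
print(pick_wheel(later))     # foo-1.2.3-cp312-cp312-manylinux_2_17_x86_64.whl
```

The pin `foo==1.2.3` is identical in both cases; only the set of files behind it changed.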

And, to add a bit more context: pinning with hashes is the recommendation for when you want to be secure with pip (we don’t have secure defaults, which isn’t great but there is a documented approach for being fairly secure).
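As a rough illustration of why hash pinning covers this case, the sketch below checks a file’s SHA-256 digest against a set of pinned digests, which is conceptually what pip’s `--require-hashes` mode does. The helper names and the stand-in file contents are made up for the example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def is_pinned(data: bytes, allowed_hashes: set) -> bool:
    """True only if this exact file was recorded when the pin was made."""
    return sha256_hex(data) in allowed_hashes


# Stand-in bytes; a real lock file records digests of the actual files.
known_wheel = b"contents of the wheel that existed when the pin was made"
late_wheel = b"contents of a wheel uploaded to the same release later"

allowed = {sha256_hex(known_wheel)}

print(is_pinned(known_wheel, allowed))  # True
print(is_pinned(late_wheel, allowed))   # False
```

A wheel added to the release after the hashes were recorded fails the check, even though its version number matches the pin.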

TBH, this is IMO the main benefit of doing this: it’s a (small-surface, yay 2FA) attack vector we’d be defending against here.


I think the decision to make here is how much effort/disruption removing this “quirk” is worth?

I think the benefits of the existing behaviour are worthwhile enough that we shouldn’t just yank this out without thought. It’s a good feature to not need a new release to advertise that you’ve added support for a new Python version (or to have time to resolve issues if you aren’t able to build a Windows release due to an automation issue that took more than a week to resolve).

TBH, I don’t think it would matter much either way if PyPI removed this capability. Package maintainers will adapt[1], since “cut a new release” is a straightforward solution here (if a bit heavy-handed), and most of the cost of doing that falls on whoever does the PyPI implementation work and any surrounding communication work.


  1. Some might also be grumpy about it.

Thanks for sharing this; with that given, it sounds like the “pessimization” argument is pretty weak.

The scenario below is one I had in mind, but there’s another (also arguably very convoluted) I was thinking of:

  1. foo==1.2.3 already exists on the index
  2. The maintainer of foo uses a matrix of CI runners to build and publish wheels in parallel for different (OS, arch, etc.) configurations. This matrix may be dynamic (GHA supports this) and/or grow additional configurations between versions.
  3. The maintainer makes changes in prep for foo==2.0.0 but forgets to bump the actual version
  4. The build/publish workflows run in parallel, meaning that there’s a race condition: a version filename for a wheel configuration that hasn’t appeared before can be successfully uploaded before a conflicting filename fails.
  5. The maintainer sees the eventual failure and assumes that all wheel uploads failed, when really one or more succeeded (and are now being served with an unexpected version)
  6. ???
  7. Extreme user sadness at some indeterminate point in the future

This is arguably pretty unlikely, but not impossible: many projects do use parallel build/publish matrices on CI systems like GHA, and it isn’t inconceivable that the matrix changes or specializes in a way that allows a new wheel to be uploaded to an existing version by accident.
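The partial-success outcome in that scenario can be sketched with a toy index that, like PyPI, rejects a filename that already exists but accepts any new one. `ToyIndex` and the filenames are hypothetical, and the jobs run sequentially here because each upload is independent, so the ordering does not change the result.

```python
class ToyIndex:
    """Hypothetical stand-in for an index that rejects duplicate filenames."""

    def __init__(self, existing):
        self.files = set(existing)

    def upload(self, filename):
        if filename in self.files:
            raise ValueError(f"file already exists: {filename}")
        self.files.add(filename)


index = ToyIndex({"foo-1.2.3-py3-none-any.whl", "foo-1.2.3.tar.gz"})

# Matrix jobs meant for foo==2.0.0, built without bumping the version.
jobs = [
    "foo-1.2.3-cp312-cp312-win_amd64.whl",  # configuration not seen before
    "foo-1.2.3-py3-none-any.whl",           # collides with an existing file
]

results = {}
for filename in jobs:
    try:
        index.upload(filename)
        results[filename] = "uploaded"
    except ValueError:
        results[filename] = "rejected"

for filename, outcome in results.items():
    print(outcome, filename)
```

The workflow as a whole reports a failure, yet the new Windows wheel is now part of the 1.2.3 release.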

Fully agreed!

Slight user sadness. Either an install will fail (not good, but not the end of the world) or it will work (so, good, maybe? or are you considering it bad that the user got a later version than expected and it worked fine?)

Either way, user reports the issue, maintainer yanks the offending file, no real harm done.

It’s not good, but it doesn’t seem like it’s any worse than the (far more likely) “maintainer ships new release with a nasty bug in it” situation…

It’s worse than that, I think: PyPI doesn’t allow single files in a release to be yanked, so a release that gets “tainted” in this way is either permanently broken or has to be entirely yanked (eliminating the value of being able to roll forward additional wheels for the same release).

(The assumption is that the new release is not compatible with the old release, i.e. foo==2.0.0 is intentionally a breaking version but is now being served to some percentage of foo==1.2.3 wheel users for whom the new wheels are more specific than the old ones.)

Similarly, debugging this is obvious from our bird’s eye view, but I think it’d be pretty initially confusing: the user report would be something like “a version that worked for years is suddenly broken,” since user reports generally occur at the layer above wheel/sdist specificity. That’s not to say that it’s impossible to triage or anything, but I do think the root cause here is ultimately pretty opaque (race condition in CI build matrices → partially valid wheel uploads that look like they didn’t actually succeed → dependency resolution includes wheel specificity checks that (reasonably!) aren’t part of the public interface).

With all that, I’ll reiterate that this is speculative and I have no actual quantitative evidence that any package has ended up in this state. It’s just an example I imagined as a “pathological” case for how the current yank/wheel tagging/open-ended behaviors interact :slightly_smiling_face:

That’s a problem with PyPI - PEP 592 allows for yanking at the file level.

So OK, I guess we could add a second restriction to PyPI to mitigate the impact of the first restriction. And maybe that’s even the right thing to do for practical purposes. But it seems a bit off to me, TBH.

It was pointed out to me that there’s something wrong with these queries and results, so I’ve edited the post to hide them behind a details expand, so as to not confuse the conversation.

On the topic of open endedness, I was curious to understand how prevalent the issue was, so I ran a few queries to answer some questions.

  1. What’s the longest an open-ended release has gone between its first file and its most recent one, and for comparison, what are the average and median?
warehouse=> SELECT
    AVG(DATE_PART('day', f.upload_time) - DATE_PART('day', r.created)) AS avg_days_apart,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY DATE_PART('day', f.upload_time) - DATE_PART('day', r.created)) AS median_days_apart,
    MAX(DATE_PART('day', f.upload_time) - DATE_PART('day', r.created)) AS max_days_apart
FROM
    projects p
    INNER JOIN releases r ON r.project_id = p.id
    INNER JOIN release_files f ON f.release_id = r.id
WHERE
    DATE_PART('day', f.upload_time) - DATE_PART('day', r.created) > 2;
  avg_days_apart   | median_days_apart | max_days_apart
-------------------+-------------------+----------------
 9.453471673727293 |                 7 |             30

Note: I use a date difference of 2, since 1 surfaced a fair number of releases that happened around midnight and so spanned two days. I didn’t want to change my queries to do second-precision date math since it wasn’t relevant to my queries and I was a little lazy.

So while open-endedness is possible, the worst offenders I could find were at most 30 days.

  2. In recent times, what are some of the projects that have added uploads after the initial release?
warehouse=> SELECT
    MAX(f.upload_time) AS most_recent_file_date,
    p.name AS project_name,
    r.version AS release_version,
    DATE_PART('day', f.upload_time) - DATE_PART('day', r.created) AS days_apart
FROM
    projects p
    INNER JOIN releases r ON r.project_id = p.id
    INNER JOIN release_files f ON f.release_id = r.id
WHERE
    DATE_PART('day', f.upload_time) - DATE_PART('day', r.created) > 2
GROUP BY 2, 3, 4
ORDER BY 1 DESC, 4 DESC, 2, 3
LIMIT 10;
   most_recent_file_date    |     project_name     | release_version | days_apart
----------------------------+----------------------+-----------------+------------
 2024-01-31 22:17:17.33082  | cybotrade-indicators | 0.0.7           |         22
 2024-01-31 22:11:14.91034  | crosspy              | 0.0.0a3.dev74   |         26
 2024-01-31 22:01:20.637333 | libroadrunner        | 2.5.0           |         17
 2024-01-31 19:58:18.50832  | temporian            | 0.7.0           |         19
 2024-01-31 17:14:49.46431  | topoly               | 1.0.3           |         12
 2024-01-31 15:48:17.348508 | numexpr              | 2.9.0           |          5
 2024-01-31 10:02:09.885435 | autoai-libs          | 1.16.2          |          9
 2024-01-31 07:03:12.093301 | rqfactor             | 1.3.16          |         19
 2024-01-31 06:23:15.471182 | clika-inference      | 0.0.0           |          7
 2024-01-31 06:23:13.950191 | clika-compression    | 0.0.0           |          7

Feel free to dig into why some of these have these behaviors.

  3. How many projects overall exhibit this pattern?
warehouse=> SELECT
    COUNT(DISTINCT p.id) AS num_projects
FROM
    projects p
    INNER JOIN releases r ON r.project_id = p.id
    INNER JOIN release_files f ON f.release_id = r.id
WHERE
    DATE_PART('day', f.upload_time) - DATE_PART('day', r.created) > 2;
 num_projects
--------------
         8715

Anyhow, I’m not weighing in on the merits or risks of open-ended releases, but I often like to have some data handy when trying to understand the scope.

I couldn’t say what is wrong with the query, but the first one at least is certainly giving the wrong answer.

E.g. PyYAML on PyPI shows uploads ranging from Jul 18 to Jan 18, which is much more than 30 days.
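One plausible explanation, offered as a guess rather than a confirmed diagnosis: in PostgreSQL, `DATE_PART('day', ...)` applied to a timestamp returns the day-of-month field, so subtracting two such values compares calendar days rather than elapsed time and can never exceed 30. The Python sketch below mirrors that arithmetic with the PyYAML dates; the years are assumed purely for illustration.

```python
from datetime import datetime

# Jul 18 and Jan 18 are from the post; the years are an assumption.
created = datetime(2023, 7, 18)
uploaded = datetime(2024, 1, 18)

# What a day-of-month subtraction computes.
day_field_diff = uploaded.day - created.day

# What was presumably intended: elapsed days between the two uploads.
elapsed_days = (uploaded - created).days

print(day_field_diff)  # 0
print(elapsed_days)    # 184
```

Under this reading, a release like this one would be filtered out by the `> 2` condition entirely, despite a gap of about six months.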

For the Pillow project, we have at least once had an occasion when the CI failed to build a subset of wheels. Rather than delaying the release, we released the majority of the wheels that built successfully, then figured out and fixed the rest, and uploaded those later.

A bit more commonly, we’ve sometimes found a problem with a badly compiled wheel after release, and then uploaded a fixed version with a build number.

Why not a patch release? For one, we might not want to ping everyone with a “new” release when there are no functional changes, and for most people, the binaries haven’t changed either.

Also, previously our release process had some manual steps. At one point, we had one person who built the Windows wheels, so we had to pause the release to wait for them. All this made new releases quite a slow and hefty process compared to uploading new files to an existing release.

Our next release will be fully automated using the excellent cibuildwheel and Trusted Publishing to deploy from CI to PyPI, so new patch releases should be much easier, but I still have some concern about using 34 hours of CI time to build, uploading 22 MB of 50 wheels to PyPI, and pinging a new release.

You’re right, and thanks for poking through my assumptions! I’ll have to fix the query and run through them again.

I spent some time trying to use the distribution_metadata table in the public BigQuery set, so that anyone else could query/verify, but there are some curiosities in there, likely due to project name transfers over time, so I abandoned that approach.

I redid one of the queries to get the “meat” of it - and here’s a table (behind the details).

warehouse=> SELECT
    p.name AS project_name,
    r.version AS release_version,
    ROUND(EXTRACT(EPOCH FROM MAX(f.upload_time) - MIN(f.upload_time)) / 86400) AS days_apart
FROM
    projects p
    INNER JOIN releases r ON r.project_id = p.id
    INNER JOIN release_files f ON f.release_id = r.id
GROUP BY 1, 2
ORDER BY 3 DESC, 1, 2 DESC
LIMIT 30;
      project_name      | release_version | days_apart
------------------------+-----------------+------------
 ioLabs                 | 3.2             |       4067
 digest                 | 1.0.2           |       4041
 django-image-sitemaps  | 1.02            |       3760
 HeapDict               | 1.0.0           |       3615
 waipy                  | 0.0.1           |       3602
 waipy                  | 0.0.6           |       3597
 waipy                  | 0.0.5           |       3597
 waipy                  | 0.0.4           |       3597
 waipy                  | 0.0.3           |       3597
 waipy                  | 0.0.7           |       3595
 python-xmp-toolkit     | 2.0.1           |       3592
 waipy                  | 0.0.8.1         |       3577
 waipy                  | 0.0.8           |       3577
 waipy                  | 0.0.8.6         |       3525
 setuptools             | 0.6c3           |       3405
 vim-bridge             | 0.5             |       3339
 setuptools             | 0.6c5           |       3294
 setuptools             | 0.6c4           |       3294
 geopy                  | 0.93            |       3273
 multipart              | 0.1             |       3253
 jaraco.input           | 1.0.1           |       3223
 setuptools             | 0.6c6           |       3152
 visibility-graph       | 0.4             |       3084
 graphistry             | 0.9.9           |       3068
 export                 | 0.1.0           |       3067
 setuptools             | 0.6c7           |       3056
 metasyntactic          | 0.99            |       3045
 AddOns                 | 0.7             |       2999
 hr                     | 0.1             |       2982
 sphinxcontrib-spelling | 1.1             |       2938
(30 rows)

The top one - ioLabs - looks like they uploaded a wheel 11 years after the initial release.
The waipy ones appear to be a slew of .egg files added to older releases.

Anyhow, if there’s other data that would prove interesting that I can get at, let me know!

To add a data point from the other side (a dependency management tool):

In Poetry, we sort of ignore that wheels can be added to an existing release later - for performance reasons. Of course, that has the disadvantage that you have to clear your cache to get wheels that have been added later. We also receive issue reports about missing wheels in the lockfile from time to time (which we respond to with “Please clear your cache”). But in the end, I think the performance benefit might still be worth it - at least for Poetry.
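The trade-off can be sketched with a cache keyed on (name, version) that never re-checks the index once a release has been seen. This is a hypothetical illustration of the behaviour described, not Poetry’s actual implementation; `ReleaseFileCache` and the in-memory `index` are invented for the example.

```python
class ReleaseFileCache:
    """Caches a release's file list forever, assuming releases are closed."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable (name, version) -> list of filenames
        self._cache = {}

    def files(self, name, version):
        key = (name, version)
        if key not in self._cache:
            # Only the first lookup pays the cost of hitting the index.
            self._cache[key] = list(self._fetch(name, version))
        return self._cache[key]

    def clear(self):
        self._cache.clear()


# Toy index contents; a wheel is appended after the first lookup.
index = {("foo", "1.2.3"): ["foo-1.2.3.tar.gz"]}
cache = ReleaseFileCache(lambda name, version: index[(name, version)])

first = cache.files("foo", "1.2.3")
index[("foo", "1.2.3")].append("foo-1.2.3-cp312-cp312-win_amd64.whl")
stale = cache.files("foo", "1.2.3")   # still the old listing

cache.clear()                         # "Please clear your cache"
fresh = cache.files("foo", "1.2.3")

print(stale)
print(fresh)
```

If releases could not gain files after some cutoff, a cache like this would be correct for any release past that cutoff without needing to be cleared.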

Hey everyone,

Coming back to this thread, I think that the circumstances around this change have evolved. At this point, malware authors are exploiting any lever of “mutable references” that is available to them. In my mind, this behavior is itself a mutable reference.

Not all users are using hashes and lock files, and resolution still downloads and executes code on users’ machines. We should work towards closing this gap on PyPI by limiting the upload window to some shorter timespan. If there needs to be follow-up work or if we need to adjust the value later, that is all fine to do. Having this hole open now is worrying to me in our new age of autonomous and chain-able exploitation: it just makes the messes harder to clean up and reason about when things go wrong.

If we’re supporting the “noticed botched release, want to fix it without needing a new release” use case, then we’re talking on the timescale of days. Even a value of 14 days until a release is “locked” would go a long way to preventing mass-cleanup events.
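As a minimal sketch of what such a lock could look like at upload time: the function name is hypothetical, and measuring the window from the release’s creation time (rather than, say, its first file upload) is an assumption, not something the proposal specifies.

```python
from datetime import datetime, timedelta, timezone

LOCK_AFTER = timedelta(days=14)  # value floated above; explicitly adjustable


def upload_allowed(release_created, now, lock_after=LOCK_AFTER):
    """Allow new files only while the release is younger than the window."""
    return now - release_created <= lock_after


created = datetime(2024, 1, 1, tzinfo=timezone.utc)

print(upload_allowed(created, created + timedelta(days=3)))   # True
print(upload_allowed(created, created + timedelta(days=15)))  # False
```

Because the check needs only the release timestamp and the current time, changing the window later is a one-line adjustment.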

My reading of the primary disagreements from the thread is:

  • Whether this prevents supply chain attacks at all: it doesn’t prevent them, but it makes the clean-up easier. Users not doing exactly the right thing (pinning old versions instead of hashes) are also more protected.
  • Staging releases would be better: agree, but they are a significant feature including UI design and project opt-in. I think we should consider it separately to this suggestion. Implementing this feature is one more database query during the upload step and is transparent to most users.
  • More ecosystems support updating releases: agree, but many of these ecosystems have additional protections in place, like staging environments. On PyPI, uploads go live for installs almost immediately, and there are no staging environments.

If there isn’t any disagreement about implementing this feature in some form in PyPI, I would recommend we do so, and then we can discuss and change the exact value later.

I would suggest 7 days or another release, with either one triggering a block on further uploads to the prior releases. 14 days seems like a long time for a botched release to go unnoticed, and yanking + a post release remains a viable option.

Any improvement here at all is still valuable for limiting the impact on those not already doing the best things possible, so as long as we can pick options that don’t too negatively impact project maintainers, I think it’s a valuable direction to be moving in.
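The “7 days or another release” rule suggested in this post could be sketched as follows. The function name is hypothetical, and reading “another release” as “a newer release of the same project exists” is an interpretation of the suggestion.

```python
from datetime import datetime, timedelta, timezone


def release_locked(first_upload, newer_release_exists, now,
                   window=timedelta(days=7)):
    """A release locks when either trigger fires, whichever comes first."""
    expired = now - first_upload > window
    return newer_release_exists or expired


start = datetime(2024, 1, 1, tzinfo=timezone.utc)

# Two days in with no newer release: still open for more wheels.
print(release_locked(start, False, start + timedelta(days=2)))  # False
# Two days in, but a newer release has been published: locked.
print(release_locked(start, True, start + timedelta(days=2)))   # True
# Eight days in with no newer release: locked by the timer.
print(release_locked(start, False, start + timedelta(days=8)))  # True
```

Compared with a plain timer, the extra trigger means an attacker holding stolen credentials cannot quietly add files to an older release once a newer one is out.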

Given we have ongoing attacks making use of stolen API keys, locking down open-ended releases seems like a good protection (nothing much we can do about preventing new releases, but at least those are easily yanked or pinned). Not sure if Seth is re-upping this today because of that incident :wink:

Seven days or a new release seems reasonable. This one was apparently detected within about 24 hours, but it’s within seven days of the previous release, so they could well have tried to add more specific wheels to that one instead of creating a new one, and it wouldn’t have been prevented. Maybe there could be a (web UI) option for shorter deadlines? Ideally I’d like to see a web UI override for publishers, maybe something like “reactivate publishing for another seven days”[1], but I expect that’d be an incredibly rare action, so having it is really just to reassure the much larger proportion of publishers who think they’ll need it.


  1. IOW, recalculate the deadline in the DB for preventing new uploads.

While I agree with this for most cases, I think it will require more documentation and has a higher potential for “gotchas” that are visible from a maintainer POV. The implementation is also more complicated than a simple timer.

I’ve created a draft pull request to PyPI with 14 days (which can be reduced if we agree and have better data about median release ages). I do think 7 days is /probably/ fine, but without data on this I think going conservative is best.

I’ll note that I don’t want applying this simple stop-gap to stop discussion of alternatives, I just want to make sure the brakes work while we’re taking our time figuring out the optimal solution.

FTR, I think my opinion is the same as it was 2 years ago:

I think that this proposal does close a real capability that could be used to attack people in a way that is, by default, pretty “quiet”. I also think that capability can, at least in theory, be used for useful and positive reasons.

I don’t think that any real decision one way or the other could be made here unless we get some real numbers behind how often people actually use that capability for useful and positive reasons, and even then we should consider if there are other mechanisms we can put into place to mitigate without removing that capability.

As far as I know, we still have very little data on whether people are using this capability for positive reasons or not? The only data I’ve seen surfaced is what @miketheman posted, which looks to be the top 30 packages with the biggest gap between the first file uploaded for a given version, and the last file uploaded.

That tells us that some people have had very long (11 years being the longest!) time deltas between uploads, but I think that’s less relevant than how widespread the practice is, and whether the projects using that functionality are doing so for good reasons.

Do we have any idea how many projects this will break (or would have broken) currently?

Likely those numbers should have some sort of bias towards recency, as I’m less concerned about what people were doing 10+ years ago and more concerned about what is in practice today.

This is only true for sdists, of which we only allow a single sdist per version (and if you delete it, you can’t reupload it).

The time-based lock down seems like a workaround for not having Upload API 2.0, which would solve this problem (and a bunch of others) without having an arbitrary cutoff by allowing PyPI to make releases atomic instead of one-file-at-a-time.

If “Upload API 2.0” is different than staged releases, then it falls in the same bucket I addressed when reopening the discussion: the time-based lock is a stopgap that addresses an exploitable issue now, giving us time to deliberately design the better solution(s). This isn’t a replacement for upload 2.0, staged releases, or any future better solution.
