Publishing nightly builds on test.pypi.org with a time-based retention policy

Is there a standard way to publish timestamped nightly wheels (such as projectname-X.Y.Z.dev0+20200116035252-cpXX-cpXX-win_amd64.whl or projectname-X.Y.Z.dev20200116-cpXX-cpXX-win_amd64.whl) on test.pypi.org and to have a system to automatically delete wheels that are older than a couple of days?
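For illustration, a timestamped dev version of the second form can be generated with a UTC date stamp, following the PEP 440 `devN` convention. This is a minimal sketch; the base version and function name are assumptions, not part of any standard tooling:

```python
from datetime import datetime, timezone

def nightly_version(base_version):
    """Return a PEP 440 dev version with a date-based dev number,
    e.g. "1.0.0" -> "1.0.0.dev20200116" (stamp depends on today's date)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d")
    return f"{base_version}.dev{stamp}"
```

Because the dev number grows monotonically with the date, pip's version ordering will always prefer the most recent nightly.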

For scikit-learn we typically build binary wheels for at least 3 Python versions times 3 platforms times ~1.5 (to account for both 32-bit and 64-bit Python on Windows and Linux). Each wheel is at least 4 MB, so that is on the order of ~50 MB per day just for scikit-learn, or ~20 GB per year if the old files are not automatically deleted.

I noticed that tensorflow has 2 ancillary packages for nightly builds: tf-nightly and tf-nightly-gpu, each publishing new wheels every day. The tf-nightly wheels alone seem to weigh ~2.2 GB per day, which is ~800 GB per year.

This looks very wasteful to me.

If there is no built-in way to set up retention policies for timestamped dev releases on pypi.org or test.pypi.org, one could try to set up a cron job on some CI server to automatically delete older files. However the Warehouse API does not seem to allow for file deletion: https://warehouse.readthedocs.io/api-reference/

It seems to me that the best answer for temporary releases like nightlies would be to set up your own simple index (PEP 503 has the format you need, it’s not complicated) and direct your users to use that. You can set your own retention policies, etc, without needing to wait for Warehouse to implement anything.
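To show how little is involved, here is a minimal sketch of generating a PEP 503 project page for a directory of wheels. The directory layout, file names, and function name are hypothetical; a real index would also need the top-level project listing and correct content types when served:

```python
import html
from pathlib import Path

def build_project_index(project_dir):
    """Build a minimal PEP 503 "simple" project page linking every wheel
    found in project_dir (hypothetical layout: one directory per project)."""
    links = "\n".join(
        f'<a href="{html.escape(whl.name)}">{html.escape(whl.name)}</a><br/>'
        for whl in sorted(Path(project_dir).glob("*.whl"))
    )
    return f"<!DOCTYPE html>\n<html><body>\n{links}\n</body></html>"
```

Users would then point pip at the hosted result with `--extra-index-url`.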

Thanks Paul.

For a single project that would work fine. But ideally we would like a single index shared by several projects whose CI workers upload and download each other's nightly builds, so that each project can run its tests against the master branch of the others without having to re-build all the upstream dependencies each time.

I liked the idea of using test.pypi.org for this as it makes it possible to have a shared index where each project maintainers’ team can manage its own upload credentials / tokens: the scikit-learn developers can only upload the scikit-learn wheels and cannot mess around with the numpy wheels…

In the meantime I think we will use the anaconda cloud service, which can provide a PEP 503 compatible index, but as far as I know it would not provide per-project upload permissions on a shared index.

Hmm, you could still have a single index that contains URLs to project-specific areas, surely? Or alternatively, I imagine something like devpi could handle this.

I guess the implied requirement here is “a hosted service that already exists so we don’t have to spend project resources building a publishing solution rather than working on the projects”. But in that case, I don’t think there is such a thing. As you say, PyPI/warehouse is not really designed for large, transient artefacts of the sort you’re describing.


For the record, anaconda.org allows for per-package upload permissions in a shared organization feed. So it sounds like a good solution for our use case.

For the longer term, I still think it would be nice for the wider Python community to have a standard way to publish nightly builds on an official channel, for instance on a nightly.pypi.org instance of Warehouse (with a generic time-based or sequence-based retention policy). This would make it easy for the test automation of all projects to run their tests against the latest development branch of all their dependencies.
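A time-based retention policy of this kind is simple in principle. As a sketch (assuming a plain directory of wheels behind a self-hosted index; the function name and the 2-day cutoff are illustrative assumptions):

```python
import time
from pathlib import Path

def prune_old_wheels(wheel_dir, days=2):
    """Delete wheels in wheel_dir whose modification time is older than
    `days` days, and return the paths that were removed."""
    cutoff = time.time() - days * 86400
    removed = []
    for whl in sorted(Path(wheel_dir).glob("*.whl")):
        if whl.stat().st_mtime < cutoff:
            whl.unlink()
            removed.append(whl)
    return removed
```

Run from a daily cron job, this keeps the index to roughly `days` times the daily upload volume, regardless of how many projects share it.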

It would even make it easier for CPython itself to test that a new Python release quickly gains support from the major top-level packages of the ecosystem.


If you have an Azure Pipelines account, you should be able to set up an Azure Artifacts feed. Unfortunately I don’t think the permissions allow for public read/authenticated upload yet, but it might suit your needs?

There is public read for public projects.

It’s possible to create tokens to allow several open source projects by different teams to push their nightly wheels into a shared feed but:

  • the maximum duration of a token (Personal Access Token) is 1 year, which means that continuous integration systems will have to renew their tokens every year;
  • there are no per-package permissions (only feed-level permissions), so open-source-project-a can upload a new version of open-source-project-b if they both share a feed.

Anaconda channels are more versatile and granular w.r.t. permissions and tokens, with per-package upload semantics closer to those of the main pypi.org server.