This informational PEP is intended to be a reference
for CI services and CI implementors,
and a request for guidelines, tools, and best practices.
Working titles; seeking feedback:
- Guide for PyPI CI Service Providers
- Request from and advisory for CI Services and CI implementors
- PyPI cost solutions: CI, mirrors, containers, and caching to scale
- PyPI-dependent CI Service Provider and implementor Guide
See “Open Issues”:
- Does this need to be a PEP?
- No: It’s merely an informational advisory and a request
for consideration of sustainable resource utilization practices.
- Yes: It might as well be maintained as the document to be
sent to CI services which are unnecessarily using significant
amounts of bandwidth.
PEP: 9999
Title: PyPI-dependent CI Service Provider and Implementor Guide
Author: Wes Turner
Sponsor: [Full Name ]
BDFL-Delegate:
Discussions-To: […]
Status: Draft
Type: Informational
Content-Type: text/x-rst
Requires: [NNN]
Created: 2020-03-07
Resolution:
Abstract
Continuous Integration automated build and testing services
can help reduce the costs of hosting PyPI by running local mirrors
and advising clients on how to efficiently re-build
software hundreds or thousands of times a month
without re-downloading everything from PyPI every time.
This informational PEP is intended to be a reference
for CI services and CI implementors,
and a request for guidelines, tools, and best practices.
Motivation
- The costs of maintaining PyPI are increasing exponentially.
- CI builds impose significant load upon PyPI.
- Frequently re-downloading the exact same packages
is wasting PyPI’s and CI services’ time, money, and bandwidth.
- Perhaps the primary issue is lack of awareness
of solutions for reducing resource requirements
and thereby costs for all involved.
- Many thousands of projects are over-utilizing donated resources
when CI services could solve this centrally and more efficiently.
Request from and advisory for CI Services and CI Implementors
Dear CI Service,
- Please consider running local package mirrors and enabling use of local
package mirrors by default for clients’ CI builds.
- Please advise clients regarding more efficient containerized
software build and test strategies.
Running local package mirrors will conserve the generously donated
resources on which PyPI (the Python Package Index, a service maintained
by the PyPA, a group within the non-profit Python Software Foundation)
runs.
(As of March 2020, operating PyPI costs roughly $800,000 USD per month,
much of which is covered by generously donated resources and services.)
If you would prefer to donate to the PSF instead (or in addition), [earmarked]
donations are very welcome and will be publicly acknowledged.
Data locality through caching is the solution
to efficient software distribution. There are a number of opportunities
to cache package downloads and thereby (1) reduce bandwidth
requirements, and (2) reduce build times:
- ~/.cache/pip – This does not persist across hermetically isolated container invocations
- Network-local package repository mirror
- Container image
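For example, a CI job can combine a persistent pip cache directory with a
network-local index so that repeat builds need not reach pypi.org at all.
A minimal sketch (the cache path and the mirror hostname are hypothetical
and depend on the CI service and mirror in use)::

    # Reuse a persistent pip cache directory between CI runs
    # (the path must live on storage the CI service preserves between jobs).
    export PIP_CACHE_DIR=/ci-cache/pip

    # Point pip at a network-local mirror or proxy cache instead of pypi.org.
    # "pypi-mirror.internal" is a hypothetical in-network hostname.
    export PIP_INDEX_URL=https://pypi-mirror.internal/simple/

    pip install -r requirements.txt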
There are many package mirroring solutions for Python packages,
for other package ecosystems, and for containers:
- A full mirror
- bandersnatch: [new user]
- A partial mirror:
- pulp: [new user]
- Pulp also handles RPM, Debian, Puppet, Docker, and OSTree
- A transparent proxy cache mirror
- devpi: [new user] (see the proxy-cache sketch after this list)
- Dumb HTTPS cache with maximum filesize:
- squid?
- IPFS
- IPFS for software package repository mirroring is an active area of
research.
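As a concrete example, devpi can act as a transparent proxy cache: its
root/pypi index fetches a package from PyPI the first time it is requested
and serves it from the local cache afterwards. A minimal sketch (the paths
are arbitrary, and the flag names reflect devpi-server 5.x; check the
documentation for the release in use)::

    pip install devpi-server

    # One-time state initialization on first run
    # (devpi-server --init in 5.x; a separate devpi-init command in later releases).
    devpi-server --serverdir /srv/devpi --init

    # Serve the caching proxy to the build network.
    devpi-server --serverdir /srv/devpi --host 0.0.0.0 --port 3141 &

    # CI builds then install through the proxy cache instead of pypi.org.
    pip install --index-url http://localhost:3141/root/pypi/+simple/ -r requirements.txt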
Containers:
- OCI Container Registry
- Notary (TUF): [new user]
- Amazon Elastic Container Registry: [new user]
- Azure Container Registry: [new user]
- Docker registry: [new user]
- DockerHub: [new user]
- GitLab Container Registry: [new user]
- Google Container Registry: [new user]
- RedHat Quay Container Registry: [new user]
- Container Build Services
- Any CI Service can be used to build and upload a container
There are approaches to making individual (containerized) (Python)
software package builds more efficient:
A. Build a named container image containing the necessary dependencies,
upload the container image to a container registry,
reuse the container image for subsequent builds of your
package(s)
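A hedged sketch of approach A (the registry host, image names, and tags are
made-up placeholders): build and push a dependency-only image whenever
requirements.txt changes, then have every CI build start from that image so
dependencies are not re-downloaded from PyPI on each run::

    # Build and publish a dependency-only base image
    # (run occasionally, e.g. whenever requirements.txt changes).
    docker build -t registry.example.com/myorg/myapp-deps:2020-03 -f- . <<'EOF'
    FROM python:3.8-slim
    COPY requirements.txt /tmp/requirements.txt
    RUN pip install -r /tmp/requirements.txt
    EOF
    docker push registry.example.com/myorg/myapp-deps:2020-03

    # Each CI build then reuses the pre-built dependencies and only
    # installs and tests the package under test.
    docker build -t myapp:test -f- . <<'EOF'
    FROM registry.example.com/myorg/myapp-deps:2020-03
    COPY . /app
    WORKDIR /app
    # Assumes the test runner (e.g. pytest) is pinned in requirements.txt.
    RUN pip install --no-deps . && python -m pytest
    EOF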
B. Automate updates of pinned dependency versions using a
free or paid service that regularly audits dependency specifications
stored in source code repositories and sends pull requests
to update the pinned versions.
C. Create a multi-stage Dockerfile that downloads all of the
(version-pinned) dependencies in an initial stage and COPYs them
into a later stage, which builds and tests the package under test
- TODO: what’s the best way to do this?
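One possible shape for approach C, offered as a sketch rather than a
definitive answer to the TODO above (the base image and file names are
assumptions): the first stage downloads the pinned dependencies as wheels,
and the second stage installs only from those wheels, so only the
wheel-building layer ever talks to PyPI and Docker’s layer cache keeps it
from re-running while requirements.txt is unchanged::

    docker build -t myapp:test -f- . <<'EOF'
    # Stage 1: download/build wheels for the pinned dependencies only.
    FROM python:3.8-slim AS wheels
    COPY requirements.txt /tmp/requirements.txt
    RUN pip wheel --wheel-dir=/wheels -r /tmp/requirements.txt

    # Stage 2: install from the local wheel directory; no index access needed.
    FROM python:3.8-slim
    COPY --from=wheels /wheels /wheels
    COPY requirements.txt /tmp/requirements.txt
    RUN pip install --no-index --find-links=/wheels -r /tmp/requirements.txt
    COPY . /app
    WORKDIR /app
    # Assumes the test runner (e.g. pytest) is among the pinned requirements.
    RUN python -m pytest
    EOF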
D. Use Docker’s BuildKit build cache
- This requires DOCKER_BUILDKIT=1 to be set so that the
# syntax=docker/dockerfile:experimental directive and
RUN --mount=type=cache,target=/root/.cache/pip work
- TODO: what’s the best way to do this? (a sketch follows below)
- “build time only -v option”: [new user]
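A sketch of approach D using exactly the directives quoted above (the base
image and file names are assumptions; the pip cache lives in a
BuildKit-managed cache volume on the builder host, not in an image layer)::

    DOCKER_BUILDKIT=1 docker build -t myapp:test -f- . <<'EOF'
    # syntax=docker/dockerfile:experimental
    FROM python:3.8-slim
    COPY requirements.txt /tmp/requirements.txt
    # The cache mount persists /root/.cache/pip between builds on the same
    # builder host, so unchanged dependencies are not re-downloaded from PyPI.
    RUN --mount=type=cache,target=/root/.cache/pip \
        pip install -r /tmp/requirements.txt
    COPY . /app
    WORKDIR /app
    # Assumes the test runner (e.g. pytest) is among the pinned requirements.
    RUN python -m pytest
    EOF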
E. Use a container build tool that supports mounting volumes at build
time (podman, buildah) and mount in the ~/.cache/pip directory
for all builds so that your build doesn’t need to re-download
everything from PyPI on every CI build.
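A sketch of approach E with podman (buildah has an equivalent option); the
host-side cache path is whatever persistent directory the CI service
provides::

    # Mount the host's pip cache into the build container so that every
    # "pip install" run during the build can reuse previously downloaded files.
    # (On SELinux-enforcing hosts a :z or :Z option on the volume may be required.)
    podman build \
        --volume "$HOME/.cache/pip:/root/.cache/pip" \
        -t myapp:test .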
Security Implications
- Any external dependency is a security risk.
- When software dependencies are not cached,
the devops workflow cannot run when the external dependency is
unavailable.
- TUF (The Update Framework) may help mitigate cache-poisoning risks.
PyPI (per PEP 458) and CNCF Notary implement cryptographic signing
based on TUF.
How to Teach This
- A more detailed guide explaining how to do multi-stage builds that
cache dependencies?
- Update packaging.python.org?
- Expand upon the instructions herein
Reference Implementation
- Does anyone have examples of CI services that are doing this well
/ correctly? E.g. with proxy-caching on by default
Rejected Ideas
[Why certain ideas that were brought while discussing this PEP were not ultimately pursued.]
Open Issues
- Request for guidelines, tools, and best practices.
- Does this need to be a PEP?
- No: It’s merely an informational advisory and a request
for consideration of sustainable resource utilization practices.
- Yes: It might as well be maintained as the document to be
sent to CI services which are unnecessarily using significant
amounts of bandwidth.
References
[A collection of URLs used as references through the PEP.]
Copyright
This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.
(I’ve also posted this draft to the pypa-dev Google Group, where I didn’t need to replace links with [new user].)