Draft PEP: PyPI cost solutions: CI, mirrors, containers, and caching to scale

This informational PEP is intended to be a reference
for CI services and CI implementors;
and a request for guidelines, tools, and best practices.

Working titles; seeking feedback:

  • Guide for PyPI CI Service Providers
  • Request from and advisory for CI Services and CI implementors
  • PyPI cost solutions: CI, mirrors, containers, and caching to scale
  • PyPI-dependent CI Service Provider and implementor Guide

See “Open Issues”:

  • Does this need to be a PEP?
  • No: It’s merely an informational advisory and a request
    for consideration of sustainable resource utilization practices.
  • Yes: It might as well be maintained as the document to be
    sent to CI services which are unnecessarily using significant
    amounts of bandwidth.

PEP: 9999
Title: PyPI-dependent CI Service Provider and Implementor Guide
Author: Wes Turner
Sponsor: [Full Name ]
BDFL-Delegate:
Discussions-To: […]
Status: Draft
Type: [Standards Track | Informational | Process]
Content-Type: text/x-rst
Requires: [NNN]
Created: 2020-03-07
Resolution:

Abstract

Continuous Integration (CI) build and testing services
can help reduce the costs of hosting PyPI by running local package mirrors
and by advising clients on how to efficiently rebuild
software hundreds or thousands of times a month
without re-downloading everything from PyPI every time.

This informational PEP is intended to be a reference
for CI services and CI implementors;
and a request for guidelines, tools, and best practices.

Motivation

  • The costs of maintaining PyPI are increasing exponentially.
  • CI builds impose significant load upon PyPI.
  • Frequently re-downloading the exact same packages
    is wasting PyPI and CI services’ time, money, and bandwidth.
  • Perhaps the primary issue is lack of awareness
    of solutions for reducing resource requirements
    and thereby costs for all involved.
  • Many thousands of projects are over-utilizing donated resources
    when CI services could solve this centrally
    in a more efficient way.

Request from and advisory for CI Services and CI Implementors

Dear CI Service,

  1. Please consider running local package mirrors and enabling use of local
    package mirrors by default for clients’ CI builds.
  2. Please advise clients regarding more efficient containerized
    software build and test strategies.

Running local package mirrors will conserve the generously donated
resources of PyPI (the Python Package Index, a service maintained by
the PyPA, a group within the non-profit Python Software Foundation).
At present (March 2020), PyPI costs ~ $800,000 USD a month to operate,
even with generously donated resources.

If you would prefer to instead or also donate to PSF, [earmarked]
donations are very welcome and will be publicly acknowledged.

Data locality through caching is the key to
efficient software distribution. There are a number of opportunities
to cache package downloads and thereby (1) reduce bandwidth
requirements and (2) reduce build times:

  • ~/.cache/pip – This does not persist across hermetically isolated container invocations
  • Network-local package repository mirror
  • Container image
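For the first of these, `~/.cache/pip` survives between builds only if it lives on a path the CI service persists. A minimal sketch, assuming a hypothetical `CI_CACHE_DIR` variable for whatever directory your CI restores between jobs (the names and paths here are made up, not any particular CI's API):

```shell
# Point pip's download cache at a directory the CI service restores
# between builds; CI_CACHE_DIR is a placeholder for whatever path
# your CI persists (defaulting to /tmp here just for illustration).
export PIP_CACHE_DIR="${CI_CACHE_DIR:-/tmp/ci-cache}/pip"
mkdir -p "$PIP_CACHE_DIR"

# The first build populates the cache; later builds reuse the
# downloaded files instead of fetching them from PyPI again.
if [ -f requirements.txt ]; then
    pip install -r requirements.txt
fi
```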

There are many package mirroring solutions for Python packages
and other packages and containers:

  • A full mirror
    • bandersnatch: [new user]
  • A partial mirror:
    • pulp: [new user]
      • Pulp also handles RPM, Debian, Puppet, Docker, and OSTree
  • A transparent proxy cache mirror
    • devpi: [new user]
    • Dumb HTTPS cache with maximum filesize:
      • squid?
  • IPFS
    • IPFS for software package repository mirroring is an active area of
      research.
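Whichever mirror is chosen, clients opt in through pip's index URL. A sketch, assuming a devpi-style mirror at an invented internal hostname:

```shell
# "mirror.internal" is a placeholder for your network-local mirror;
# devpi exposes its PyPI-proxying root index at a path like this.
export PIP_INDEX_URL="https://mirror.internal/root/pypi/+simple/"

# The equivalent persistent configuration in pip.conf:
#   [global]
#   index-url = https://mirror.internal/root/pypi/+simple/
```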

Containers:

  • OCI Container Registry
    • Notary (TUF): [new user]
    • Amazon Elastic Container Registry: [new user]
    • Azure Container Registry: [new user]
    • Docker registry: [new user]
    • DockerHub: [new user]
    • GitLab Container Registry:
      [new user]
    • Google Container Registry: [new user]
    • RedHat Quay Container Registry: [new user]
  • Container Build Services
    • Any CI Service can be used to build and upload a container

There are approaches to making individual (containerized) (Python)
software package builds more efficient:

A. Build a named container image containing the necessary dependencies,
upload the container image to a container registry, and
reuse that image for subsequent builds of your
package(s)
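Option A might look like the following (the base image tag, registry name, and file layout are assumptions for illustration, not a recommendation):

```dockerfile
# A reusable CI base image with pinned dependencies baked in.
FROM python:3.8-slim
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Build and push it once (names are placeholders):
#   docker build -t registry.example.com/myorg/ci-base:2020-03 .
#   docker push registry.example.com/myorg/ci-base:2020-03
# Subsequent CI builds start FROM this image and skip the PyPI downloads.
```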
B. Automate updates of pinned dependency versions using a
free or paid service that regularly audits dependency specifications
stored in source code repositories and sends pull requests
to update the pinned versions.
C. Create a multi-stage Dockerfile that downloads all of the
(version-pinned) dependencies
in an initial stage and COPYs them into a later stage, which builds
and tests the package under test

  • TODO: what’s the best way to do this?
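One possible answer to the TODO: a sketch of option C (untested; the stage name and file layout are assumptions):

```dockerfile
# Stage 1: fetch pinned dependencies as wheels; only this stage needs
# network access, and its layer is cached while requirements.txt is
# unchanged.
FROM python:3.8-slim AS wheels
COPY requirements.txt .
RUN pip wheel --wheel-dir=/wheels -r requirements.txt

# Stage 2: install offline from the wheel directory, then build/test.
FROM python:3.8-slim
COPY --from=wheels /wheels /wheels
COPY . /app
RUN pip install --no-index --find-links=/wheels -r /app/requirements.txt
```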

D. Use a docker image as a cache

  • This requires DOCKER_BUILDKIT=1 to be set
    so that # syntax=docker/dockerfile:experimental
    and RUN --mount=type=cache,target=/root/.cache/pip work
  • TODO: what’s the best way to do this?
  • “build time only -v option”
    [new user]
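A hedged sketch of option D (the syntax directive and cache-mount flag are as BuildKit documents them; the base image and requirements file are assumptions):

```dockerfile
# syntax=docker/dockerfile:experimental
# Build with: DOCKER_BUILDKIT=1 docker build .
FROM python:3.8-slim
COPY requirements.txt .
# BuildKit mounts a persistent cache directory at pip's cache path,
# so repeated builds reuse previously downloaded files.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt
```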

E. Use a container build tool that supports mounting volumes at build
time (podman, buildah) and mount in the ~/.cache/pip directory
for all builds so that your build doesn’t need to re-download
everything from PyPI on every CI build.
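For option E, a sketch (the image name is a placeholder; both tools accept a build-time --volume flag):

```shell
# podman and buildah accept --volume at build time (docker build does not),
# so the host's pip cache can be shared with every build:
podman build --volume "${HOME}/.cache/pip:/root/.cache/pip:rw" -t myimage .

# buildah equivalent:
buildah bud --volume "${HOME}/.cache/pip:/root/.cache/pip:rw" -t myimage .
```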

Security Implications

  • Any external dependency is a security risk
  • When software dependencies are not cached,
    the devops workflow cannot run while the external dependency is
    unavailable.
  • TUF (The Update Framework) may help mitigate cache-poisoning risks.
    PyPI and CNCF Notary implement cryptographic signatures with TUF:
    The Update Framework.

How to Teach This

  • A more detailed guide explaining how to do multi-stage builds that
    cache dependencies?
  • Update packaging.python.org?
  • Expand upon the instructions herein

Reference Implementation

  • Does anyone have examples of CI services that are doing this well
    / correctly? E.g. with proxy-caching on by default

Rejected Ideas

[Why certain ideas that were brought while discussing this PEP were not ultimately pursued.]

Open Issues

  • Request for guidelines, tools, and best practices.
  • Does this need to be a PEP?
    • No: It’s merely an informational advisory and a request
      for consideration of sustainable resource utilization practices.
    • Yes: It might as well be maintained as the document to be
      sent to CI services which are unnecessarily using significant
      amounts of bandwidth.

References

[A collection of URLs used as references through the PEP.]

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.

(I’ve also posted this draft to pypa-dev at https://groups.google.com/forum/#!topic/pypa-dev/Pdnoi8UeFZ8; where I didn’t need to replace links with [new user] )

  • Does any of the PyPI TUF work change how (CI Provider) mirrors should be operated?
  • What are the best practices for CI builds that depend upon PyPI?

Azure Artifacts has the ability to cache PyPI packages during Azure dev-ops builds. See the documentation

Another data point: looking at the statistics on PyPI downloads, it seems the AWS-related libraries botocore, s3transfer, and boto3 are among the top-downloaded libraries. Perhaps focusing on best practices for spinning up fully-provisioned cloud instances (I guess this is “D”?) might give an immense return on investment.

They are dependencies of the awscli; many people run pip install awscli.
But awscli v2 has been released, and it doesn’t use PyPI; it uses a bundled installer instead.
So I expect the download numbers for awscli will drop dramatically later this year.

In response to this, I created a simple package-index proxy server: https://github.com/EpicWink/proxpi/

The idea is to deploy this server on CI runners, then re-route PyPI (and other extra indices’) requests via this proxy server. It will reduce both index request strain and file download bandwidth. I’ve successfully tested it on our private GitLab CI runners, without having to change any of the project code or CI configuration.

Awesome!

Do you support details like the data-requires-python element from PEP 503? I know there are projects that have had issues as a result of other proxies not handling this, so it would be great to include it.

Yes. All anchor attributes (such as data-requires-python and the data-gpg-sig attribute) and the URL fragment (which is an optional hash) from the original indices are automatically passed through. I should make a note of this in the README.
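For concreteness, PEP 503 serves each file as an HTML anchor whose attributes carry that metadata. A minimal stdlib sketch of what a proxy must preserve (the package name, URL, and hash below are invented):

```python
from html.parser import HTMLParser

# A PEP 503 "simple" index entry as a mirror/proxy must serve it:
# the data-requires-python attribute and the #sha256=... URL fragment
# both have to survive the round trip.
PAGE = ('<a href="https://files.example/pkg-1.0.tar.gz#sha256=abc123"'
        ' data-requires-python="&gt;=3.6">pkg-1.0.tar.gz</a>')

class AnchorScanner(HTMLParser):
    """Collect the attributes of every <a> tag on an index page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs))

scanner = AnchorScanner()
scanner.feed(PAGE)
link = scanner.links[0]
print(link["data-requires-python"])  # prints ">=3.6" (the entity is unescaped)
```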
