New PyPI Statistics - Simple API Requests

(Donald Stufft) #1

As folks are likely aware, PyPI has been logging file downloads into BigQuery for quite some time now, which has enabled people to peek into how people are using/downloading packages from PyPI. Last night I turned on the switch to a new table in BigQuery, called the-psf.pypi.simple_requests, which is now logging all requests to /simple/{project}/ into BigQuery as well which should now be available for all to query.

This table largely mimics the existing table for downloads, just with a different event source, however it has a few important differences:

  • Instead of sharding data across multiple tables (and requiring wild cards or TABLE_DATE_RANGE to query the table, there is only a single table now, which is automatically sharded by the timestamp column. Queries will only query shards that are not excluded using a WHERE timestamp <filter> clause.
  • Due to the above, it would otherwise be easy to accidentally trigger a scan against all partitions (and thus get billed for querying all of the data, instead of a limited subset of it). This new table requires the use of filtering by timestamp in the WHERE clause to make it harder to get it “wrong”. The syntax is SQL so you can just do WHERE timestamp > TIMESTAMP("2019-01-01") if you truly want to scan the entire thing.
  • The file.* columns are gone (since there’s no file being downloaded!).
  • A top level projectcolumn has been added which contains the PEP 503 normalized name of the project whose page has been queried.
    • The BigQuery table has also been setup to cluster on the project column, which means it acts as a secondary partitioning key, so if you filter a query by project, your query will scan even less data.

Otherwise, the schema and the meaning of each column is the same as in the existing table.

Packaging and Python 2