As folks are likely aware, PyPI has been logging file downloads into BigQuery for quite some time now, which has enabled people to peek into how people are using/downloading packages from PyPI. Last night I turned on the switch to a new table in BigQuery, called the-psf.pypi.simple_requests
, which is now logging all requests to /simple/{project}/
into BigQuery as well which should now be available for all to query.
This table largely mimics the existing table for downloads, just with a different event source, however it has a few important differences:
- Instead of sharding data across multiple tables (and requiring wild cards or
TABLE_DATE_RANGE
to query the table, there is only a single table now, which is automatically sharded by thetimestamp
column. Queries will only query shards that are not excluded using aWHERE timestamp <filter>
clause. - Due to the above, it would otherwise be easy to accidentally trigger a scan against all partitions (and thus get billed for querying all of the data, instead of a limited subset of it). This new table requires the use of filtering by
timestamp
in theWHERE
clause to make it harder to get it “wrong”. The syntax is SQL so you can just doWHERE timestamp > TIMESTAMP("2019-01-01")
if you truly want to scan the entire thing. - The
file.*
columns are gone (since there’s no file being downloaded!). - A top level
project
column has been added which contains the PEP 503 normalized name of the project whose page has been queried.- The BigQuery table has also been setup to cluster on the
project
column, which means it acts as a secondary partitioning key, so if you filter a query byproject
, your query will scan even less data.
- The BigQuery table has also been setup to cluster on the
Otherwise, the schema and the meaning of each column is the same as in the existing table.