New PyPI Statistics - Simple API Requests

dstufft · February 6, 2019, 2:47pm

As folks are likely aware, PyPI has been logging file downloads into BigQuery for quite some time now, which has enabled people to peek into how people are using/downloading packages from PyPI. Last night I turned on the switch to a new table in BigQuery, called the-psf.pypi.simple_requests, which is now logging all requests to /simple/{project}/ into BigQuery as well which should now be available for all to query.

This table largely mimics the existing table for downloads, just with a different event source, however it has a few important differences:

Instead of sharding data across multiple tables (and requiring wild cards or TABLE_DATE_RANGE to query the table, there is only a single table now, which is automatically sharded by the timestamp column. Queries will only query shards that are not excluded using a WHERE timestamp <filter> clause.
Due to the above, it would otherwise be easy to accidentally trigger a scan against all partitions (and thus get billed for querying all of the data, instead of a limited subset of it). This new table requires the use of filtering by timestamp in the WHERE clause to make it harder to get it “wrong”. The syntax is SQL so you can just do WHERE timestamp > TIMESTAMP("2019-01-01") if you truly want to scan the entire thing.
The file.* columns are gone (since there’s no file being downloaded!).
A top level projectcolumn has been added which contains the PEP 503 normalized name of the project whose page has been queried.
- The BigQuery table has also been setup to cluster on the project column, which means it acts as a secondary partitioning key, so if you filter a query by project, your query will scan even less data.

Otherwise, the schema and the meaning of each column is the same as in the existing table.

cobradford · June 19, 2020, 6:02pm

Does the new simple_requests table create records upon additional events? I’m seeing nearly double the amount of records here than the previous multiple tables. I’m only filtering by date (table suffix in the previous data) and something like "WHERE project = 'foo' AND details.installer.name = 'pip' " OR "file.project = 'foo' AND details.installer.name = 'pip' ". Any insight as to why I’d see such large differences? There are clearly records in the new simple_request table that don’t seem present in the previous. What is the suggested approach to determine number of installs? Thanks!

dstufft · June 25, 2020, 4:33pm

The old table are people downloading files, this new table is people accessing /simple/{foo}/.

encukou · June 26, 2020, 11:39am

Differences I can think of:

caching clients like pip or devpi will (AFAIK) check /simple/{foo}/ for new versions, and not download the file if they have it already
clients that just gather data, but don’t install anything, might use /simple/{foo}/ but not download files