Getting the full list of published packages from pypi.org

I have noticed that the most recently added packages are browsable and searchable on pypi.org, and they appear under PyPI newest packages; however, they are not listed in the Simple index.

Is this working as expected?
Is there an alternative method to get an accurate list of all packages?

My understanding is that PyPI caches /simple/ for ~10 minutes. Any package published on PyPI is still accessible under /simple/<packagename> ~immediately.
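
As a rough sketch of the two endpoints mentioned above, the index and the per-project page can both be requested as JSON, assuming the PEP 691 content type is served. The helper names (`simple_url`, `fetch_simple`, `project_names`) are made up for illustration and are not part of any PyPI client library.

```python
import json
import re
import urllib.request

# PEP 691 content type for the JSON form of the Simple API.
SIMPLE_JSON = "application/vnd.pypi.simple.v1+json"


def normalize(name):
    """Normalize a project name following the PEP 503 rule."""
    return re.sub(r"[-_.]+", "-", name).lower()


def simple_url(project=None, base="https://pypi.org/simple/"):
    """Return the index URL, or the per-project URL when a name is given."""
    return base if project is None else f"{base}{normalize(project)}/"


def fetch_simple(project=None):
    """Fetch /simple/ or /simple/<project>/ as JSON (performs a network call)."""
    req = urllib.request.Request(simple_url(project), headers={"Accept": SIMPLE_JSON})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def project_names(index_doc):
    """Extract project names from a PEP 691 index document."""
    return [entry["name"] for entry in index_doc["projects"]]
```

Calling `fetch_simple("flask")` hits the per-project page, which is the one described as available almost immediately, while `project_names(fetch_simple())` reads the cached index.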

I can’t answer your question with any authority (I’m not an authority about anything, let alone PyPI), though I suspect the simple index is probably generated on a periodic basis rather than on-demand. Looking at the page source, I wonder if it would be possible to include a meta tag with the time the file was generated, something like:

<meta name="pypi:repository-version" content="1.0">
<meta name="pypi:generate-time" content="2023-01-30 12:00 UTC">
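
If such a tag existed, a consumer could read it with the standard library alone. This is only a sketch: `pypi:generate-time` is the hypothetical tag proposed above, not something the index is known to emit, and `MetaCollector` and `pypi_meta` are illustrative names.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="pypi:..."> tags into a dict of name -> content."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        if name.startswith("pypi:"):
            self.meta[name] = attr.get("content")


def pypi_meta(html_text):
    """Return every pypi:* meta tag found in an index page."""
    parser = MetaCollector()
    parser.feed(html_text)
    return parser.meta
```

A client could then compare the hypothetical `pypi_meta(page).get("pypi:generate-time")` against the current time to judge how stale the index is.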

It is; see “https://pypi.org/simple/ is no longer updating” (Issue #11935, pypi/warehouse on GitHub) and related issues.

It can be as much as 24 hours.

It depends. What is your use case?

I am creating a metadata cache for PyPI packages, and I am considering building a web frontend that provides certain capabilities not available on pypi.org, plus a CLI that searches for packages through an API served from this new cache.

A potential 24-hour lag in finding new or updated packages would greatly decrease the usefulness of such a tool.

Is there an API that returns the list of packages updated or added since serial X? That would allow incremental updates of the cache, reducing both traffic and resource consumption.

Depending on how you plan to use this, you might want to use the public BigQuery dataset for PyPI metadata. This is updated near-instantly.

This is what our mirrors use the XML-RPC API for. Note that while this says it’s deprecated, we don’t really have a replacement for it at this time.
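
For the incremental case, a minimal sketch using the XML-RPC method `changelog_since_serial`, which returns entries of the form (name, version, timestamp, action, serial). The helper names `pypi_client` and `changed_since` are assumptions for illustration, and this is not necessarily how the mirrors themselves are implemented.

```python
import xmlrpc.client


def pypi_client(url="https://pypi.org/pypi"):
    """Build an XML-RPC proxy; no request is sent until a method is called."""
    return xmlrpc.client.ServerProxy(url)


def changed_since(client, since_serial):
    """Return (changed project names, newest serial seen) after since_serial.

    Each changelog entry is (name, version, timestamp, action, serial).
    When nothing changed, the serial passed in is returned unchanged.
    """
    events = client.changelog_since_serial(since_serial)
    names = {event[0] for event in events}
    last_serial = max((event[4] for event in events), default=since_serial)
    return names, last_serial
```

A cache would store the returned serial and pass it back on the next run, for example `names, serial = changed_since(pypi_client(), serial)`, then refresh only the projects in `names`.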

There’s also sethmlarson/pypi-data on GitHub (data about packages and maintainers on PyPI) if you’re okay with slightly stale data.

The BigQuery dataset would force the use of GCP, which I would like to avoid for now.

I did not look into XML-RPC on the assumption that its removal was imminent; based on your description, I guess it is the best option for my use case.

Regarding the initial list retrieval, I see two options: /simple or the XML-RPC function list_packages_with_serial(). In case you have some insight on these questions:

  1. Any strong recommendation for one over the other?
  2. To determine the “last serial” from the package list, can I assume it is the max(serial) obtained from list_packages_with_serial()?
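
To make the second question concrete, here is a small sketch of what that assumption would look like in code. It takes the mapping of project name to serial returned by list_packages_with_serial(); whether the maximum is a safe starting point is exactly the open question, so treat it as an unconfirmed assumption. `snapshot_serial` and `stale_projects` are hypothetical helper names.

```python
def snapshot_serial(packages_with_serial):
    """Highest per-project serial in a {name: serial} mapping, or 0 if empty.

    Assumption (not confirmed in this thread): this value can serve as the
    starting serial for later incremental updates.
    """
    return max(packages_with_serial.values(), default=0)


def stale_projects(packages_with_serial, cached_serials):
    """Names whose serial differs from, or is missing in, the local cache."""
    return sorted(
        name
        for name, serial in packages_with_serial.items()
        if cached_serials.get(name) != serial
    )
```

With a snapshot such as `{"flask": 120, "requests": 98}`, `snapshot_serial` returns 120, and `stale_projects` lists the entries a local cache still needs to refresh.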

Thanks for the detailed feedback.

Thanks for sharing. I was thinking of building something similar but with more frequent (incremental) updates. I will get some inspiration from that pypi-data work.

Looking at pypi-data, I also found https://deps.dev/_/s/pypi/p/package-name/v/ which seems to provide most of the metadata that I need. I will try to reach Google to understand the terms and limits on usage of the API.

Thanks.

Since you have to sign up for a Google account, there are no real limits beyond what they give you for free and then paying for more. The terms of what you can do with the data would come from PyPI since it’s our data, and I don’t think there are any restrictions.

I was referring to the terms of use for https://deps.dev/_/s/pypi/p/flask/v which can be used anonymously (no Google account involved).

Regarding GCP, the concern is not the cost but the extra requirement of enabling and managing a third-party service for the simple purpose of obtaining (open) metadata.

Thanks

This is my team 🙂 Feel free to reach out for more information: dii@google.com