I have noticed that the most recently added packages are browsable and searchable on pypi.org, and they can be found on PyPI newest packages; however, they are not listed on the Simple index.
Is this working as expected?
Is there an alternative method to get an accurate list of all the packages?
My understanding is that PyPI caches /simple/ for ~10 minutes. Any package published on PyPI is still accessible under /simple/<packagename> ~immediately.
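To illustrate the per-project lookup mentioned above, here is a minimal sketch that builds the /simple/<packagename> URL and checks whether it responds. The helper names (`normalize`, `simple_url`, `project_is_published`) are made up for this example; the name normalization follows PEP 503, and the sketch assumes a plain GET returns 404 for a project that does not exist.

```python
import re
import urllib.error
import urllib.request

SIMPLE_ROOT = "https://pypi.org/simple/"


def normalize(name: str) -> str:
    """PEP 503 normalization: runs of '-', '_' and '.' become one hyphen, lowercased."""
    return re.sub(r"[-_.]+", "-", name).lower()


def simple_url(name: str) -> str:
    """Build the per-project Simple index URL (hypothetical helper)."""
    return f"{SIMPLE_ROOT}{normalize(name)}/"


def project_is_published(name: str, timeout: float = 10.0) -> bool:
    """Return True if the per-project page answers 200.

    Performs a network request; assumes a missing project yields HTTP 404.
    """
    try:
        with urllib.request.urlopen(simple_url(name), timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise
```

A cache could call `project_is_published("some-new-package")` for a name discovered elsewhere (for example, from the newest-packages feed) without waiting for the full /simple/ listing to refresh.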
I can’t answer your question with any authority (I’m not an authority about anything, let alone PyPI), though I suspect the simple index is probably generated on a periodic basis rather than on-demand. Looking at the page source, I wonder if it would be possible to include a meta tag with the time the file was generated, something like:
I am creating a metadata cache for PyPI packages, and I am considering building a web frontend that provides certain capabilities not available on pypi.org, along with a CLI that can search for packages using an API served from this new cache.
Having a potential 24h of lag for finding new or updated packages would greatly decrease the usefulness of such a tool.
Is there an API that returns the list of packages updated or added since serial X? This would allow incremental updates of the cache, which would reduce both traffic and resource consumption.
This is what our mirrors use the XML-RPC API for. Note that while this says it’s deprecated, we don’t really have a replacement for it at this time.
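As a rough sketch of what an incremental update over XML-RPC might look like: the code below assumes the `changelog_since_serial(serial)` method, which returns entries shaped like `(name, version, timestamp, action, serial)`. The helper names `summarize_changelog` and `fetch_changes` are invented for this example, and the endpoint URL is assumed to be `https://pypi.org/pypi`.

```python
import xmlrpc.client

PYPI_XMLRPC = "https://pypi.org/pypi"  # assumed XML-RPC endpoint


def summarize_changelog(entries, last_serial):
    """Reduce changelog entries to (changed project names, newest serial seen).

    Each entry is expected to be (name, version, timestamp, action, serial).
    If there are no entries, the serial is returned unchanged.
    """
    changed = set()
    newest = last_serial
    for name, _version, _timestamp, _action, serial in entries:
        changed.add(name)
        newest = max(newest, serial)
    return changed, newest


def fetch_changes(last_serial):
    """Ask PyPI for everything after last_serial (performs a network request)."""
    with xmlrpc.client.ServerProxy(PYPI_XMLRPC) as client:
        entries = client.changelog_since_serial(last_serial)
    return summarize_changelog(entries, last_serial)
```

The cache would store the returned serial and pass it to the next `fetch_changes` call, refreshing metadata only for the names in the returned set.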
The BigQuery dataset would force the use of GCP, which I would like to avoid for now.
I did not look into XML-RPC on the assumption that its removal was imminent. Based on your description, I guess it is the best option for my use case.
Regarding the initial list retrieval, I see two options: using /simple or the XML-RPC function list_packages_with_serial(). In case you have some insight, I have two questions:
Any strong recommendation for one over the other?
To determine the “last serial” from the package list, can I assume it is the max(serial) obtained from list_packages_with_serial()?
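For concreteness, the max(serial) idea could be sketched as below. This only illustrates the question, not a confirmed answer: it assumes `list_packages_with_serial()` returns a mapping of project name to that project's last serial, and `starting_serial` is a hypothetical helper. The XML-RPC API also offers `changelog_last_serial()`, which could be compared against this value.

```python
import xmlrpc.client

PYPI_XMLRPC = "https://pypi.org/pypi"  # assumed XML-RPC endpoint


def starting_serial(packages: dict[str, int]) -> int:
    """Return the highest per-project serial, or 0 for an empty listing."""
    return max(packages.values(), default=0)


def fetch_initial_state() -> tuple[dict[str, int], int]:
    """Download the full name -> serial mapping (performs a network request)."""
    with xmlrpc.client.ServerProxy(PYPI_XMLRPC) as client:
        packages = client.list_packages_with_serial()
    return packages, starting_serial(packages)
```

If the most recent event concerned a project that no longer appears in the listing, this value could be lower than the global last serial; starting an incremental sync from a lower serial should only mean replaying a few extra changelog entries.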
Thanks for sharing. I was thinking of building something similar, but with more frequent (incremental) updates. I will get some inspiration from that pypi-data work.
Looking at pypi-data, I also found https://deps.dev/_/s/pypi/p/package-name/v/ which seems to provide most of the metadata that I need. I will try to reach out to Google to understand the terms of use and limits for the API.
Since you have to sign up for a Google account, there are no real limits beyond what they give you for free and then paying for more. The terms of what you can do with the data would come from PyPI since it’s our data, and I don’t think there are any restrictions.
Regarding GCP, the concern is not the cost but the extra requirement of enabling and managing a third-party service for the simple purpose of obtaining (open) metadata.